The Issue
For the past few years, I’ve been using jemalloc to allocate memory in my
experiments, partly because of the usefulness of arena allocation. In using
jemalloc, when you allocate memory to an arena, an actual allocation may or
may not happen, depending on whether that arena requires more memory. Because
these arenas allocate memory in multiple-megabyte chunks (or extents, as
they’re called in jemalloc), this allocation does not necessarily happen each
time you allocate to the arena. Instead, jemalloc allocates an extent only
when needed, then holds onto that extent until the arena is freed or until some
other heuristic determines that it should be freed.
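To make the arena terminology concrete, here is a rough sketch of what allocating
to a specific arena looks like through jemalloc’s non-standard API; the sizes and
error handling are placeholders for illustration, not code from my runtime:

#include <stdio.h>
#include <jemalloc/jemalloc.h>

int main(void) {
    unsigned arena_ind;
    size_t sz = sizeof(arena_ind);

    /* Ask jemalloc for a fresh arena and get back its index */
    if (mallctl("arenas.create", &arena_ind, &sz, NULL, 0) != 0) {
        fprintf(stderr, "arenas.create failed\n");
        return 1;
    }

    /* Allocate 1 MB from that arena; jemalloc only maps a new extent
       if the arena doesn't already have enough free space on hand */
    void *p = mallocx(1UL << 20, MALLOCX_ARENA(arena_ind));
    if (p == NULL)
        return 1;

    dallocx(p, 0);
    return 0;
}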
However, for my research, I need to be able to bind all arena memory to a
specific NUMA node; that is, for every extent allocation, I need to call
mbind on that range of addresses. To do something like this, jemalloc
provides the extent_hook interface. Briefly, this feature allows you to
define a set of function pointers that will be used for all memory operations
in the arena: allocating, deallocating, committing, etc. Notably, we define
a function called sa_alloc, which allocates a new extent to an arena.
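To give a feel for that interface, here is a heavily trimmed sketch of installing a
custom allocation hook on an arena. The hook name (my_alloc), the plain anonymous
mmap, and the omission of the other hook entries and of alignment handling are my
simplifications for illustration, not the runtime’s actual code:

#include <stdbool.h>
#include <stdio.h>
#include <sys/mman.h>
#include <jemalloc/jemalloc.h>

/* Custom extent allocation hook (jemalloc 5.x signature). A real hook
   table would also supply dalloc, commit, purge, etc., and would honor
   the requested alignment. */
static void *my_alloc(extent_hooks_t *hooks, void *new_addr, size_t size,
                      size_t alignment, bool *zero, bool *commit,
                      unsigned arena_ind) {
    void *ret = mmap(new_addr, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (ret == MAP_FAILED)
        return NULL;
    /* ...this is where the new extent would be bound to a NUMA node... */
    *zero = true;
    *commit = true;
    return ret;
}

static extent_hooks_t my_hooks = { .alloc = my_alloc };

static void install_hooks(unsigned arena_ind) {
    char name[64];
    extent_hooks_t *hooksp = &my_hooks;

    /* Replace the arena's default extent hooks with ours */
    snprintf(name, sizeof(name), "arena.%u.extent_hooks", arena_ind);
    mallctl(name, NULL, NULL, (void *)&hooksp, sizeof(hooksp));
}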
sa_alloc accepts quite a few arguments, including the size of the extent that
it should allocate (size), and the number of bytes that it should align that
allocation by (alignment). The first argument is handled easily: simply call
mmap and pass it the correct size. The second argument is a little trickier,
but this is how the original code did it. I’ll remove the error handling and
other distracting elements. This code was also not originally written by me, so
I’ve added my own comments:
/* Do the initial allocation */
ret = mmap(new_addr, size, PROT_READ | PROT_WRITE, mmflags, sa->fd, sa->size);
/* Check if the allocation meets the alignment */
if (alignment == 0 || ((uintptr_t) ret)%alignment == 0) {
goto success;
}
/* If it's not aligned properly, unmap and try again with a larger size */
munmap(ret, size);
size += alignment;
ret = mmap(NULL, size, PROT_READ | PROT_WRITE, mmflags, sa->fd, sa->size);
/* Chop off and unmap the excess */
n = (uintptr_t) ret;
m = n + alignment - (n%alignment);
munmap(ret, m-n);
ret = (void *) m;
/* Finally, call 'mbind' on the new extent */
success:
mbind(ret, size, mpol, nodemaskp, maxnode, MPOL_MF_MOVE);
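The mmflags, sa, mpol, nodemaskp, and maxnode values above come from elsewhere in
the runtime and aren’t shown. For readers who haven’t used mbind, here is a minimal,
self-contained guess at what setting up those last three arguments can look like
when binding a range to a single node; the real runtime derives them from its own
configuration:

#include <stdio.h>
#include <numaif.h>   /* mbind, MPOL_BIND, MPOL_MF_MOVE; link with -lnuma */

/* Hypothetical helper: bind [addr, addr+len) to one NUMA node */
static int bind_to_node(void *addr, size_t len, int node) {
    unsigned long nodemask = 1UL << node;          /* one bit per node */
    unsigned long maxnode = 8 * sizeof(nodemask);  /* bits in the mask */

    if (mbind(addr, len, MPOL_BIND, &nodemask, maxnode, MPOL_MF_MOVE) < 0) {
        perror("mbind");  /* an EFAULT here prints as "Bad address" */
        return -1;
    }
    return 0;
}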
For several years, this worked for a variety of applications and never threw
any errors. However, only recently and in certain situations, it began
throwing extremely rare errors for me in one particular application: AMG.
Specifically, the final mbind call would return Bad address, the function
would fail, and the runtime library would fail to allocate to an arena. One
other thing that I should note is that while perror prints out Bad address,
it’s possible that the error is caused by some of the other arguments. Poking
through the do_mbind function in the Linux kernel, it seems as if there are a
great many other things that could cause that same error, EFAULT.
The Search
To begin with, I didn’t even consider the possibility that this code could be
flawed. After all, it succeeded for several years before this, and I’d never
seen it throw an error in the past. Nearly every day, I would run experiments
that required this block of code, yet only now was it failing. After quickly
printing out the arguments to mbind
(they looked entirely ordinary), I
started my search with what I considered to be more error-prone parts of the
codebase.
Being an error that happens every so often, my first thought was a race
condition: something, I figured, must be interfering with this allocation in
some way, perhaps changing some of the heap-allocated structures from which I
get some of the arguments to mbind. However, all of these variables are
protected by an arena-wide mutex, so it’s not possible for two threads to
allocate to a single arena simultaneously.
The second possibility that I considered was that one of the other threads in
the runtime library was causing it: sometimes, I might run the application with
a profiling thread or a jemalloc background thread disabled, and it would
succeed. Because of this, I theorized, it might be that thread in the runtime
library that was messing with the heap in some way. However, after running it
several more times, it would eventually fail even with those threads disabled.
The last thing that I considered was that something had gone awry with onlining
some of the memory nodes. As in my previous post, I had some difficulty with
some of the blocks of memory on my system being ZONE_MOVABLE. This being a
recent problem, coupled with the fact that the failure occurred more consistently
when memory spilled onto other NUMA nodes, convinced me that mbind wasn’t
able to allocate to certain regions of one of the NUMA nodes. That, however,
was quickly debunked by simply binding to different nodes, which resulted in
the same issue.
The Solution
Finally, I took a harder look at the alignment code: were there rare situations
in which it could fail, depending on some race condition? Printing out m, n,
ret, etc., I suddenly realized the issue: after failing to get the correct
alignment, the code adds alignment to the size. Then, after adjusting the
pointer and unmapping the excess, size remains the same. This then gets passed
to mbind, which expects size bytes of mapped memory starting at ret. However,
because part of this allocated block was subsequently unmapped, the size needs
to be updated to reflect that: with the old value, the range handed to mbind
extends past the end of the new mapping by exactly the number of bytes that
were trimmed off the front, and mbind returns EFAULT when the range contains
unmapped pages.
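The fix, then, amounts to one extra line in the realignment path. Here is a sketch
of it, reusing the same simplified variables as the excerpt above; the real code
may structure it differently:

/* Chop off and unmap the excess */
n = (uintptr_t) ret;
m = n + alignment - (n%alignment);
munmap(ret, m-n);
ret = (void *) m;
/* The usable mapping now starts at 'm' and is (m - n) bytes shorter than
   what the second mmap returned, so shrink 'size' to match; otherwise
   mbind is handed a range that runs past the end of the mapping */
size -= m - n;
/* Finally, call 'mbind' on the new extent */
success:
mbind(ret, size, mpol, nodemaskp, maxnode, MPOL_MF_MOVE);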
Now, why did it take so long for this issue to finally crop up? How could I run experiments for several years without experiencing this issue even once? I think that in order for this issue to actually occur, a perfect storm of conditions must be true:
- The application must request a large amount of memory aligned by a size larger than the page size (see the sketch after this list).
- The application must be extremely heavily threaded.
- The application must use lots of allocations and deallocations from those threads.
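For the first of those conditions, this is roughly what such a request looks like
through jemalloc’s non-standard API (the 64 MB size and 2 MB alignment here are just
example values, not ones taken from AMG):

#include <jemalloc/jemalloc.h>

/* Example of a request that forces a large alignment; only alignments
   greater than the page size ever reach the realignment path above */
void *alloc_aligned_example(void) {
    return mallocx(64UL << 20, MALLOCX_ALIGN(2UL << 20));
}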
The vast majority of applications that I run don’t necessarily need aligned
memory, or only require it to be aligned by pages (which mmap gives you anyway),
so that code very rarely even runs. When it does run, however, I suspect that the
majority of the time, the munmap might not actually do anything: that is,
mmap will give more memory than is required, or the small size of the
allocation means that the munmap doesn’t see a reason to actually deallocate
a particular set of pages. When it does, though, it’s possible that the
application itself still holds the memory that was originally allocated by the
second mmap, and thus mbind won’t have any trouble binding that memory to
another node. I think that, in order for the issue to crop up, an application must
have a large number of threads constantly allocating and deallocating from
various arenas, causing some fragmentation in the heap and eventually causing
holes which have a chance of not being filled by allocations from other threads
in the application.