In my day job at Anthropic, we run relatively large distributed systems to train large language models. One of the joys of using a lot of computing resources, especially on somewhat niche software stacks, is that you spend a lot of time running into the long-tail of bugs which only happen rarely or in very unusual configurations, which you happen to be the first to encounter.
These bugs are frustrating, but I also often enjoy them. They’re the opportunity for a good murder mystery, and they make for great stories if you ever track them down. This is the story of a bug we recently put to rest; a bug which, from our very first inklings of its existence, I repeatedly described as “cursed.” Now that we understand it, I want to share just how delightfully cursed it was.
Usually I like to tell these debugging stories in the “murder mystery” style, where I walk through the story of identifying and debugging the bug. In this case, I think the bug is more interesting than the debugging, so I want to focus in on the bug itself.
Why was this bug “cursed”? 🔗︎
To begin with: Why do I call this bug “cursed”? I use that term as shorthand to refer to some combination of the following facts about this bug:
- It manifested as “this should never happen” failures extremely far-removed from the underlying bug — both in terms of happening long after the buggy code had executed, and in terms of the manifesting showing up many layers of abstraction higher in the stack than the cause.
- It manifested nondeterministically, and rarely – it manifested symptomatically only in around 0.1% of affected processes.
- It only manifested at all on our very largest distributed jobs, resulting in a very slow (and expensive!) debugging cycle.
- The bug (we eventually discovered) originated in the RDMA software stack we use. RDMA is always a bit spooky — the whole point being to allow (certain) memory to be transparently manipulated from a remote host without local intervention. And the RDMA software stack is complex and many-layered and has, in my experience, a checkered reputation for quality, at best.
The bug 🔗︎
The bug manifested as follows: In our very largest distributed training jobs, sometimes — on less than 1% of nodes — attempts to launch subprocesses would fail in very strange ways. These failures sometimes manifested as Python exceptions, something like so:
… File "utils/command.py", line 89, in call proc = subprocess.run( File "lib/python3.8/subprocess.py", line 493, in run with Popen(*popenargs, **kwargs) as process: File "lib/python3.8/subprocess.py", line 858, in __init__ self._execute_child(args, executable, preexec_fn, close_fds, File "lib/python3.8/subprocess.py", line 1704, in _execute_child raise child_exception_type(errno_num, err_msg, err_filename) OSError: [Errno 14] Bad address: 's5cmd'
And sometimes as segmentation faults coming from the child process in between
exec, with stack traces something like:
#0 __malloc_fork_unlock_child () at arena.c:193 #1 0x00007fe2a996fab5 in __libc_fork () at ../sysdeps/nptl/fork.c:188 #2 0x00007fe2aa6e3941 in subprocess_fork_exec (self=<optimized out>, args=<optimized out>) at /usr/local/src/conda/python-3.8.10/Modules/_posixsubprocess.c:693 #3 0x0000559a764598cf in cfunction_call_varargs (kwargs=<optimized out>, args=<optimized out>, func=0x7fe2a9704810) at /home/builder/ktietz/cos6/ci_cos6/python_1622837047642/work/Objects/call.c:758 #4 PyCFunction_Call () at /home/builder/ktietz/cos6/ci_cos6/python_1622837047642/work/Objects/call.c:773 #5 0x0000559a7641ada8 in _PyObject_MakeTpCall.localalias.6 () at /home/builder/ktietz/cos6/ci_cos6/python_1622837047642/work/Objects/call.c:159 #6 0x0000559a764796f8 in _PyObject_Vectorcall (kwnames=0x0, nargsf=<optimized out>, args=0x559a773b7ff8, callable=<optimized out>) at /home/builder/ktietz/cos6/ci_cos6/python_1622837047642/work/Include/cpython/abstract.h:125
In either case, upon further inspection, we discovered that seemingly-random individual pages of memory were just missing from the heap in the
forked child! The bad address was a perfectly-reasonable pointer, and the next and previous pages of memory existed, but one page was just … gone.
[Errno 14] Bad address 🔗︎
The errno (“error number”) 14 in that first stack trace is
EFAULT. The Linux kernel returns
EFAULT if you pass an invalid userspace address to a system call. In that way, it can sometimes indicate a similar failure mode as a segmentation fault, except that where a segmentation fault arises from trying to directly read or write to a bad pointer,
EFAULT arises from trying to pass a similarly bad pointer to the OS kernel.
What’s the deal with
__malloc_fork_unlock_child? As mentioned above,
fork() creates a new process with an exact copy of the memory image of the calling process. The
fork system call predates multithreading in UNIX, and the introduction of multithreading posed some nasty challenges.
If a process is multi-threaded, all threads share the same memory. If one thread makes a copy of the entire memory image using
fork, it has no way — in general — of making any guarantees about what the other threads are doing at the moment. They could be in the middle of mutating any data structures, or could be holding locks, which will then forever be held in the forked copy.
One of the most core bits of global shared state in most Linux processes is the memory allocator, generally referred to by the name of its main entry point —
malloc. If we were to
fork while another thread held a
malloc-related lock, there would be no safe way to ever perform memory allocation again in the child. Thus, the C library, in its userspace wrapper around
fork, walks the
malloc heap metadata and (approximately speaking) takes out locks on every single bit of shared data. Then, after the
fork, both the parent and the child release all of those locks, guaranteeing that the child sees a consistent snapshot of the allocator metadata.
Among other side effects, this means that
fork, of necessity, looks at the metadata header for every region of memory (called “arenas” in glib) managed by
malloc. When the child segfaulted in
__malloc_fork_unlock_child, it was because the page of memory containing one of those headers had vanished in the child.
If you’re not familiar with how creating subprocesses works on UNIX, it’s an odd two-part dance under the hood. The only1 way to create a new process on Linux is with the
fork system call, which creates a (near-)identical copy of the current process, including its entire memory image (
fork uses virtual memory tricks and copy-on-write to make this reasonably efficient). If you want to start an external process, you then arrange for the child to call
execve (colloquially referred to as just
exec), which then replaces the entire memory and state of the child process with the new binary loaded from disk.
This two-step results in a (brief) window where the child process is running code in between
fork returning and
exec nuking the entire state. This window is fraught in a number of ways, and it is during this window that we were seeing these failures arise.
Why were the pages gone? 🔗︎
These pages were missing because of an interaction between aws-ofi-nccl, an AWS plugin which lets NVIDIA’s NCCL GPU communication library use Amazon’s EFA for high-performance communication between EC2 instances, and the
libiverbs userspace library for RDMA on Linux.
libiverbs marks any memory segments which have been registered for use with RDMA using
madvise(…, MADV_DONTFORK), to ask the kernel not to copy them into the child on
aws-ofi-nccl registered a 4-byte region in the middle of a
malloc ed buffer as an RDMA target. Instead of rejecting the request (since there is no way to manipulate
madvise flags at a granularity of smaller than a page),
libiverbs would “helpfully” round up to the nearest full page, which resulted in an effectively random page vanishing from the child’s heap for each NCCL communicator! Most of the time, these pages would go unused in between
exec and so this behavior would go unnoticed; but sometimes they would contain part of glibc’s
malloc metadata, or would contain string data used by Python between the
exec, manifesting in the two failure patterns described above!
In the process of debugging the issue, we were able to catch a “live” process in the broken state (where
forking would fail). By looking at the address of the missing malloc arena and inspecting the parent’s address space in
/proc/pid/smaps, we could observe the problematic state:
7f5e20000000-7f5e20001000 rw-p 00000000 00:00 0 Size: 4 kB KernelPageSize: 4 kB MMUPageSize: 4 kB Rss: 4 kB Pss: 4 kB Shared_Clean: 0 kB Shared_Dirty: 0 kB Private_Clean: 0 kB Private_Dirty: 4 kB Referenced: 4 kB Anonymous: 4 kB LazyFree: 0 kB AnonHugePages: 0 kB ShmemPmdMapped: 0 kB FilePmdMapped: 0 kB Shared_Hugetlb: 0 kB Private_Hugetlb: 0 kB Swap: 0 kB SwapPss: 0 kB Locked: 0 kB THPeligible: 0 ProtectionKey: 0 VmFlags: rd wr mr mw me dc nr sd
Looking at this mapping, we notice two salient facts:
- This is a 1-page (4k) range of memory. A page is the smallest region of memory that can be managed independently by the hardware virtual memory system. Most of the surrounding regions were much larger than a single page, because memory tends to get managed in much larger chunks on today’s large systems. The
madvisecall forced the kernel to split off a one-page region to track the effects of the
VmFlagsfield includes the
dcflag. Searching for
man 5 proc, we find that
dcmeans “do not copy area on fork” — the internal flag set by
Aside: Why does
libiverbs exists to interact with RDMA hardware. RDMA hardware, like most forms of DMA, accesses memory in terms of physical memory addresses, unlike the virtual addresses code running on the CPU uses. This means, in order for things to go well, that any page of memory involved in RDMA needs to be pinned to a specific location in physical memory as long as it might be written to by a remote RDMA peer.
RDMA drivers “pin” memory pages, which prevents them from being relocated or swapped by the kernel. However,
fork calls are now problematic. Following a
fork, the kernel usually marks both processes’ entire address space as read-only; on an attempt to write to memory, it copies the underlying page so that neither process sees any modifications by the other.
Specifically, the first process to attempt to write to a copy-on-write (“CoW”) page will make a copy, leaving the original page (still visible to one or more other processes) intact.
This (potentially) interacts poorly with RDMA; If a process using RDMA
forks, any future attempts to write to pages used for RDMA would result in the parent copying the page, which would mean it would no longer see any remote writes to that page by RDMA peers.
Historically, to avoid this problem, the RDMA user-space would handle this case by explicitly requesting that any pages which are RDMA targets not get copied into the child at all. This ensures the parent process continues functioning properly, and (assuming the child then
execves to replace its address space) should normally be harmless in the child.
The future: eager copying 🔗︎
While doing this writeup, I discovered that the previous section is no longer accurate for recent kernels! As of this commit (released as part of Linux 5.12), Linux will eagerly copy pages which are marked as pinned on a
fork call. This has the result that the parent process gets to hold on to the same physical page, but the child gets a copy and has its address space intact.
rdma-core, as of version v35, has support for checking whether or not a kernel is sufficiently new to support this feature. Confusingly to me, this check takes the form of a new API call —
ibv_is_fork_initialized — and the existing call to request “fork support”,
ibv_fork_init, was unmodified. Thus, on newer kernels, callers are able to detect that
MADV_DONTFORK is unneeded and avoid enabling; but if someone explicitly enables, including by setting the
RDMAV_FORK_SAFE environment variable (as we were), they will get the legacy (and buggy)
MADV_DONTFORK behavior even if it is unneeded.
It turns out our kernels were too old to have this support, but even if they weren’t, the fact that our startup scripts set
RDMAV_FORK_SAFE (presumably blindly copy-pasted on advice of some historical reference), it would not have shielded us from this bug.
The fix 🔗︎
The workaround 🔗︎
Once we realized the problem was related to
fork, one natural workaround was to attempt spawning new processes via
posix_spawn, instead of
posix_spawn is a (relatively) new API that attempts to encapsulate the creation of a subprocess into a single API call, instead of the
exec combination traditionally used. Under the hood,
posix_spawn is implemented in most
libcs I know of using the
vfork system call.
vfork is a call similar to
fork, but instead of actually copying memory, it pauses the parent until the child calls
execve or exits. This creates the risk of memory corruption if the child attempts to write to essentially any memory whatsoever; but if the child is properly-implemented,
exec can be substantially more performant that the traditional
We were launching subprocesses using Python’s
subprocess module. As of Python 3.8 (which we were running),
posix_spawn, so we were actually a little bit surprised to discover it wasn’t being used. It turns out that
subprocess only hits the
posix_spawn path if a long list of conditions holds on process creation. Notably, among other conditions, it requires that
subprocess.Popen be given an absolute path to the program to execute, and that the
close_fds flag be set to false. Since
close_fds=True by default, virtually every Python program in existence is skipping the
Once we realized this (by reading
subprocess.py — we found no documentation of this gotcha), we were able to modify our subprocess creation code to meet those conditions, and the bug stopped manifesting! We were able to deploy this workaround several days before we actually identified the root bug, and resume training stably.
The actual fix 🔗︎
Once we had identified the fix was related to
MADV_DONTFORK, our AWS support contacts were able to (impressively quickly!) identify the troublesome registration in
aws-ofi-nccl, and fairly quickly pushed a fix that resolved the problem.
I also reported a bug to the rdma-core project asking that they error (or warn) on these registrations instead of blithely trashing the processes’ address space, but I received a (frankly) kind of baffling response. In light of my note above about eager copying in
fork, I think they may be referring to the fact that
ibv_fork_init is no longer needed on recent systems, but in light of the fact that that mode is still supported and is fairly easy to unintentionally enable, it still seems clear to me that there’s a bug worth fixing.
Closing Thoughts 🔗︎
I don’t feel like I have any great takeaways here. This was a very satisfying bug to finally track down and fix; it was the kind of bug that can linger, rarely triggered, never diagnosed, for years, worked around or tolerated, but never debugged. We got lucky, in a way, that we were able to track it down; a completely unrelated update in a library we used caused us to start shelling out early on in every training process’ lifetime, which was sufficient to increase the rate of incidence of this bug high enough that it became unbearable; but that also meant it was high enough to be reproducible and debuggable. For months prior to that we’d just tolerated occasional incidences, working around with retry loops at various levels.
The experience also feels like a vindication of a sorts for my conviction that computers can be understood. When you start dealing with large distributed systems, you see a lot of intermittent weird bugs, and it’s basically inevitable that you’ll never track down or diagnose all of them. But it turns out that sometimes, with a bit of luck and persistence, even the hairiest ones can sometimes be tracked down.
This sentence isn’t literally true — there are some variants on
fork, such as
clone— but it’s true enough in the sense that all the variants are spiritually very similar. ↩︎