6.033, MIT's class on computer systems, has as one of its catchphrases, "Complex systems fail for complex reasons". As a class about designing and building complex systems, it's a reminder that failure modes are subtle and often involve strange interactions between multiple parts of a system. In my own experience, I've concluded that they're often wrong. I like to say that complex systems don't usually fail for complex reasons, but for the simplest, dumbest possible reasons – there are just more available dumb reasons. But sometimes, complex systems do fail for complex reasons, and tracking them down does require understanding across many of the different layers of abstraction we've built up. This is a story of such a bug.
The following code snippet in Python is intended to extract and return
a single file from a tarball. It probably should be using
tarfile, but let's ignore that for the moment.
1 2 3 4 5 6 7 8 9
This code has a bug. It will often work just fine, but occasionally it
will fail, with 'err' containing a message including
Broken pipe. If, however you were to write the equivalent code in a
you would find that it never fails in this way. So what's going on?
When we launch
tar with the
-z option, it creates a
forks off a
gzip process writing into one end of the pipe. It then
reads the uncompressed tarball from the other end of the pipe, parsing
tar headers, until eventually it's done, and it closes the read
end of the pipe.
Since we've given
tar a specific path to extract, it doesn't need to
read the entire tarball – only up until it can find that file. So it
may close the pipe before reading the entire file, and,
gzip is done writing to it. As explained in
If all file descriptors referring to the read end of a pipe have been closed, then a write(2) will cause a SIGPIPE signal to be generated for the calling process. If the calling process is ignoring this signal, then write(2) fails with the error EPIPE.
Under normal circumstances,
gzip expects that whoever is downstream
of it may only care about a prefix of the uncompressed stream, and so
it registers a
SIGPIPE handler which exits cleanly.
Python, however, doesn't want to get
SIGPIPEs. Instead, Python would
rather just check the return value of every
write call it makes, and
IOError if necessary, so that Python code gets the error in
an appropriately Pythonic way, instead of through an asynchronous
signal. And so, at startup, Python uses
SIGPIPE by setting it to
As explained in
A child created via fork(2) inherits a copy of its parent's signal dispositions. During an execve(2), the dispositions of handled signals are reset to the default; the dispositions of ignored signals are left unchanged.
And so, when started from Python,
gzip starts up with
SIGPIPE ignored. And, for reasons I don't understand, rather than
gzip first checks whether or
not it's ignored, and only installs a handler if the signal is not
SIGPIPE continues to be ignored, which means that
EPIPE, which gzip sees is nonzero, calls
on, and then exits. tar's
wait then sees
gzip exit uncleanly,
which causes tar itself to exit uncleanly, which Python then raises as
There's an easy workaround, which is to re-enable SIGPIPE in the
1 2 3 4 5
But who would think of doing that, without first having seen this horribly subtle chain of bug, and having to track down what went wrong?
(Here's the Python bug report, and many thanks to cjwatson for posting his discovery of this class of bug on his blog, which greatly reduced the amount of time I would have had to spend tracking this down)