Dan Rosenberg recently released a privilege escalation exploit for Linux, based on three different kernel vulnerabilities I reported recently. This post is about CVE-2010-4258, the most interesting of them and, as Dan writes, the reason he wrote the exploit in the first place. In it, I’m going to take a brief tour of the various kernel features that collided to make this bug possible, and explain how they combine to turn an otherwise-boring oops into privilege escalation.
access_ok
When a user application passes a pointer to the kernel, and the kernel wants to read from or write to that pointer, the kernel needs to check that a buggy or malicious userspace app hasn’t passed an “evil” pointer.
Because the kernel and userspace run in the same address space, the most important check is simply that the pointer points into the “userspace” part of the address space. User applications are protected by page table permissions from writing into kernel memory, but the kernel isn’t, and so must explicitly check that any pointers given to it by a user don’t point into the kernel region.
The address space is laid out such that user applications get the bottom portion, and the kernel gets the top, so this check is a simple comparison against that boundary. The kernel function that performs this check is called access_ok, although there are various other functions that do the same check, implicitly or otherwise.
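The boundary comparison described above can be sketched in plain userspace C. This is only an illustrative model, not the kernel's implementation: MODEL_USER_LIMIT, model_access_ok, and model_demo are hypothetical names, and the boundary value is just an example of a bottom/top split.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical boundary for illustration: "user" addresses lie below it. */
#define MODEL_USER_LIMIT UINT64_C(0xC0000000)

/* True when the whole range [addr, addr + size) lies below the boundary.
 * Phrased as a subtraction so that addr + size cannot overflow. */
static bool model_access_ok(uint64_t addr, uint64_t size)
{
    return size <= MODEL_USER_LIMIT && addr <= MODEL_USER_LIMIT - size;
}

/* Example checks: a low pointer passes, anything reaching the top part fails. */
static void model_demo(void)
{
    assert(model_access_ok(UINT64_C(0x08048000), 4096));
    assert(!model_access_ok(UINT64_C(0xC0100000), 4));   /* "kernel" region */
    assert(!model_access_ok(UINT64_C(0xBFFFFFFF), 2));   /* straddles the split */
}
```

The point of the sketch is that the check looks only at the numeric range of the pointer, which is why changing the boundary changes what counts as a valid "user" pointer.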
get_fs() and set_fs()
Occasionally, however, the kernel finds it useful to change the rules for what access_ok will allow. set_fs() [1] is an internal Linux function that is used to override the definition of the user/kernel split for the current process. After a set_fs(KERNEL_DS), no check is performed that user pointers point to userspace: access_ok will always return true. set_fs(KERNEL_DS) is mainly used to enable the kernel to wrap functions that expect user pointers, by passing them pointers into the kernel address space. A typical use reads something like this:
old_fs = get_fs(); set_fs(KERNEL_DS);
vfs_readv(file, kernel_buffer, len, &pos);
set_fs(old_fs);
vfs_readv expects a user-provided pointer, so without the set_fs(), the access_ok() inside vfs_readv() would fail on our kernel buffer. We use set_fs() to temporarily disable that check.
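To make the save/override/restore pattern concrete, here is a small userspace model. It is a sketch under stated assumptions rather than kernel code: struct model_task, the MODEL_* constants, and every model_* function are hypothetical stand-ins, and the limit is modelled as a single per-task number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_USER_DS   UINT64_C(0xC0000000)  /* hypothetical user/kernel split */
#define MODEL_KERNEL_DS UINT64_MAX            /* "no limit": every address passes */

struct model_task { uint64_t addr_limit; };   /* the value set_fs() would change */

static uint64_t model_get_fs(const struct model_task *t) { return t->addr_limit; }
static void model_set_fs(struct model_task *t, uint64_t limit) { t->addr_limit = limit; }

static bool model_access_ok(const struct model_task *t, uint64_t addr, uint64_t size)
{
    return size <= t->addr_limit && addr <= t->addr_limit - size;
}

/* Stand-in for a function that expects a user pointer; -14 mirrors -EFAULT. */
static int model_read_into(const struct model_task *t, uint64_t dst)
{
    return model_access_ok(t, dst, 8) ? 0 : -14;
}

/* The wrapping pattern: save the limit, lift it, call, then restore. */
static int model_kernel_read(struct model_task *t, uint64_t kernel_dst)
{
    uint64_t old_fs = model_get_fs(t);
    model_set_fs(t, MODEL_KERNEL_DS);
    int ret = model_read_into(t, kernel_dst);
    model_set_fs(t, old_fs);
    return ret;
}

static void model_demo(void)
{
    struct model_task t = { MODEL_USER_DS };
    uint64_t kbuf = UINT64_C(0xFFFFFFFF81000000);   /* made-up "kernel" address */
    assert(model_read_into(&t, kbuf) == -14);       /* rejected without override */
    assert(model_kernel_read(&t, kbuf) == 0);       /* accepted under override */
    assert(model_get_fs(&t) == MODEL_USER_DS);      /* limit restored afterwards */
}
```

Note that the restore only happens if control returns normally from the wrapped call.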
Kernel oopses
When the kernel oopses, perhaps because of a NULL pointer dereference in kernelspace, or because of a call to the BUG() macro to indicate an assertion failure, the kernel attempts to clean up, and then kills the current process by calling do_exit().
When the kernel does so, it’s still running in the same process context it was in before the oops occurred, including any set_fs() override, if applicable. This means that do_exit will get called with access_ok disabled, which is not something anyone expected when they wrote the individual pieces of this system.
clear_child_tid
As it turns out, do_exit contains a write to a user-controlled address that expects access_ok to be working properly!
clear_child_tid is a feature where, on thread exit, the kernel can be made to write a zero into a specified address in that thread’s address space, in order to notify other threads of that exit.
This is implemented by simply storing a pointer to the to-be-zeroed address inside struct task_struct (which represents a single thread or process), and, on exit, mm_release, called from do_exit, does:
put_user(0, tsk->clear_child_tid);
This is normally safe, because put_user checks that its second argument falls into the “userspace” segment before doing a write. But, if we are running with get_fs() == KERNEL_DS, it will happily accept any address at all, even one pointing into kernel space.
So, if we find any kernel BUG() or NULL dereference, or other page fault, that we can trigger after a set_fs(KERNEL_DS), we can trick the kernel into a user-controlled write into kernel memory!
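The chain of events above can be modelled in a few lines of userspace C. This is a toy simulation, not kernel code: memory is an array of 32 slots with the lower half playing "userspace", and all model_* names and MODEL_* constants are hypothetical. The oops itself is represented only by the fact that the restore of the limit never runs before the exit path.

```c
#include <assert.h>

enum {
    MODEL_MEM       = 32,        /* toy address space: slots 0..31 */
    MODEL_USER_DS   = 16,        /* slots below 16 are "userspace" */
    MODEL_KERNEL_DS = MODEL_MEM  /* override: every slot passes the check */
};

struct model_task {
    unsigned addr_limit;         /* what set_fs() would change */
    unsigned clear_child_tid;    /* slot to zero on exit; 0 means unset */
};

static int model_memory[MODEL_MEM];

/* Stand-in for put_user(): refuses slots at or above the current limit. */
static int model_put_user(const struct model_task *t, int val, unsigned addr)
{
    if (addr >= t->addr_limit)
        return -14;              /* mirrors -EFAULT */
    model_memory[addr] = val;
    return 0;
}

/* Stand-in for the exit path: zero the registered slot, if any. */
static void model_do_exit(struct model_task *t)
{
    if (t->clear_child_tid) {
        model_put_user(t, 0, t->clear_child_tid);
        t->clear_child_tid = 0;
    }
}

/* A handler that "oopses" after the override: the restore never runs,
 * and the exit path executes with the limit still lifted. */
static void model_oops_under_kernel_ds(struct model_task *t)
{
    t->addr_limit = MODEL_KERNEL_DS;
    /* ... the faulting code would be here ... */
    model_do_exit(t);
}

static void model_demo(void)
{
    model_memory[20] = 1234;                     /* a "kernel" slot */

    struct model_task ok = { MODEL_USER_DS, 20 };
    model_do_exit(&ok);                          /* normal exit: write refused */
    assert(model_memory[20] == 1234);

    struct model_task bad = { MODEL_USER_DS, 20 };
    model_oops_under_kernel_ds(&bad);            /* exit under override */
    assert(model_memory[20] == 0);               /* "kernel" slot was zeroed */
}
```

In the model, the attacker's only inputs are the registered slot and the choice of which code path to crash in.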
splice() et al.
An obvious question at this point is: How much of the kernel can an attacker cause to run with get_fs() == KERNEL_DS?
There are a number of small special cases. For example, the binary sysctl compatibility code works by calling the normal /proc/ write handlers from kernelspace, under set_fs(). A handful of compat-mode (32 on 64) syscalls work similarly.
By far the biggest source I’ve found, however, is the splice() system call. The splice() system call is a relatively recent addition to Linux, and allows for zero-copy transfer of pages between a pipe and another file descriptor.
As of 2.6.31, attempts to splice() to or from an fd that doesn’t support special handling to actually do zero-copy splice will fall back on doing an ordinary read(), write(), or sendmsg() on the fd … from the kernel, using set_fs() in order to pass in kernel buffers.
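The fallback behaviour just described can be sketched as a dispatch on whether a file type provides its own splice handler. This is a simplified userspace model, not the kernel's splice code: struct model_file_ops, model_splice_to, and the other model_* names are hypothetical, and the handlers here do nothing except record the limit they ran under.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_USER_DS   UINT64_C(0xC0000000)  /* hypothetical user/kernel split */
#define MODEL_KERNEL_DS UINT64_MAX            /* override: no limit */

struct model_task { uint64_t addr_limit; };

struct model_file_ops {
    int (*splice_write)(struct model_task *t);  /* NULL when unsupported */
    int (*sendmsg)(struct model_task *t);       /* ordinary handler */
};

static uint64_t model_limit_seen;  /* limit in force when a handler ran */

static int model_record_limit(struct model_task *t)
{
    model_limit_seen = t->addr_limit;
    return 0;
}

/* Use the dedicated handler if there is one; otherwise call the ordinary
 * handler from "kernel" context with the limit lifted, then restore it. */
static int model_splice_to(struct model_task *t, const struct model_file_ops *ops)
{
    if (ops->splice_write)
        return ops->splice_write(t);

    uint64_t old_fs = t->addr_limit;
    t->addr_limit = MODEL_KERNEL_DS;
    int ret = ops->sendmsg(t);
    t->addr_limit = old_fs;
    return ret;
}

static void model_demo(void)
{
    struct model_task t = { MODEL_USER_DS };

    struct model_file_ops native = { model_record_limit, NULL };
    assert(model_splice_to(&t, &native) == 0);
    assert(model_limit_seen == MODEL_USER_DS);    /* no override needed */

    struct model_file_ops obscure = { NULL, model_record_limit };
    assert(model_splice_to(&t, &obscure) == 0);
    assert(model_limit_seen == MODEL_KERNEL_DS);  /* handler ran under override */
    assert(t.addr_limit == MODEL_USER_DS);        /* restored on normal return */
}
```

In this model, any file type without a dedicated handler has its ordinary handler run with the check lifted.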
What that means is that by using splice(), an attacker can call the bulk of the code in most obscure filesystems and socket types (which tend not to have explicit splice() support) with a segment override in place. Conveniently for an attacker, that is also exactly a description of where the bulk of the random security bugs tend to be.
This is also exactly the technique Dan’s exploit uses. He uses CVE-2010-3849, an otherwise boring NULL pointer dereference I reported in the Econet network protocol. His exploit code does a splice() to an econet socket, causing the econet_sendmsg handler to get called under set_fs(KERNEL_DS). When it oopses, do_exit is called, and he gets a user-controlled write into kernel memory. Everything else is just details.
Footnotes:
1 Back in Linux 1.x, this function actually set the %fs register on i386. It hasn't in years, but it's used in too many places for changing the name to be worth it.