Dan Rosenberg recently released a privilege escalation exploit for Linux, based on three different kernel vulnerabilities I reported recently. This post is about CVE-2010-4258, the most interesting of them, and, as Dan writes, the reason he wrote the exploit in the first place. In it, I’m going to do a brief tour of the various kernel features that collided to make this bug possible, and explain how they combine to turn an otherwise-boring oops into privilege escalation.
When a user application passes a pointer to the kernel, and the kernel wants to read from or write to that pointer, the kernel needs to check that a buggy or malicious userspace app hasn’t passed an “evil” pointer.
Because the kernel and userspace run in the same address space, the most important check is simply that the pointer points into the “userspace” part of the address space. User applications are protected by page table permissions from writing into kernel memory, but the kernel isn’t, and so must explicitly check that any pointers given to it by a user don’t point into the kernel region.
The address space is laid out such that user applications get the bottom portion, and the kernel gets the top, so this check is a simple comparison against that boundary. The kernel function that performs this check is called access_ok, although there are various other functions that do the same check, implicitly or otherwise.
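As a rough illustration of the comparison described above, here is a minimal userspace sketch in C. It is not the kernel's implementation: `range_ok` and `TOY_USER_LIMIT` are hypothetical names, and the boundary value assumes the common 3 GB/1 GB split on 32-bit x86.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical boundary: the first address that belongs to the kernel.
 * 0xC0000000 assumes the common 3G/1G split on 32-bit x86. */
#define TOY_USER_LIMIT UINT64_C(0xC0000000)

static_assert(TOY_USER_LIMIT < UINT64_MAX,
              "the model needs some room above the boundary");

/* Return true if the range [addr, addr + size) lies entirely below `limit`.
 * The comparison is arranged so that addr + size is never computed and
 * therefore cannot wrap around. */
bool range_ok(uint64_t addr, uint64_t size, uint64_t limit)
{
    return size <= limit && addr <= limit - size;
}
```

With this model, `range_ok(0x08048000, 4096, TOY_USER_LIMIT)` accepts a typical user address, while `range_ok(0xC0100000, 4, TOY_USER_LIMIT)` rejects a pointer into the upper region, as does a range that starts below the boundary but crosses it.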
Occasionally, however, the kernel finds it useful to change the rules for what access_ok will allow. set_fs() [1] is an internal Linux function that is used to override the definition of the user/kernel split, for the current process. After a set_fs(KERNEL_DS), no checking is performed that user pointers point to userspace: access_ok will always return true.
set_fs(KERNEL_DS) is mainly used to enable the kernel to wrap functions that expect user pointers, by passing them pointers into the kernel address space. A typical use reads something like this:
    old_fs = get_fs();
    set_fs(KERNEL_DS);
    vfs_readv(file, kernel_buffer, len, &pos);
    set_fs(old_fs);
vfs_readv expects a user-provided pointer, so without the set_fs() call, the access_ok check inside vfs_readv() would fail on our kernel buffer. We use set_fs() to temporarily disable that checking.
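The save, override, restore pattern can be modelled in a few lines of ordinary C. This is only a sketch under stated assumptions: every `model_*` name and both `MODEL_*` constants are hypothetical stand-ins, and a single global replaces what is per-thread state in the real kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's USER_DS and KERNEL_DS limits. */
#define MODEL_USER_DS   UINT64_C(0xC0000000)  /* user/kernel boundary */
#define MODEL_KERNEL_DS UINT64_MAX            /* effectively no boundary */

/* The real limit is per-thread; one global models a single task here. */
static uint64_t model_addr_limit = MODEL_USER_DS;

uint64_t model_get_fs(void) { return model_addr_limit; }
void model_set_fs(uint64_t limit) { model_addr_limit = limit; }

/* Range check against whatever limit is currently in force. */
bool model_access_ok(uint64_t addr, uint64_t size)
{
    uint64_t limit = model_get_fs();
    return size <= limit && addr <= limit - size;
}

/* Stand-in for a function that only accepts "user" pointers. */
int model_read_into(uint64_t buf, uint64_t len)
{
    return model_access_ok(buf, len) ? 0 : -1;  /* -1 models -EFAULT */
}

/* The wrapper pattern from the text: lift the limit, call, restore. */
int model_kernel_read(uint64_t kernel_buf, uint64_t len)
{
    uint64_t old_fs = model_get_fs();
    model_set_fs(MODEL_KERNEL_DS);
    int ret = model_read_into(kernel_buf, len);
    model_set_fs(old_fs);
    assert(model_get_fs() == old_fs);  /* the override must not leak */
    return ret;
}
```

Calling `model_read_into` directly with an address above `MODEL_USER_DS` fails, while `model_kernel_read` succeeds with the same address and leaves the original limit in place afterwards. The important property is that the restore step only runs if control actually returns to the wrapper.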
When the kernel oopses, perhaps because of a NULL pointer dereference in kernelspace, or because of a call to the BUG() macro to indicate an assertion failure, the kernel attempts to clean up, and then tries to kill the current process by calling the do_exit function.
When the kernel does so, it’s still running in the same process context it was in before the oops occurred, including any set_fs() override, if applicable. That means do_exit will get called with access_ok disabled: not something anyone expected when they wrote the individual pieces of this system.
As it turns out, do_exit contains a write to a user-controlled address that expects access_ok to be working properly!
clear_child_tid is a feature where, on thread exit, the kernel can be made to write a zero into a specified address in that thread’s address space, in order to notify other threads of that exit.
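The feature described above can be observed from userspace. The sketch below is one way to do so on Linux with glibc's `clone()` wrapper and the `CLONE_CHILD_CLEARTID` flag; `run_cleartid_demo`, `child_main`, and `child_tid_word` are names made up for this example, and the 64 KiB stack size is an arbitrary choice.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

/* The word the kernel is asked to zero when the child exits. */
static volatile pid_t child_tid_word;

static int child_main(void *arg)
{
    (void)arg;
    return 0;  /* exit immediately */
}

/* Returns the value of child_tid_word after the child has exited,
 * or -1 if the child could not be created. */
int run_cleartid_demo(void)
{
    const size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);
    if (stack == NULL)
        return -1;

    child_tid_word = 12345;  /* arbitrary nonzero marker */

    /* CLONE_VM shares our address space with the child, so the kernel's
     * write on exit is visible here. CLONE_CHILD_CLEARTID registers the
     * address to be zeroed. The stack grows down, so pass its top. */
    pid_t pid = clone(child_main, stack + stack_size,
                      CLONE_VM | CLONE_CHILD_CLEARTID | SIGCHLD,
                      NULL, NULL, NULL, (pid_t *)&child_tid_word);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    pid_t waited = waitpid(pid, NULL, 0);
    assert(waited == pid);  /* the child has fully exited */

    int result = (int)child_tid_word;
    free(stack);
    return result;
}
```

If the kernel behaves as described, `run_cleartid_demo()` should return 0 rather than the 12345 marker, because the word was cleared on the child's way out.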
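The kernel implements this feature by simply storing a pointer to the to-be-zeroed address in struct task_struct (which represents a single thread or process), and, on exit, having mm_release, called from do_exit, write a zero through that pointer with put_user(0, tsk->clear_child_tid).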
This is normally safe, because put_user checks that its second argument falls into the “userspace” segment before doing a write. But, if we are running with get_fs() == KERNEL_DS, it will happily accept any address at all, even one pointing into kernel space.
So, if we find any kernel NULL dereference, or other page fault, that we can trigger after a set_fs(KERNEL_DS), we can trick the kernel into a user-controlled write into kernel memory!
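To make the chain of events concrete, here is a toy simulation in plain C. Nothing in it touches a real kernel: memory is an array whose lower half plays the role of userspace, and all `sim_*` names and `SIM_*` constants are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy address space: word indices below SIM_USER_WORDS are "user" memory,
 * the rest is "kernel" memory. */
#define SIM_USER_WORDS  8u
#define SIM_TOTAL_WORDS 16u

struct sim_task {
    uint32_t mem[SIM_TOTAL_WORDS]; /* combined user + kernel memory */
    size_t addr_limit;             /* models get_fs(): first index refused */
    size_t clear_child_tid;        /* user-chosen index to zero on exit */
};

/* Models put_user(): refuse the write unless idx passes the limit check. */
int sim_put_user(struct sim_task *t, uint32_t val, size_t idx)
{
    assert(t != NULL);
    if (idx >= t->addr_limit)
        return -1;                 /* models -EFAULT */
    t->mem[idx] = val;
    return 0;
}

/* Models the clear_child_tid step on the exit path. */
void sim_do_exit(struct sim_task *t)
{
    (void)sim_put_user(t, 0, t->clear_child_tid); /* result is ignored */
}

/* Models a code path that oopses. If under_kernel_ds is set, the oops
 * happens while the limit is lifted, and nothing restores it before the
 * task is killed. */
void sim_oops(struct sim_task *t, bool under_kernel_ds)
{
    if (under_kernel_ds)
        t->addr_limit = SIM_TOTAL_WORDS;  /* models set_fs(KERNEL_DS) */
    sim_do_exit(t);                       /* the oops handler kills the task */
}

/* Point clear_child_tid at a "kernel" word, oops, and report what that
 * word contains afterwards. */
uint32_t sim_run(bool under_kernel_ds)
{
    const size_t target = 12;             /* an index in the kernel half */
    struct sim_task t = { .addr_limit = SIM_USER_WORDS,
                          .clear_child_tid = target };
    t.mem[target] = 0xdeadbeef;
    sim_oops(&t, under_kernel_ds);
    return t.mem[target];
}
```

In this model `sim_run(false)` leaves the kernel word at `0xdeadbeef`, because the limit check rejects the write, while `sim_run(true)` returns 0: the same exit path now zeroes a location the "user" chose inside "kernel" memory.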
splice() et al.
An obvious question at this point is: How much of the kernel can an attacker cause to run with get_fs() == KERNEL_DS?
There are a number of small special cases. For example, the binary sysctl compatibility code works by calling the normal handlers from kernelspace, under set_fs(). A handful of compat-mode (32 on 64) syscalls work similarly.
By far the biggest source I’ve found, however, is the splice() system call. The splice() system call is a relatively recent addition to Linux, and allows for zero-copy transfer of pages between a pipe and another file descriptor.
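For readers who have not used it, a minimal sketch of a splice() call from userspace follows. It moves a short message from a pipe into one end of an AF_UNIX socket pair and reads it back from the other end. The helper name `splice_pipe_to_socket` and the 4096-byte cap are choices made for this example only, and the socket type is an ordinary one rather than anything discussed in this post.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Moves `len` bytes of `msg` through a pipe into a socket with splice(),
 * then reads them back from the peer socket into `out`.
 * Returns the number of bytes read back, or -1 on error. */
ssize_t splice_pipe_to_socket(const char *msg, size_t len, char *out)
{
    int pipefd[2] = { -1, -1 };
    int sock[2] = { -1, -1 };
    ssize_t moved;
    ssize_t result = -1;

    /* Keep the message small so no call here blocks on a full buffer. */
    assert(len > 0 && len <= 4096);

    if (pipe(pipefd) == -1)
        return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sock) == -1)
        goto done;

    /* Fill the pipe from an ordinary userspace buffer. */
    if (write(pipefd[1], msg, len) != (ssize_t)len)
        goto done;

    /* Ask the kernel to move the data from the pipe to the socket.
     * The NULL offsets are required for pipes and sockets. */
    moved = splice(pipefd[0], NULL, sock[0], NULL, len, 0);
    if (moved != (ssize_t)len)
        goto done;

    result = read(sock[1], out, len);

done:
    if (pipefd[0] >= 0) close(pipefd[0]);
    if (pipefd[1] >= 0) close(pipefd[1]);
    if (sock[0] >= 0) close(sock[0]);
    if (sock[1] >= 0) close(sock[1]);
    return result;
}
```

From the caller's point of view the data never passes through a userspace buffer between the pipe and the socket; how the kernel delivers it to the destination depends on what that file descriptor supports.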
As of 2.6.31, attempts to splice() to or from an fd that doesn’t support special handling to actually do zero-copy splice will fall back on doing an ordinary sendmsg() on the fd … from the kernel, using set_fs() in order to pass in kernel buffers.
What that means is that by using splice(), an attacker can call the bulk of the code in most obscure filesystems and socket types (which tend not to have explicit splice() support) with a segment override in place. Conveniently for an attacker, that is also exactly a description of where the bulk of the random security bugs tend to be.
This is also exactly the technique Dan’s exploit uses. He uses CVE-2010-3849, an otherwise boring NULL pointer dereference I reported in the Econet network protocol. His exploit code does a splice() to an econet socket, causing the econet_sendmsg handler to get called under set_fs(KERNEL_DS). When it oopses, do_exit is called, and he gets a user-controlled write into kernel memory. Everything else is just details.
[1] Back in Linux 1.x, this function actually set the %fs register on i386. It hasn't in years, but it's used in too many places for changing the name to be worth it.