<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>No Name Blog &#187; performance</title>
	<atom:link href="http://ebroder.net/tag/performance/feed/" rel="self" type="application/rss+xml" />
	<link>http://ebroder.net</link>
	<description>Because all the cool names are taken</description>
	<lastBuildDate>Mon, 14 Jun 2010 03:14:41 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>Fast Computing in the Kernel</title>
		<link>http://ebroder.net/2010/01/02/fast-computing-in-the-kernel/</link>
		<comments>http://ebroder.net/2010/01/02/fast-computing-in-the-kernel/#comments</comments>
		<pubDate>Sun, 03 Jan 2010 01:32:55 +0000</pubDate>
		<dc:creator>evan</dc:creator>
				<category><![CDATA[planet sipb]]></category>
		<category><![CDATA[posts]]></category>
		<category><![CDATA[binary translation]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[singularity]]></category>
		<category><![CDATA[virtualization]]></category>
		<category><![CDATA[vmware]]></category>

		<guid isPermaLink="false">http://ebroder.net/?p=326</guid>
		<description><![CDATA[I&#8217;m kind of inspired by Geoffrey&#8217;s speculative write-up on Linux seccomp to do a speculative write-up of my own. Most of the SIPB people around here will recognize this discussion, as we&#8217;ve had it a couple of times. My 6.UAT TA will recognize it as well, since I presented on this as a &#8220;representative M.Eng. <a href='http://ebroder.net/2010/01/02/fast-computing-in-the-kernel/'>[...]</a>]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m kind of inspired by <a href="http://geofft.mit.edu/blog/">Geoffrey&#8217;s</a> speculative write-up on <a href="http://geofft.mit.edu/blog/sipb/33">Linux seccomp</a> to do a speculative write-up of my own. Most of the SIPB people around here will recognize this discussion, as we&#8217;ve had it a couple of times. My 6.UAT TA will recognize it as well, since I presented on this as a &#8220;representative M.Eng. thesis&#8221;&mdash;that is, something that I <em>could</em> do, but have no intention of <em>actually</em> doing, for my M.Eng. thesis.</p>
<p>To set up the premise here, programs that do a lot of number crunching tend to run fast regardless of how they&#8217;re running, whether that&#8217;s natively, under virtualization, or whatever. They&#8217;re generally allowed to do almost everything they need to without any help from the operating system, or any other layers sitting between them and the hardware.</p>
<p>On the other hand, any program that needs to interact with the outside world at all does so using a <strong>system call</strong>, which is basically a special function that causes the program to jump into the operating system itself. Because you don&#8217;t want random processes to have unfiltered access to raw hardware, a surprising amount of functionality is exposed through system calls, including <tt>read</tt>, <tt>write</tt>, <tt>send</tt>, and <tt>recv</tt>. This means that applications such as, say, Apache, spend almost all of their time doing system calls, since all a web server really does is <tt>read</tt> a file from disk, and then <tt>send</tt> it over the network.</p>
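<p>As a rough illustration of that last point, here is a minimal Python sketch of the read-then-send loop. The helper names <tt>serve_file</tt> and <tt>demo</tt> are made up for this example, and a local <tt>socket.socketpair()</tt> stands in for a real network connection. It is not how Apache is actually implemented, and it only counts the system calls the loop asks for explicitly.</p>
<pre>
```python
import os
import socket
import tempfile


def serve_file(path, conn, chunk_size=2048):
    """Copy a file to a socket and count the explicit system calls made."""
    calls = 0
    fd = os.open(path, os.O_RDONLY)          # open(2)
    calls += 1
    try:
        while True:
            chunk = os.read(fd, chunk_size)  # read(2)
            calls += 1
            if not chunk:                    # empty read means end of file
                break
            while chunk:
                sent = conn.send(chunk)      # send(2), may be a partial send
                calls += 1
                chunk = chunk[sent:]
    finally:
        os.close(fd)                         # close(2)
        calls += 1
    return calls


def demo(size=6000):
    """Serve a small temporary file over a socket pair (hypothetical setup).

    The payload is kept small so it fits in the socket buffer, because
    nothing reads from the other end until serve_file has returned.
    """
    payload = b"x" * size
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    left, right = socket.socketpair()
    try:
        calls = serve_file(path, left)
        left.close()                         # signals end-of-stream
        received = bytearray()
        while True:
            part = right.recv(65536)         # recv(2)
            if not part:
                break
            received += part
    finally:
        right.close()
        os.unlink(path)
    return calls, bytes(received) == payload
```
</pre>
<p>Nearly every line in the body of that loop is a trip into the kernel, which is why the cost of each trip matters so much for this kind of program.</p>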
<p>The problem comes in when you consider the context switch between userspace applications and the kernelspace operating system needed to execute a system call. As it turns out, this context switch is <em>slooow</em>. How slow is it? Well, we can look at <a href="http://www.cs.wisc.edu/areas/os/Seminar/schedules/papers/Deconstructing_Process_Isolation_final.pdf">a paper</a> from Microsoft Research. Their highly experimental operating system, Singularity, is flexible enough that it can run applications either with or without the context switch required in a traditional operating system. Here&#8217;s what they found:</p>
<table>
<tr>
<th>&nbsp;</th>
<th colspan="4">Cost (CPU Cycles)</th>
</tr>
<tr>
<th>&nbsp;</th>
<th>ABI call<sup>[<a name="id326-1" href="#id326-fn1">1</a>]</sup></th>
<th>Yield<sup>[<a name="id326-2" href="#id326-fn2">2</a>]</sup></th>
<th>PSR<sup>[<a name="id326-3" href="#id326-fn3">3</a>]</sup></th>
<th>Create Proc<sup>[<a name="id326-4" href="#id326-fn4">4</a>]</sup></th>
</tr>
<tr>
<th>Singularity<br />SIP-Phys<sup>[<a name="id326-5" href="#id326-fn5">5</a>]</sup></th>
<td>80</td>
<td>365</td>
<td>1,041</td>
<td>388,162</td>
</tr>
<tr>
<th>Singularity<br />HIP-R3<sup>[<a name="id326-6" href="#id326-fn6">6</a>]</sup></th>
<td>304</td>
<td>638</td>
<td>2,580</td>
<td>830,999</td>
</tr>
<tr>
<th>FreeBSD</th>
<td>878</td>
<td>911</td>
<td>13,304</td>
<td>1,032,254</td>
</tr>
<tr>
<th>Linux</th>
<td>437</td>
<td>906</td>
<td>5,797</td>
<td>719,447</td>
</tr>
<tr>
<th>Windows</th>
<td>627</td>
<td>753</td>
<td>6,344</td>
<td>5,375,735</td>
</tr>
</table>
<div style="font-size: 85%">
[<a href="#id326-1" name="id326-fn1">1</a>] Their terminology for a system call. On each operating system tested, they specifically chose a system call that could always return very quickly.<br />
[<a href="#id326-2" name="id326-fn2">2</a>] Surrender remaining time in the current thread of execution and schedule another thread.<br />
[<a href="#id326-3" name="id326-fn3">3</a>] &#8220;Process-Send-Receive&#8221; &#8211; their term for an <acronym title="Inter-process communication">IPC</acronym> benchmark that sends a byte of data back and forth between two separate processes.<br />
[<a href="#id326-4" name="id326-fn4">4</a>] Create a new process. Equivalent to a <tt>fork</tt>+<tt>exec</tt> in UNIX terminology.<br />
[<a href="#id326-5" name="id326-fn5">5</a>] Singularity running without the hardware context switch.<br />
[<a href="#id326-6" name="id326-fn6">6</a>] Singularity running with a hardware context switch.
</div>
<p>What&#8217;s the take-away here? There are two. First, adding hardware isolation to Singularity makes a system call take almost 4 times as long (304 cycles versus 80). Second, Singularity without hardware isolation beats every other operating system in the table on every benchmark, and all of those systems use a hardware context switch (of course, they&#8217;re also much more featureful than Singularity).</p>
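<p>To make those ratios concrete, a few lines of Python can recompute them. The cycle counts are copied from the table above; the names <tt>COSTS</tt>, <tt>BENCHMARKS</tt>, and <tt>slowdown</tt> exist only for this sketch.</p>
<pre>
```python
# Cycle counts copied from the table above, in column order.
BENCHMARKS = ("ABI call", "Yield", "PSR", "Create Proc")
COSTS = {
    "Singularity SIP-Phys": (80, 365, 1041, 388162),
    "Singularity HIP-R3": (304, 638, 2580, 830999),
    "FreeBSD": (878, 911, 13304, 1032254),
    "Linux": (437, 906, 5797, 719447),
    "Windows": (627, 753, 6344, 5375735),
}


def slowdown(system, baseline="Singularity SIP-Phys"):
    """Return each benchmark's cost for `system` as a multiple of `baseline`."""
    pairs = zip(BENCHMARKS, COSTS[system], COSTS[baseline])
    return {name: cost / base for name, cost, base in pairs}


if __name__ == "__main__":
    for system in COSTS:
        ratios = slowdown(system)
        line = ", ".join(f"{name}: {ratio:.1f}x" for name, ratio in ratios.items())
        print(f"{system}: {line}")
```
</pre>
<p>For the ABI call column, the hardware-isolated Singularity configuration comes out at 3.8 times the cost of the software-isolated one.</p>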
<p>So that&#8217;s our problem. To try and solve it, we look to the techniques pioneered by VMWare for total machine virtualization.</p>
<p>When running an operating system under virtualization, we need some way to simulate what would otherwise be privileged operations on raw hardware. There are a lot of approaches to solving this problem, but VMWare primarily uses just-in-time <strong>binary translation</strong> (or BT). With BT, VMWare&#8217;s Virtual Machine Monitor (VMM) examines instructions just before they&#8217;re executed. If there are any unsafe instructions, they&#8217;re replaced with calls into functions in the VMM that emulate those instructions.</p>
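<p>A toy model may help make the idea concrete. The sketch below is not VMWare&#8217;s implementation: the instruction set, the <tt>ToyVMM</tt> class, and the <tt>translate</tt> and <tt>run</tt> functions are all invented for illustration, and a real translator emits machine code that runs directly on the CPU rather than being interpreted like this.</p>
<pre>
```python
# Instructions are (opcode, argument) pairs in a made-up instruction set.
# These opcodes stand in for the "unsafe" instructions that guest code
# must not be allowed to execute directly.
PRIVILEGED = {"cli", "sti", "out"}


class ToyVMM:
    """Holds the virtual machine state that emulated instructions act on."""

    def __init__(self):
        self.acc = 0
        self.interrupts_enabled = True
        self.port_writes = []

    def emulate(self, op, arg):
        """Emulate one privileged instruction against the virtual state."""
        if op == "cli":
            self.interrupts_enabled = False
        elif op == "sti":
            self.interrupts_enabled = True
        elif op == "out":
            self.port_writes.append((arg, self.acc))
        else:
            raise ValueError("no emulation for " + op)


def translate(block):
    """Rewrite a block so privileged instructions become calls into the VMM."""
    translated = []
    for op, arg in block:
        if op in PRIVILEGED:
            translated.append(("call_vmm", (op, arg)))
        else:
            translated.append((op, arg))   # safe instructions pass through
    return translated


def run(block, vmm):
    """Translate a block just before executing it, then execute it."""
    for op, arg in translate(block):
        if op == "call_vmm":
            vmm.emulate(*arg)
        elif op == "add":
            vmm.acc += arg
        else:
            raise ValueError("unknown instruction " + op)
```
</pre>
<p>Running <tt>[("add", 2), ("cli", None), ("out", 128), ("add", 3)]</tt> through <tt>run</tt> leaves the safe <tt>add</tt> instructions untouched, while <tt>cli</tt> and <tt>out</tt> only ever change the state held by the <tt>ToyVMM</tt> object.</p>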
<p>That on its own doesn&#8217;t make anything fast, but VMWare takes this a step further. In order to minimize the overhead of this emulation, VMWare&#8217;s VMM runs the translated code <strong>within the kernel</strong> (ring 0). It turns out that, because of this, VMWare&#8217;s VMM has an average slowdown of only 4% (see <a href="http://www.vmware.com/pdf/asplos235_adams.pdf">A Comparison of Software and Hardware Techniques for x86 Virtualization</a> for detailed analysis).</p>
<p>Here&#8217;s the question: can we take the binary translation techniques from VMWare&#8217;s VMM and adapt them to run otherwise unmodified <em>processes</em> instead of <em>operating systems</em> within the kernel? And if we do, what is the performance impact?</p>
<p>If we can bypass the context switch expense measured by the Singularity team, it could easily more than compensate for the relatively small overhead of running applications under binary translation. I would go so far as to say that I expect syscall-heavy applications to run faster.</p>
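<p>For a back-of-the-envelope feel for where the break-even point might sit, here is a small Python sketch. It rests on two big assumptions that nobody has measured: that the Singularity system call costs from the table would carry over to another kernel, and that the 4% binary translation slowdown reported for whole virtual machines would also apply to a single translated process. The function names are made up for this example.</p>
<pre>
```python
HW_SYSCALL = 304          # cycles, Singularity HIP-R3 ABI call (from the table)
SW_SYSCALL = 80           # cycles, Singularity SIP-Phys ABI call (from the table)
BT_OVERHEAD_PERCENT = 4   # VMM slowdown quoted above, assumed to carry over


def break_even_compute_cycles():
    """Cycles of ordinary computation per system call at which the cycles
    saved on the call equal the cycles lost to translation overhead."""
    saved_per_call = HW_SYSCALL - SW_SYSCALL
    return saved_per_call * 100 / BT_OVERHEAD_PERCENT


def estimated_speedup(compute_cycles_per_call):
    """Ratio of native cost to in-kernel translated cost per system call."""
    native = compute_cycles_per_call + HW_SYSCALL
    translated = (
        compute_cycles_per_call * (100 + BT_OVERHEAD_PERCENT) / 100
        + SW_SYSCALL
    )
    return native / translated
```
</pre>
<p>Under those assumptions the break-even point is 5,600 cycles of computation per system call. A program that enters the kernel more often than that would come out ahead, and one that enters it less often would lose.</p>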
<p>Putting the Singularity and VMWare papers right next to each other, this is a pretty obvious next step. But as far as I know, nobody&#8217;s done it yet. Does anybody else know of an implementation of this idea for a real operating system? Maybe a Linux kernel module that lets you run certain apps in-kernel? If it&#8217;s out there, I haven&#8217;t found it yet.</p>
]]></content:encoded>
			<wfw:commentRss>http://ebroder.net/2010/01/02/fast-computing-in-the-kernel/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

