Posts Tagged ‘linux’

Complex Systems and Simple Failures

January 25th, 2010 @ 8:45 am UTC

Complex systems do not always fail for complex reasons. They quite often fail for the absolutely dumbest possible reasons.

- nelhage

Like say, for instance, when you’re debugging why one of your servers takes 7 seconds to do what the other server can do in less than 1.

root@black-mesa:~# time lvcreate -n test -L 1G xenvg
 Logical volume "test" created

real    0m0.309s
user    0m0.000s
sys     0m0.008s
root@torchwood-institute:~# time lvcreate -n test -L 1G xenvg
 Logical volume "test" created

real    0m7.282s
user    0m6.396s
sys     0m0.312s

Sometimes it turns out to just be that the directory it’s logging to takes 7 seconds to list:

root@black-mesa:~# time ls -a /etc/lvm/archive/ >/dev/null

real    0m0.005s
user    0m0.000s
sys     0m0.004s

root@torchwood-institute:~# time ls -a /etc/lvm/archive/ >/dev/null

real    0m7.007s
user    0m6.644s
sys     0m0.364s

And occasionally, that’s not just because your disk is failing or you’re running into caching issues. Occasionally, it’s just because that directory somehow has hundreds of thousands of files in it:

root@torchwood-institute:~# ls -a /etc/lvm/archive | wc -l
301369

And very, very rarely, if the gods are smiling on you, deleting those hundreds of thousands of files causes things to work again.

root@torchwood-institute:~# find /etc/lvm/archive -name '.lvm_torchwood-institute.mit.edu_*' -delete
root@torchwood-institute:~# time ls -a /etc/lvm/archive >/dev/null

real	0m0.015s
user	0m0.000s
sys	0m0.012s
root@torchwood-institute:~# time lvcreate -n test -L 1G xenvg
  Logical volume "test" created

real	0m0.341s
user	0m0.000s
sys	0m0.016s
root@torchwood-institute:~# time lvremove -f /dev/xenvg/test
  Logical volume "test" successfully removed

real	0m0.226s
user	0m0.004s
sys	0m0.012s
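
For next time, the check that finally exposed the problem is easy to script. This is only a sketch, not anything LVM ships: `count_entries` and `timed_count` are names I made up, and the demo runs against a throwaway directory rather than /etc/lvm/archive. Note that `os.scandir` skips `.` and `..`, so the count will be two lower than `ls -a | wc -l`.

```python
import os
import tempfile
import time


def count_entries(path):
    """Count directory entries without sorting them or stat-ing each one."""
    count = 0
    with os.scandir(path) as entries:
        for _ in entries:
            count += 1
    return count


def timed_count(path):
    """Return (entry_count, seconds_elapsed) for one scan of the directory."""
    start = time.perf_counter()
    n = count_entries(path)
    return n, time.perf_counter() - start


if __name__ == "__main__":
    # Demo on a temporary directory; on a real host, pass /etc/lvm/archive instead.
    with tempfile.TemporaryDirectory() as demo_dir:
        for i in range(1000):
            # Hypothetical file names, only here to populate the directory.
            open(os.path.join(demo_dir, f"demo_{i:05d}.vg"), "w").close()
        n, seconds = timed_count(demo_dir)
        print(f"{n} entries scanned in {seconds:.4f}s")
```

A count in the hundreds of thousands for a directory that should hold a handful of metadata backups is the hint worth looking for.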

It kind of sucks to spend a week on and off, talking with developers, trying to figure out what’s going on. But it’s also really nice when it turns out to be something I can just fix myself.

Fast Computing in the Kernel

January 2nd, 2010 @ 8:32 pm UTC

I’m kind of inspired by Geoffrey’s speculative write-up on Linux seccomp to do a speculative write-up of my own. Most of the SIPB people around here will recognize this discussion, as we’ve had it a couple of times. My 6.UAT TA will recognize it as well, since I presented on this as a “representative M.Eng. thesis”—that is, something that I could do, but have no intention of actually doing, for my M.Eng. thesis.

To set up the premise here, programs that do a lot of number crunching tend to run fast regardless of how they’re running, whether that’s natively, under virtualization, or whatever. They’re generally allowed to do almost everything they need to without any help from the operating system, or any other layers sitting on top of them.

On the other hand, any program that needs to interact with the outside world at all does so using a system call, which is basically a special function that causes the program to jump into the operating system itself. Because you don’t want random processes to have unfiltered access to raw hardware, a surprising amount of functionality is exposed through system calls, including read, write, send, and recv. This means that applications such as, say, Apache spend almost all of their time doing system calls, since all a web server really does is read a file from disk and then send it over the network.
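
To make the "read a file, then send it" loop concrete, here is a minimal sketch in Python. The function name `serve_file` is made up for illustration, and the demo uses a local socket pair instead of a real network connection. The syscall names in the comments are what these calls typically map to on Linux; the exact calls can differ (for example, `open` is usually `openat` these days).

```python
import os
import socket
import tempfile


def serve_file(path, conn, chunk_size=4096):
    """Copy a file to a connected socket; return the number of bytes sent."""
    sent = 0
    fd = os.open(path, os.O_RDONLY)          # open(2) / openat(2)
    try:
        while True:
            chunk = os.read(fd, chunk_size)  # read(2)
            if not chunk:                    # empty read means end of file
                break
            conn.sendall(chunk)              # send(2), possibly more than once
            sent += len(chunk)
    finally:
        os.close(fd)                         # close(2)
    return sent


if __name__ == "__main__":
    # Small payload so it fits in the socket buffer without a reader thread.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"hello from the kernel\n")
        name = f.name
    server_side, client_side = socket.socketpair()
    try:
        print("bytes sent:", serve_file(name, server_side))
        server_side.close()
        print("client got:", client_side.recv(4096))
    finally:
        client_side.close()
        os.unlink(name)
```

Every pass through that loop crosses into the kernel at least twice, and there is almost no userspace work in between.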

The problem comes in when you consider the context switch between userspace applications and the kernelspace operating system needed to execute a system call. As it turns out, this context switch is slooow. How slow is it? Well, we can look at a paper from Microsoft Research. Their highly experimental operating system, Singularity, is flexible enough that it can run applications either with or without the context switch required in a traditional operating system. Here’s what they found:
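
You can get a rough feel for the gap from userspace before looking at their numbers. The sketch below times a pure userspace function against `os.getppid`, which on Linux should enter the kernel on every call. Big caveats: the helper names are mine, Python’s interpreter overhead is large compared to the cost being measured, and the results are wall-clock nanoseconds rather than cycles, so they are not comparable to the paper’s figures.

```python
import os
import time


def noop():
    """Pure userspace function; never enters the kernel."""
    return 0


def ns_per_call(func, iterations=200_000):
    """Average wall-clock nanoseconds per call of func, loop overhead included."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        func()
    return (time.perf_counter_ns() - start) / iterations


if __name__ == "__main__":
    user_ns = ns_per_call(noop)
    kernel_ns = ns_per_call(os.getppid)
    print(f"userspace call: {user_ns:.0f} ns")
    print(f"getppid call:   {kernel_ns:.0f} ns")
```

A C version with a cycle counter would be a much better measurement; this one only shows the shape of the experiment.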

                        Cost (CPU cycles)
                        ABI call[1]  Yield[2]  PSR[3]  Create Proc[4]
Singularity SIP-Phys[5]          80       365   1,041         388,162
Singularity HIP-R3[6]           304       638   2,580         830,999
FreeBSD                         878       911  13,304       1,032,254
Linux                           437       906   5,797         719,447
Windows                         627       753   6,344       5,375,735
[1] Their terminology for a system call. On each operating system tested, they specifically chose a system call that could always return very quickly.
[2] Surrender remaining time in the current thread of execution and schedule another thread.
[3] “Process-Send-Receive” – their term for an IPC benchmark that sends a byte of data back and forth between two separate processes.
[4] Create a new process. Equivalent to a fork+exec in UNIX terminology.
[5] Singularity running without the hardware context switch.
[6] Singularity running with a hardware context switch.

What’s the take-away here? There are two. First, adding hardware isolation to Singularity multiplies the time to execute a system call by almost a factor of 4 (304 cycles versus 80). Second, Singularity is way faster than the other operating systems, all of which use a hardware context switch (of course, they’re also much more featureful than Singularity).
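
The ratios behind those two claims are easy to check. This snippet just divides the ABI call column of the table by the SIP-Phys figure; the dictionary and function names are mine.

```python
# ABI call cost in CPU cycles, copied from the table above.
ABI_CALL_CYCLES = {
    "Singularity SIP-Phys": 80,
    "Singularity HIP-R3": 304,
    "FreeBSD": 878,
    "Linux": 437,
    "Windows": 627,
}


def slowdown_vs(baseline, costs=ABI_CALL_CYCLES):
    """Return each system's ABI call cost as a multiple of the baseline's."""
    base = costs[baseline]
    return {name: cycles / base for name, cycles in costs.items()}


if __name__ == "__main__":
    for name, ratio in slowdown_vs("Singularity SIP-Phys").items():
        print(f"{name:22s} {ratio:5.2f}x")
```

Hardware-isolated Singularity comes out at 3.8 times the software-isolated cost, and Linux at a bit under 5.5 times.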

So that’s our problem. To try and solve it, we look to the techniques pioneered by VMware for total machine virtualization.

When running an operating system under virtualization, we need some way to simulate what would otherwise be privileged operations on raw hardware. There are a lot of approaches to solving this problem, but VMware primarily uses just-in-time binary translation (or BT). With BT, VMware’s Virtual Machine Monitor (VMM) examines instructions just before they’re executed. If there are any unsafe instructions, they’re replaced with calls into functions in the VMM that emulate those instructions.

That on its own doesn’t make anything fast, but VMware takes this a step further. In order to minimize the overhead of this emulation, VMware’s VMM runs the translated code within the kernel (ring 0). It turns out that, because of this, VMware’s VMM has an average slowdown of only 4% (see A Comparison of Software and Hardware Techniques for x86 Virtualization for detailed analysis).
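
The "replace unsafe instructions with calls into the monitor" step can be sketched with a toy. To be clear, this is not how the real VMM is built: the instruction format, the `call_vmm` marker, and the `ToyMonitor` class are all hypothetical, the opcode names are only stand-ins for privileged x86 instructions, and a real translator works on machine code and caches the blocks it has already translated.

```python
# Toy model only: the instruction format and "call_vmm" marker are hypothetical.
UNSAFE_OPCODES = {"cli", "sti", "out"}


def translate(block):
    """Return a copy of block with unsafe instructions replaced by monitor calls."""
    translated = []
    for opcode, operand in block:
        if opcode in UNSAFE_OPCODES:
            translated.append(("call_vmm", (opcode, operand)))
        else:
            translated.append((opcode, operand))  # safe: passed through unchanged
    return translated


class ToyMonitor:
    """Runs translated blocks and emulates the unsafe instructions in software."""

    def __init__(self):
        self.acc = 0
        self.interrupts_enabled = True  # virtual flag; real hardware is untouched
        self.port_writes = []

    def emulate(self, opcode, operand):
        if opcode == "cli":
            self.interrupts_enabled = False
        elif opcode == "sti":
            self.interrupts_enabled = True
        elif opcode == "out":
            self.port_writes.append(operand)

    def run(self, block):
        for opcode, operand in translate(block):
            if opcode == "call_vmm":
                self.emulate(*operand)
            elif opcode == "add":
                self.acc += operand
            else:
                raise ValueError(f"unknown opcode: {opcode}")


if __name__ == "__main__":
    guest_code = [("add", 2), ("cli", None), ("out", 0x3F8), ("add", 3), ("sti", None)]
    monitor = ToyMonitor()
    monitor.run(guest_code)
    print(monitor.acc, monitor.interrupts_enabled, monitor.port_writes)
```

The important property is that the safe instructions go through untouched, which is why compute-heavy code pays so little for the translation.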

Here’s the question: can we take the binary translation techniques from VMware’s VMM and adapt them to run otherwise unmodified processes instead of operating systems within the kernel? And if we do, what is the performance impact?

If we can bypass the context switch expense measured by the Singularity team, it could easily more than compensate for the relatively small overhead of running applications under binary translation. I would go so far as to say that I expect syscall-heavy applications to run faster.
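
Here is a back-of-the-envelope version of that expectation. It leans on two guesses that neither paper establishes: that the 4% slowdown carries over to a translated userspace process, and that an in-kernel system call would cost about what Singularity’s SIP-Phys ABI call does (80 cycles, against Linux’s 437). The function names are mine.

```python
def total_cycles(compute_cycles, syscalls, syscall_cost, bt_overhead=0.0):
    """Cycles for a workload: scaled compute plus a fixed cost per system call."""
    return compute_cycles * (1.0 + bt_overhead) + syscalls * syscall_cost


def break_even_compute_per_syscall(native_cost, in_kernel_cost, bt_overhead):
    """Compute cycles per syscall at which translation overhead cancels the savings."""
    return (native_cost - in_kernel_cost) / bt_overhead


if __name__ == "__main__":
    # Assumed inputs: Linux and SIP-Phys ABI call costs, 4% translation overhead.
    limit = break_even_compute_per_syscall(437, 80, 0.04)
    print(f"break-even: about {limit:.0f} compute cycles per system call")

    # A made-up syscall-heavy workload: 1,000 cycles of work per call.
    native = total_cycles(1_000_000, 1_000, 437)
    in_kernel = total_cycles(1_000_000, 1_000, 80, bt_overhead=0.04)
    print(f"native: {native:.0f} cycles, translated in-kernel: {in_kernel:.0f} cycles")
```

Under those guesses, anything that does less than roughly 8,925 cycles of its own work between system calls would come out ahead, and a web server plausibly sits well inside that range.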

Putting the Singularity and VMware papers right next to each other, this is a pretty obvious next step. But as far as I know, nobody’s done it yet. Does anybody else know of an implementation of this idea for a real operating system? Maybe a Linux kernel module that lets you run certain apps in-kernel? If it’s out there, I haven’t found it yet.