Complex systems do not always fail for complex reasons. They quite often fail for the absolutely dumbest possible reasons.

- nelhage

Like say, for instance, when you’re debugging why one of your servers takes 7 seconds to do what the other server can do in less than 1.

root@black-mesa:~# time lvcreate -n test -L 1G xenvg
 Logical volume "test" created

real    0m0.309s
user    0m0.000s
sys     0m0.008s
root@torchwood-institute:~# time lvcreate -n test -L 1G xenvg
 Logical volume "test" created

real    0m7.282s
user    0m6.396s
sys     0m0.312s

Sometimes it turns out to just be that the directory it’s logging to takes 7 seconds to list:

root@black-mesa:~# time ls -a /etc/lvm/archive/ >/dev/null

real    0m0.005s
user    0m0.000s
sys     0m0.004s

root@torchwood-institute:~# time ls -a /etc/lvm/archive/ >/dev/null

real    0m7.007s
user    0m6.644s
sys     0m0.364s

And occasionally, that’s not just because your disk is failing or you’re running into caching issues. Occasionally, it’s just because that directory somehow has hundreds of thousands of files in it:

root@torchwood-institute:~# ls -a /etc/lvm/archive | wc -l
301369
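
A quick way to check for this kind of bloat is to count a directory's entries without sorting them first. The sketch below is not from the original debugging session: count_entries is a hypothetical helper name, and it assumes a find that supports -mindepth and -maxdepth (GNU and BSD find both do). It also assumes no filename contains a newline, since it counts lines.

```shell
#!/bin/sh
# count_entries DIR: print how many entries (dotfiles included) DIR contains.
# Hypothetical helper for illustration; not part of LVM.
count_entries() {
    # -mindepth 1 skips DIR itself; -maxdepth 1 keeps find from recursing.
    # tr strips the padding that some wc implementations add around the number.
    find "$1" -mindepth 1 -maxdepth 1 | wc -l | tr -d ' '
}
```

Something like count_entries /etc/lvm/archive would then give a number comparable to the ls -a count above, minus the "." and ".." entries.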

And very, very rarely, if the gods are smiling on you, deleting those hundreds of thousands of files causes things to work again.

root@torchwood-institute:~# find /etc/lvm/archive -name '.lvm_torchwood-institute.mit.edu_*' -delete
root@torchwood-institute:~# time ls -a /etc/lvm/archive >/dev/null

real	0m0.015s
user	0m0.000s
sys	0m0.012s
root@torchwood-institute:~# time lvcreate -n test -L 1G xenvg
  Logical volume "test" created

real	0m0.341s
user	0m0.000s
sys	0m0.016s
root@torchwood-institute:~# time lvremove -f /dev/xenvg/test
  Logical volume "test" successfully removed

real	0m0.226s
user	0m0.004s
sys	0m0.012s

It kind of sucks to spend a week on and off, talking with developers, trying to figure out what’s going on. But it’s also really nice when it turns out to be something I can just fix myself.

So, at some point in the recent past (I don’t think I noticed when), ssh-keygen started displaying a “randomart” representation of keys that it generates:

fanty:~ evan$ ssh-keygen -f test -C evan
Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in test.
Your public key has been saved in test.pub.
The key fingerprint is:
20:e9:b0:5b:5a:2b:ad:e8:4d:e4:b3:a0:32:49:2d:97 evan
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|     .           |
|  . o .          |
|   + . .         |
|  o.=   S        |
| ooE .           |
|.o*+o            |
|=.+oo            |
|=o.o             |
+-----------------+

Does anybody know why or what it’s supposed to mean?

Software engineers often find themselves in a world that seems contradictory. On the one hand, the amount of work our computers do just to watch a Flash video is truly staggering. On the other hand, half the time our computers slow to a grinding halt and crash when we try to watch a Flash video.

We have home internet connections that allow us to download high definition movies in a matter of 15 minutes. But as soon as someone starts trying to download one of those movies, everybody else on the network finds their connection slowed to an unusable crawl.

Look at how far we’ve come! …and how far away we still are.

If you want to share in the misery that is the world of computers, come join me at my new site, ibtsocs.com, short for “I Bemoan The State Of Computer Science”, an expression of all that is right but still wrong. It’s like FML, but for developers.

Note: I’ve been holding onto the domain for about a month now, but I decided to build the site last night. As such, it’s just about the most hacked together thing you can imagine. Which really seems to fit in with the theme of the website. Expect more features to come soon!

Meanwhile, this blog post was thrown together in about 5 minutes to make this week’s deadline for the Iron Blogger event. Hence the lack of eloquence or even interesting commentary on what a crazy idea it is to decide to throw a website together in 8 hours.

I’m kind of inspired by Geoffrey’s speculative write-up on Linux seccomp to do a speculative write-up of my own. Most of the SIPB people around here will recognize this discussion, as we’ve had it a couple of times. My 6.UAT TA will recognize it as well, since I presented on this as a “representative M.Eng. thesis”—that is, something that I could do, but have no intention of actually doing, for my M.Eng. thesis.

To set up the premise here, programs that do a lot of number crunching tend to run fast regardless of how they’re running, whether that’s natively, under virtualization, or whatever. They’re generally allowed to do almost everything they need to without any help from the operating system, or any other layers sitting on top of them.

On the other hand, any program that needs to interact with the outside world at all does so using a system call, which is basically a special function that causes the program to jump into the operating system itself. Because you don’t want random processes to have unfiltered access to raw hardware, a surprising amount of functionality is exposed through system calls, including read, write, send, and recv. This means that applications such as, say, Apache, spend almost all of their time doing system calls, since all a web server really does is read a file from disk, and then send it over the network.

The problem comes in when you consider the context switch between userspace applications and the kernelspace operating system needed to execute a system call. As it turns out, this context switch is slooow. How slow is it? Well, we can look at a paper from Microsoft Research. Their highly experimental operating system, Singularity, is flexible enough that it can run applications either with or without the context switch required in a traditional operating system. Here’s what they found:

                           Cost (CPU cycles)
                           ABI call[1]  Yield[2]    PSR[3]  Create Proc[4]
Singularity SIP-Phys[5]             80       365     1,041         388,162
Singularity HIP-R3[6]              304       638     2,580         830,999
FreeBSD                            878       911    13,304       1,032,254
Linux                              437       906     5,797         719,447
Windows                            627       753     6,344       5,375,735

[1] Their terminology for a system call. On each operating system tested, they specifically chose a system call that could always return very quickly.
[2] Surrender remaining time in the current thread of execution and schedule another thread.
[3] “Process-Send-Receive” – their term for an IPC benchmark that sends a byte of data back and forth between two separate processes.
[4] Create a new process. Equivalent to a fork+exec in UNIX terminology.
[5] Singularity running without the hardware context switch.
[6] Singularity running with a hardware context switch.

What’s the take-away here? There are two. First, adding hardware isolation to Singularity increases the time to execute a system call by almost a factor of 4 (304 cycles versus 80). Second, Singularity is way faster than other operating systems, all of which use a hardware context switch (of course, they’re also much more featureful than Singularity).
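
The "almost a factor of 4" figure comes straight from the ABI call column of the table above. Here is a minimal sketch of that arithmetic, using awk for the floating-point division; ratio is a hypothetical helper name, and the inputs are copied from the table.

```shell
#!/bin/sh
# ratio A B: print A/B rounded to one decimal place.
# Hypothetical helper for illustration only.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# ABI call cost in CPU cycles, taken from the table above.
echo "Singularity HIP-R3 vs SIP-Phys: $(ratio 304 80)x"
echo "Linux vs Singularity SIP-Phys:  $(ratio 437 80)x"
echo "Linux vs Singularity HIP-R3:    $(ratio 437 304)x"
```

The same helper works for the other columns, for example ratio 2580 1041 for the PSR benchmark.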

So that’s our problem. To try and solve it, we look to the techniques pioneered by VMware for total machine virtualization.

When running an operating system under virtualization, we need some way to simulate what would otherwise be privileged operations on raw hardware. There are a lot of approaches to solving this problem, but VMware primarily uses just-in-time binary translation (or BT). With BT, VMware’s Virtual Machine Monitor (VMM) examines instructions just before they’re executed. If there are any unsafe instructions, they’re replaced with calls into functions in the VMM that emulate those instructions.

That on its own doesn’t make anything fast, but VMware takes this a step further. In order to minimize the overhead of this emulation, VMware’s VMM runs the translated code within the kernel (ring 0). It turns out that, because of this, VMware’s VMM has an average slowdown of only 4% (see A Comparison of Software and Hardware Techniques for x86 Virtualization for detailed analysis).

Here’s the question: can we take the binary translation techniques from VMware’s VMM and adapt them to run otherwise unmodified processes instead of operating systems within the kernel? And if we do, what is the performance impact?

If we can bypass the context switch expense measured by the Singularity team, it could easily more than compensate for the relatively small overhead of running applications under binary translation. I would go so far as to say that I expect syscall-heavy applications to run faster.

Putting the Singularity and VMware papers right next to each other, this is a pretty obvious next step. But as far as I know, nobody’s done it yet. Does anybody else know of an implementation of this idea for a real operating system? Maybe a Linux kernel module that lets you run certain apps in-kernel? If it’s out there, I haven’t found it yet.

(Is there a non-ambiguous abbreviation for “appliance”? I don’t want to use “app builders”, because people would obviously get the wrong impression…)

I’m still looking for an appliance builder that has everything I want. Right now the three software packages on my list are Cobbler+koan, Thincrust, or maybe Kiwi.

I started looking into Kiwi a while back, but backed off because they seem to have DRY problems. Not to mention it’s written in Perl.

Cobbler and Thincrust look a little more promising, at least on the surface, but it’s hard to get a good sense of the kind of flexibility I can get out of them. It certainly doesn’t look like either of them has the ability to install a Debian/Ubuntu system without being handed the 20 lines of required pre-seed, but I could be wrong.

Does anybody have experience with these? Does anybody know if they fit the 4 features from last time, or could be hammered into fitting them?

I’ve said it before – there are a lot of appliance builders out there. With virtualization and the cloud being the hot ticket items of the day, everybody wants to try their hand at writing the software to provision those VMs.

Unfortunately, they all seem to suck. At least, the Debian/Ubuntu ones do. I haven’t found a VM or appliance builder application that I like, mostly because they all seem to be bad knock-offs of the actual debian-installer or ubuntu-installer.

The appliance builder I want has four key features:

  1. It should run unattended.

    This one is kind of obvious, but rules out options like just running the debian-installer by hand and answering the questions as they come up. I do a lot of repetitive installs, and it’s important that I can hand my appliance builder a pre-crafted config file and get a customized, but totally unattended install.

  2. It should run trivially in a virtual environment, and seamlessly support multiple hypervisors.

    All of the appliance builders that anybody uses, or at least the ones I’ve attempted to use (VMBuilder and xen-create-image) run in the hypervisor. This is anywhere from an inconvenience to an actual security threat.

    I want to be able to offer users a high degree of customizability, but my users are generally untrusted, and you simply can’t allow any flexibility when the appliance installer runs as root on your hypervisor. You certainly can’t allow your users to install packages out of their own apt repositories, including PPAs – a targeted attacker can easily break out of the chroot they’re put into when their package installs, and any package can include code that runs as root. Even if you don’t allow your users to customize appliances, the principle of least privilege says you shouldn’t be running the installs as root when you can run them as not-root, and you pretty clearly can.

    Therefore, being able to run the appliance builder in a VM is an absolute must, regardless of the performance hit. We were able to adapt xen-create-image to do this for Invirt, but it wasn’t pretty, it took a lot of shoehorning, and it’s still pretty fragile.

    Not only do I want to be able to install my appliances in a guest, but I also want to be able to run that guest under various virtualization environments. Many of my deployments are still heavily dependent on Xen. I have other deployments using KVM. Ideally, I’d like my appliance builder to work fairly transparently with multiple virtualization environments, although it’s probably OK for me if the resulting appliance image only works with the particular hypervisor that created it.

  3. It should use the distribution’s installer mechanism instead of jerry-rigging its own.

    All of the appliance building applications I know of use their own installation code. For Debian/Ubuntu installers, this means running debootstrap and then frobbing the output. Even Kiwi, the software behind the very shiny SUSE Studio, effectively starts by unpacking a list of RPMs by hand.

    There’s a lot of complexity in the Debian/Ubuntu installers. When you try to duplicate it, you will get it wrong. The resulting system will not be equivalent to the same system installed using a CD. I’ve certainly seen cases before where an installer-built image was different than an appliance-builder-built image, and it’s incredibly frustrating. Maybe this is something that could be fixed by actively developing the appliance builder (Ubuntu’s VMBuilder seems to be getting help from the ubuntu-installer developers), but it inherently seems like a waste of time to have this kind of code duplication.

  4. It should have a layer of abstraction that keeps me from repeating myself.

    Simply booting the debian-installer or ubuntu-installer with a preseed file would certainly address the first three points. However, the preseed file needed simply to get an unattended Ubuntu install with no other bells and whistles is more than 20 lines long. Even if I have a template I can copy around, it’s gross from a DRY perspective.

    I want my appliance builder to be configured through a config format that abstracts that away. I only want to specify that which can’t be reasonably guessed, not everything that I might want to have a say about.

All of the virtualization projects I’m involved in right now – Invirt, Virtigo, and some smaller personal projects – could really benefit from this kind of infrastructure piece, which means I’m likely to attempt to write it if it doesn’t exist. And as far as I know, this kind of appliance building application doesn’t exist for Debian and Ubuntu, at the very least. I’ll admit that I know almost nothing about other Linux distributions. Do any of them get this more right?

As part of my summer internship, I needed to write an installer for VMs. For various reasons, I wasn’t able to use the multitude of VM installers already out there, but one thing I noticed is that most of them don’t actually install a bootloader. They create a /boot/grub/menu.lst, but never run grub-install.

Turns out this is because it’s hard to do. grub-install is very complicated and seems to be pretty explicitly designed for the case of running in an installer environment, where all of the disks and block devices are laid out the same way as they will be the next time you boot. When you’re installing from a host into a loop mount or something, that’s definitely not the case.

In trying to make this work, I discovered a few core issues:

  • grub-install assumes that the block device you’re installing onto “looks like” the sort of device you’d normally install GRUB onto (i.e. is named like a hard disk or floppy – hda, sda, fd0, etc.)
  • grub-install uses df to determine the block device a given file or directory’s filesystem is on. That works really poorly when you’re already chrooting into your loop mount.
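
To make the second issue concrete, here is a minimal sketch of what a df-based device lookup looks like. This illustrates the general technique and is an assumption, not a copy of grub-install's actual code; fs_device is a hypothetical name. It relies on POSIX df -P, whose second output line starts with the filesystem's source device.

```shell
#!/bin/sh
# fs_device PATH: print the device (first df column) backing PATH's filesystem.
# Hypothetical helper for illustration; not taken from grub-install.
fs_device() {
    # -P forces the POSIX one-line-per-filesystem format, so the device
    # is always the first field of the second line of output.
    df -P "$1" | awk 'NR == 2 { print $1 }'
}
```

A lookup like this can only report what the current mount table shows for that path, which is why its answer depends so heavily on where it is run from.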

If you read my wording carefully, you might see where I’m going with this. In order to get grub-install to work, I needed to convince it that it’s installing onto a hard drive, and I needed to run it outside of the loop mount.

The former is obviously a bit more challenging, and to accomplish that, I used the device-mapper to create a node named something like /dev/mapper/hda.

I’ve only tested this on an Ubuntu Jaunty host so far, so I can’t guarantee that it works on Debian or even other Ubuntu versions, but I think it should. I’d love to hear if you have good or bad experiences on other Linux versions.

Here’s roughly how it works (you’ve probably performed some of these steps already in the process of running an installer):

  1. Attach your partitioned disk image to a loop device:
    mathias:~ evan$ sudo losetup --show --find disk.img
    /dev/loop0
  2. To set up the device map, you’ll need the major and minor numbers of the loop device, and the size (in bytes) of the disk. The latter is easiest to get from the disk image file, instead of from the loop device. In the listings below, the device numbers are the “7, 0” and the size is the “10737418240”:
    mathias:~ evan$ ls -l /dev/loop0
    brw-rw---- 1 root disk 7, 0 2009-07-18 11:27 /dev/loop0
    mathias:~ evan$ ls -l disk.img
    -rw-r--r-- 1 evan evan 10737418240 2009-08-04 15:28 disk.img
  3. Create a device-mapper node. Any name of the form hd[a-z], sd[a-z], or vd[a-z] will work. Others might as well. The size of the disk should be converted to 512-byte sectors, and the device numbers for the loop device should be in the form major:minor. This will create a new device node in /dev/mapper:
    mathias:~ evan$ echo '0 20971520 linear 7:0 0' | sudo dmsetup create hda
    mathias:~ evan$ ls -l /dev/mapper/hda
    brw-rw---- 1 root disk 252, 4 2009-08-04 15:36 /dev/mapper/hda
    
  4. Use kpartx to create device-mapper nodes for the partitions on the disk image:
    mathias:~ evan$ sudo kpartx -a /dev/mapper/hda
    mathias:~ evan$ ls -l /dev/mapper/hda*
    brw-rw---- 1 root disk 252, 4 2009-08-04 15:36 /dev/mapper/hda
    brw-rw---- 1 root disk 252, 5 2009-08-04 15:38 /dev/mapper/hda1
    brw-rw---- 1 root disk 252, 6 2009-08-04 15:38 /dev/mapper/hda2
  5. Mount the root partition onto a tempdir (note: this is not a loop mount, because the kernel already thinks this is a real block device):
    mathias:~ evan$ mktemp -d
    /tmp/tmp.MPUXeJWqpn
    mathias:~ evan$ sudo mount /dev/mapper/hda1 /tmp/tmp.MPUXeJWqpn
  6. Create a fake device.map for grub-install to use (yeah, this is a bad use of tee, but I’m trying to be clear about what I’m doing):
    mathias:~ evan$ echo '(hd0) /dev/mapper/hda' | sudo tee /tmp/tmp.MPUXeJWqpn/boot/grub/device.map
    (hd0) /dev/mapper/hda
  7. And now, for the grand finale, actually install GRUB from outside the chroot:
    mathias:~ evan$ sudo grub-install --root-directory=/tmp/tmp.MPUXeJWqpn /dev/mapper/hda
    grub-probe: error: no mapping exists for `hda1'
    [: 494: =: unexpected operator
    Installing GRUB to /dev/mapper/hda as (hd0)...
    Installation finished. No error reported.
    This is the contents of the device map /tmp/tmp.MPUXeJWqpn/boot/grub/device.map.
    Check if this is correct or not. If any of the lines is incorrect,
    fix it and re-run the script `grub-install'.
    
    (hd0) /dev/mapper/hda

    (You don’t need to worry about those two errors at the beginning of the output – it’s some logic specialized for XFS filesystems)

  8. Clean up the mess you made:
    mathias:~ evan$ sudo umount /tmp/tmp.MPUXeJWqpn
    mathias:~ evan$ sudo rm -rf /tmp/tmp.MPUXeJWqpn
    mathias:~ evan$ sudo kpartx -d /dev/mapper/hda
    mathias:~ evan$ sudo dmsetup remove hda
    mathias:~ evan$ sudo losetup -d /dev/loop0
  9. Finally, examine your disk image, and see that it definitely has GRUB installed:
    mathias:~ evan$ file disk.img
    disk.img: x86 boot sector; GRand Unified Bootloader, stage1 version 0x3, 1st sector stage2 0x884009; partition 1: ID=0x83, active, starthead 0, startsector 1, 18876374 sectors; partition 2: ID=0x82, starthead 254, startsector 18876375, 2088450 sectors

And there you have it! You will, of course, still need to write out GRUB’s menu.lst through some other means (such as Debian/Ubuntu’s update-grub).
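
Steps 2 and 3 above involve a small unit conversion that is easy to get wrong by hand. As a minimal sketch, the hypothetical helper dm_linear_table below builds the table line that gets piped to dmsetup create, given the image size in bytes and the loop device's major and minor numbers. It assumes the image size is a whole number of 512-byte sectors and a shell with 64-bit arithmetic.

```shell
#!/bin/sh
# dm_linear_table SIZE_BYTES MAJOR MINOR
# Print a device-mapper "linear" table line covering the whole backing device:
#   <start sector> <length in sectors> linear <major:minor> <offset sector>
# Hypothetical helper for illustration; not part of dmsetup.
dm_linear_table() {
    size_bytes=$1
    major=$2
    minor=$3
    # device-mapper tables are expressed in 512-byte sectors.
    sectors=$((size_bytes / 512))
    echo "0 $sectors linear $major:$minor 0"
}

# Values from the listings in step 2: a 10737418240-byte image on loop device 7:0.
dm_linear_table 10737418240 7 0
```

The printed line matches the one used in step 3, so the output could be piped into sudo dmsetup create hda in the same way.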

© 2011 No Name Blog