Author Archive

Making Xen Suck Less: Part 3

February 21st, 2010 @ 11:23 am UTC

Paravirtualized Clocks

In theory, Xen domUs are supposed to have their system clocks forcibly synced to the dom0’s. In practice, due to some incompatibility in Ubuntu’s version of either the dom0 or the domU patches, that doesn’t work even though the feature is enabled, which leads to clock drift and occasional weird clock lockup bugs.

The easiest way to fix this is to disable the Xen clock syncing entirely, and rely on the standard Linux clock mechanism. You can do that by adding these two lines just before exit 0 in /etc/rc.local:

echo '1' > /proc/sys/xen/independent_wallclock
echo 'jiffies' > /sys/devices/system/clocksource/clocksource0/current_clocksource

You’ll want to be sure to run NTP or some other service to keep your clock in sync.
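After a reboot it can be worth confirming that both settings actually stuck. The following is a minimal sketch rather than part of the original setup: read_setting is a hypothetical helper name, and available_clocksource is the usual sysfs companion to current_clocksource, which is assumed to exist on your kernel.

```shell
#!/bin/sh
# Print the contents of a procfs/sysfs file, or "unavailable" if it cannot
# be read (for example, on a kernel without the Xen patches).
read_setting() {
    if [ -r "$1" ]; then
        cat "$1"
    else
        echo "unavailable"
    fi
}

cs=/sys/devices/system/clocksource/clocksource0
echo "independent_wallclock:  $(read_setting /proc/sys/xen/independent_wallclock)"
echo "current clocksource:    $(read_setting "$cs/current_clocksource")"
echo "available clocksources: $(read_setting "$cs/available_clocksource")"
```

If the rc.local lines took effect, the first two values should read 1 and jiffies.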

Epic Softfail

February 11th, 2010 @ 2:26 pm UTC

Somebody complained today that an e-mail I sent got caught in MIT’s spam filters, so I took a look at the message to see if I could figure out why it went through for me.

(For the record, using just the headers at your disposal to figure out why spam filtering is doing something strange is always a futile endeavor.)

I didn’t figure out what was going on, but then I noticed this in the headers:

Received: by 10.102.218.17 with SMTP id q17cs56702mug;
        Thu, 11 Feb 2010 09:42:54 -0800 (PST)
Received: by 10.224.59.28 with SMTP id j28mr118230qah.109.1265910085825;
        Thu, 11 Feb 2010 09:41:25 -0800 (PST)
Return-Path:

Received: from dmz-mailsec-scanner-4.mit.edu (DMZ-MAILSEC-SCANNER-4.MIT.EDU [18.9.25.15])
        by mx.google.com with ESMTP id 17si5760822qyk.35.2010.02.11.09.41.25;
        Thu, 11 Feb 2010 09:41:25 -0800 (PST)
Received-SPF: softfail (google.com: domain of transitioning prvs=165867b240=uptrack@ksplice.com does not designate 18.9.25.15 as permitted sender) client-ip=18.9.25.15;
Authentication-Results: mx.google.com; spf=softfail (google.com: domain of transitioning prvs=165867b240=uptrack@ksplice.com does not designate 18.9.25.15 as permitted sender) smtp.mail=prvs=165867b240=uptrack@ksplice.com
Received: from mailhub-dmz-1.mit.edu (MAILHUB-DMZ-1.MIT.EDU [18.9.21.41])
	by dmz-mailsec-scanner-4.mit.edu (Symantec Brightmail Gateway) with SMTP id A4.AB.13801.441447B4; Thu, 11 Feb 2010 12:41:24 -0500 (EST)
Received: from dmz-mailsec-scanner-1.mit.edu (DMZ-MAILSEC-SCANNER-1.MIT.EDU [18.9.25.12])
	by mailhub-dmz-1.mit.edu (8.13.8/8.9.2) with ESMTP id o1BHdVsa009188
	for ; Thu, 11 Feb 2010 12:41:23 -0500
X-AuditID: 1209190f-b7bbfae0000035e9-cf-4b7441445cb2
Received: from mail-qy0-f202.google.com (mail-qy0-f202.google.com [209.85.221.202])
	by dmz-mailsec-scanner-1.mit.edu (Symantec Brightmail Gateway) with SMTP id 6B.AF.10714.241447B4; Thu, 11 Feb 2010 12:41:22 -0500 (EST)
Received: by qyk40 with SMTP id 40so1272989qyk.14
        for ; Thu, 11 Feb 2010 09:41:22 -0800 (PST)
Received: by 10.229.130.205 with SMTP id u13mr94912qcs.47.1265910082497;
        Thu, 11 Feb 2010 09:41:22 -0800 (PST)
Received: from ksplice.com ([64.27.0.149])
        by mx.google.com with ESMTPS id 20sm1621806qyk.9.2010.02.11.09.41.20
        (version=TLSv1/SSLv3 cipher=RC4-MD5);
        Thu, 11 Feb 2010 09:41:21 -0800 (PST)

A lot of spew, of course, but the interesting lines are the two SPF softfails near the top: “domain of transitioning prvs=165867b240=uptrack@ksplice.com does not designate 18.9.25.15 as permitted sender”.

It took me a little while to figure out what was going on – I know that Ksplice sends its e-mails through Gmail, and I know that ksplice.com’s SPF record includes the Gmail mail servers.

But that e-mail went to a list at MIT, which then expanded to my MIT e-mail address, which then forwards to my GAFYD address. That SPF validation was performed by Gmail when it received the e-mail from MIT’s mail servers, and MIT’s mail servers aren’t authorized to send mail from ksplice.com.
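To pull just the SPF verdict and the IP it was checked against out of a pile of headers like the one above, something along these lines should work. It is a rough sketch: spf_results is a made-up helper name, message.eml is a placeholder file, and the pattern assumes the Received-SPF layout Gmail used in this particular message.

```shell
#!/bin/sh
# Read a raw message on stdin and print "<result> <client-ip>" for each
# Received-SPF header, assuming Gmail's "... client-ip=ADDR;" layout.
spf_results() {
    sed -n 's/^Received-SPF: \([a-z]*\) .*client-ip=\([^;]*\);.*$/\1 \2/p'
}

# Example (message.eml is a hypothetical saved copy of the e-mail):
#   spf_results < message.eml
```

The IP it reports is the last hop that handed the message to Gmail, which is why a forwarded message gets judged against the forwarder's address rather than the original sender's.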

I’m not sure how this is avoidable for this sort of mail forwarding – MIT’s mail servers could just as easily be spammers pretending to forward mail from ksplice.com. Maybe the solution is some way for me to tell Gmail that MIT is authorized to send my mail to me. Either way, it’s just more proof that SPF doesn’t work.

Update: Anders points out that this can be solved with “Sender Rewriting Scheme”, which basically just changes the envelope sender on the message to something that contains an obfuscated form of the original e-mail address, but whose domain is that of the forwarder.

Making Xen Suck Less: Part 2

February 10th, 2010 @ 9:32 pm UTC

Now for part two in my ongoing series on making Xen suck less. Last time we looked at making networking work for hardware virtualized machines. Networking for paravirtualized VMs does work out of the box, but this hint might help if you’re running into performance problems.

Paravirtualized Networking Performance

If you’re running a web server or some other server that’s sending large files (or sometimes small files), you may find that your VM seems to hang inexplicably on those transfers.

For some reason, the paravirtualized Xen networking drivers advertise that they support on-board TCP segmentation. In fact, they seem to pass the packets onto the wire un-segmented, which frequently will cause the packets to be dropped for going over the MTU.

If you’re using xen-create-image, there’s a commented out line in /etc/network/interfaces that runs ethtool -K eth0 tx off. That’s close to the right issue. You actually want to add a line to your /etc/network/interfaces so that it looks something like this:

auto eth0
iface eth0 inet static
 address 18.181.0.80
 gateway 18.181.0.1
 netmask 255.255.0.0

 post-up ethtool -K eth0 tso off
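Once the interface is back up, the offload state can be checked with ethtool -k (lowercase k lists features, uppercase K changes them). Here is a small sketch for picking the TSO line out of that output; tso_state is a hypothetical helper name, and the pattern is written on the assumption that the label is spelled either "tcp segmentation offload" or "tcp-segmentation-offload", depending on the ethtool version.

```shell
#!/bin/sh
# Read `ethtool -k IFACE` output on stdin and print "on" or "off" for TSO.
# Accepts both the space-separated and the hyphenated spelling of the label.
tso_state() {
    sed -n 's/^[[:space:]]*tcp[- ]segmentation[- ]offload: *\([a-z]*\).*$/\1/p' | head -n 1
}

# Example, on the VM itself:
#   ethtool -k eth0 | tso_state
```

If this still prints on after the interface has been restarted, the post-up line probably isn't being run.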

Stay tuned for more hints, including how to deal with clock issues and magic sysrqs. I’ll also be pulling walkthroughs together on converting paravirtualized to hardware virtualized VMs, and how to upgrade older Ubuntu releases to more recent ones safely.

Making Xen Suck Less: Part 1

February 7th, 2010 @ 12:55 pm UTC

Right now, all of my Xen dom0s run Ubuntu Hardy with Xen 3.3 from hardy-backports. Before we even talk about making Xen work, that statement bears some looking at.

I use Xen for a variety of reasons. Some are historical – the Invirt Project was built on top of Xen, and migrating away from Xen to a solution like KVM or VMWare would require working with users that are running paravirtualized operating systems. Some are circumstantial – I still have hardware that doesn’t support hardware virtualization.

My reasons for using Ubuntu are far less logical – I know how to use it, and don’t want to learn anything else if I don’t have to.

And I use Xen 3.3 because it’s way more stable than Xen 3.2.

In any case, if you find yourself using Xen 3.3 on Ubuntu Hardy as a dom0, there are a lot of tricks I’ve picked up for making it work better. Over the next few weeks, I’ll be working my way through them. I’ll be tagging them all with xen-tips for easy retrieval later.

As a disclaimer, I have no idea if these problems have been fixed in later versions of Xen or Linux, or if they’re specific to the Xen and/or kernel shipped by Ubuntu. For me, there’s a lot of value in getting all of my software from my distribution, so these instructions are designed to help do that.

HVM Networking

I have no idea whose fault this is, but HVM networking just doesn’t seem to work out of the box. qemu-dm, which emulates the VM’s devices, hooks the VM to a tap net device, while Xen sets up networking for a vifN.0 device. As far as I can tell, the intent was to connect the tap and vif devices, but nothing does.

For Invirt, we worked around this by writing a wrapper script around qemu-dm to make sure everything was set up correctly. If you want to use this script, you can drop qemu-dm-invirt in /usr/sbin and qemu-ifup in /etc/xen. (You’ll probably want to replace vif-invirtroute in qemu-ifup with vif-bridge or vif-route or whatever networking script you’re using.)

/usr/lib/xen/bin/qemu-dm is hard-coded to run /etc/xen/qemu-ifup, if it exists. Without the qemu-dm-invirt wrapper, though, qemu-ifup doesn’t have any access to the domain ID for the domain it’s setting up. qemu-ifup then sets up and triggers the normal Xen networking script, which repeats the same setup it did for the vifN.0 interface.

Then, in your Xen config file, be sure to set device_model = '/usr/sbin/qemu-dm-invirt'.

Complex Systems and Simple Failures

January 25th, 2010 @ 8:45 am UTC

Complex systems do not always fail for complex reasons. They quite often fail for the absolutely dumbest possible reasons.

- nelhage

Like say, for instance, when you’re debugging why one of your servers takes 7 seconds to do what the other server can do in less than 1.

root@black-mesa:~# time lvcreate -n test -L 1G xenvg
 Logical volume "test" created

real    0m0.309s
user    0m0.000s
sys     0m0.008s
root@torchwood-institute:~# time lvcreate -n test -L 1G xenvg
 Logical volume "test" created

real    0m7.282s
user    0m6.396s
sys     0m0.312s

Sometimes it turns out to just be that the directory it’s logging to takes 7 seconds to list:

root@black-mesa:~# time ls -a /etc/lvm/archive/ >/dev/null

real    0m0.005s
user    0m0.000s
sys     0m0.004s

root@torchwood-institute:~# time ls -a /etc/lvm/archive/ >/dev/null

real    0m7.007s
user    0m6.644s
sys     0m0.364s

And occasionally, that’s not just because your disk is failing or you’re running into caching issues. Occasionally, it’s just because that directory somehow has hundreds of thousands of files in it:

root@torchwood-institute:~# ls -a /etc/lvm/archive | wc -l
301369

And very, very rarely, if the gods are smiling on you, deleting those hundreds of thousands of files causes things to work again.

root@torchwood-institute:~# find /etc/lvm/archive -name '.lvm_torchwood-institute.mit.edu_*' -delete
root@torchwood-institute:~# time ls -a /etc/lvm/archive >/dev/null

real	0m0.015s
user	0m0.000s
sys	0m0.012s
root@torchwood-institute:~# time lvcreate -n test -L 1G xenvg
  Logical volume "test" created

real	0m0.341s
user	0m0.000s
sys	0m0.016s
root@torchwood-institute:~# time lvremove -f /dev/xenvg/test
  Logical volume "test" successfully removed

real	0m0.226s
user	0m0.004s
sys	0m0.012s

It kind of sucks to spend a week on and off, talking with developers, trying to figure out what’s going on. But it’s also really nice when it turns out to be something I can just fix myself.
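A cheap way to catch this kind of thing earlier is to count the files in a directory and complain past some threshold. This is only a sketch: count_files and check_dir are hypothetical names, the limit of 1000 is arbitrary, and -maxdepth is a GNU/BSD find extension rather than strict POSIX.

```shell
#!/bin/sh
# Count regular files directly inside a directory. Unlike plain ls, find
# does not sort the entries first. (Names containing newlines are miscounted.)
count_files() {
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# Print a warning if a directory holds more than LIMIT files.
check_dir() {
    dir=$1
    limit=$2
    n=$(count_files "$dir")
    if [ "$n" -gt "$limit" ]; then
        echo "WARNING: $dir has $n files (limit $limit)"
    else
        echo "OK: $dir has $n files"
    fi
}

# Example:
#   check_dir /etc/lvm/archive 1000
```

Run from cron against /etc/lvm/archive, a check like this would have flagged the directory long before it reached 300,000 entries.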

ssh-keygen randomart

January 18th, 2010 @ 2:35 pm UTC

So, at some point in the recent past (I don’t think I noticed when), ssh-keygen started displaying a “randomart” representation of keys that it generates:

fanty:~ evan$ ssh-keygen -f test -C evan
Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in test.
Your public key has been saved in test.pub.
The key fingerprint is:
20:e9:b0:5b:5a:2b:ad:e8:4d:e4:b3:a0:32:49:2d:97 evan
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|     .           |
|  . o .          |
|   + . .         |
|  o.=   S        |
| ooE .           |
|.o*+o            |
|=.+oo            |
|=o.o             |
+-----------------+

Does anybody know why or what it’s supposed to mean?

Black Bean Soup

January 13th, 2010 @ 12:41 am UTC

Some of you may not have heard yet, but I had jaw surgery (technically, “orthognathic surgery”—now you can go find the Wikipedia article) over winter break. The surgery was to correct for a long-standing crossbite that had apparently gotten worse in the last few years.

Before anybody asks, I’m completely fine. I’ve been completely fine. I am not currently and at no point was in any pain.

However, I do still have an acrylic splint molded all the way around my upper teeth and running across the roof of my mouth to hold my teeth in place while the bone heals, which has left me unable to speak clearly. I’ve also been on a strictly no-chewing diet, up until next week (at which point I move on to a very-soft-foods-only diet). No chewing, in this case, means that, if I have to put it between my teeth and apply any pressure before I can swallow it, then I can’t eat it.

It is certainly better than, say, a liquids-only diet, but it still applies some pretty hefty restrictions to what I can eat. You have to chew more food than you realize. Even most soups don’t work on their own, because you usually chew down any pieces of vegetable or meat in them. So, for the last 3 weeks, my diet has been almost exclusively yogurt, mashed potatoes (or sweet potatoes), soups (frequently after they’ve been run through a blender), and smoothies.

I was hoping that I would get a blog entry out of the soup I made tonight. Unfortunately, it was pretty lackluster. So instead I’ll fall back on one of my mom’s recipes, which is probably the best food-that-wasn’t-intended-to-be-put-through-a-blender I’ve had since the surgery.

Black Bean Soup

2 tablespoons vegetable oil
3/4 cup white onion, diced
3/4 cup celery, diced
1/2 cup carrot, diced
1/4 cup green pepper, diced
2 tablespoons garlic, minced
60 ounces canned black beans
4 cups chicken stock
1 tablespoon apple cider vinegar
2 teaspoons chili powder
1/2 teaspoon cayenne pepper
1/2 teaspoon cumin
1/2 teaspoon salt
1/4 teaspoon concentrated liquid smoke (hickory)
cheddar/monterey jack, shredded
green onions, chopped

  1. Heat 2 tablespoons of oil in a large saucepan over medium/low heat. Add onion, celery, carrot, bell pepper, and garlic to the oil and simmer slowly (or “sweat” as it’s called), for 15 minutes or until the onions are practically clear. Keep the heat low enough that the veggies don’t brown.
  2. While you cook the veggies, pour the canned beans into a strainer and rinse them under cold water.
  3. Measure 3 cups of the drained and strained beans into a food processor with 1 cup of chicken stock. Puree on high speed until smooth.
  4. When the veggies are ready, add the pureed beans, the whole beans, the rest of the chicken stock, and every other ingredient in the list (down to liquid smoke) to the pot. Bring mixture to a boil, then reduce heat and simmer uncovered for 50 to 60 minutes or until soup has thickened and all the ingredients are tender. Serve the soup topped with a couple tablespoons of the cheese blend and a teaspoon or so of chopped green onion.

I can’t say how this is normally. After it’s run through the blender, the beans make it incredibly thick—a lot like refried beans, but black beans, so it’s tastier. Mom thought it looked a little questionable, but the thickness for me was a welcome contrast from a lot of the thinner soups I was living off of.

Modifications

I have no idea whether Mom used the liquid smoke or not; it doesn’t seem like the kind of thing she’d keep around. I know she cooked it in a slow cooker instead of on the stove, but that may have mostly been because she just got the slow cooker and was excited. And I know she used more garlic than the recipe calls for; my mom and I both assume that the garlic in a recipe should be at least doubled, if not quadrupled.

Other

Of course, one of my favorite recipe sites (seriously, the photos are like food porn) posted her own Black Bean Soup recipe today. Another mother enjoying her new slow cooker, it seems. I’m intrigued by smitten kitchen’s abundance of bell peppers. I would love to do a side-by-side if I have a chance some time.

Other Other

I have a few other recipes that have adapted well to my current situation, but are also quite good before they’ve been through a blender. I’ll probably post some of them over the next few weeks. But I’m also looking for ways to add more variety! Do you have ideas for food or other soups that either are already well pulverized or would still be good after they’re run through a blender?

IBTSOCS

January 11th, 2010 @ 5:21 am UTC

Software engineers often find themselves in a world that seems contradictory. On the one hand, the amount of work our computers do just to watch a Flash video is truly staggering. On the other hand, half the time our computers slow to a grinding halt and crash when we try to watch a Flash video.

We have home internet connections that allow us to download high definition movies in a matter of 15 minutes. But as soon as someone starts trying to download one of those movies, everybody else on the network finds their connection slowed to an unusable crawl.

Look at how far we’ve come! …and how far away we still are.

If you want to share in the misery that is the world of computers, come join me at my new site, ibtsocs.com, short for “I Bemoan The State Of Computer Science”, an expression of all that is right but still wrong. It’s like FML, but for developers.

Note: I’ve been holding onto the domain for about a month now, but I decided to build the site last night. As such, it’s just about the most hacked together thing you can imagine. Which really seems to fit in with the theme of the website. Expect more features to come soon!

Meanwhile, this blog post was thrown together in about 5 minutes to make this week’s deadline for the Iron Blogger event. Hence the lack of eloquence or even interesting commentary on what a crazy idea it is to decide to throw a website together in 8 hours.

Fast Computing in the Kernel

January 2nd, 2010 @ 8:32 pm UTC

I’m kind of inspired by Geoffrey’s speculative write-up on Linux seccomp to do a speculative write-up of my own. Most of the SIPB people around here will recognize this discussion, as we’ve had it a couple of times. My 6.UAT TA will recognize it as well, since I presented on this as a “representative M.Eng. thesis”—that is, something that I could do, but have no intention of actually doing, for my M.Eng. thesis.

To set up the premise here, programs that do a lot of number crunching tend to run fast regardless of how they’re running, whether that’s natively, under virtualization, or whatever. They’re generally allowed to do almost everything they need to without any help from the operating system, or any other layers sitting on top of them.

On the other hand, any program that needs to interact with the outside world at all does so using a system call, which is basically a special function that causes the program to jump into the operating system itself. Because you don’t want random processes to have unfiltered access to raw hardware, a surprising amount of functionality is exposed through system calls, including read, write, send, recv. This means that applications such as, say, apache, spend almost all of their time doing system calls, since all a web server really does is read a file from disk, and then send it over the network.
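To get a feel for how much a program can lean on system calls, dd makes a handy stand-in, since it performs one read and one write per block. The sketch below moves the same number of bytes twice, once a byte at a time and once in a single block; zero_bytes is a hypothetical helper name and the sizes are arbitrary.

```shell
#!/bin/sh
# Emit COUNT blocks of BS zero bytes on stdout. dd issues one read() and
# one write() per block, so the block size controls the syscall count.
zero_bytes() {
    dd if=/dev/zero bs="$1" count="$2" 2>/dev/null
}

# The same 100,000 bytes both times; only the number of system calls differs.
zero_bytes 1 100000 > /dev/null      # about 200,000 read/write calls
zero_bytes 100000 1 > /dev/null      # one read and one write
```

Timing the two invocations separately, or running each dd under strace -c, should show the first one spending nearly all of its time crossing into and out of the kernel.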

The problem comes in when you consider the context switch between userspace applications and the kernelspace operating system needed to execute a system call. As it turns out, this context switch is slooow. How slow is it? Well, we can look at a paper from Microsoft Research. Their highly experimental operating system, Singularity, is flexible enough that it can run applications either with or without the context switch required in a traditional operating system. Here’s what they found:

                           Cost (CPU Cycles)
                           ABI call[1]   Yield[2]   PSR[3]   Create Proc[4]
Singularity SIP-Phys[5]             80        365    1,041          388,162
Singularity HIP-R3[6]              304        638    2,580          830,999
FreeBSD                            878        911   13,304        1,032,254
Linux                              437        906    5,797          719,447
Windows                            627        753    6,344        5,375,735
[1] Their terminology for a system call. On each operating system tested, they specifically chose a system call that could always return very quickly.
[2] Surrender remaining time in the current thread of execution and schedule another thread.
[3] “Process-Send-Receive” – their term for an IPC benchmark that sends a byte of data back and forth between two separate processes.
[4] Create a new process. Equivalent to a fork+exec in UNIX terminology.
[5] Singularity running without the hardware context switch.
[6] Singularity running with a hardware context switch.

What’s the take-away here? There are two. First, adding hardware isolation to Singularity makes a system call take almost 4 times as long (304 cycles versus 80). Second, Singularity is way faster than the other operating systems, all of which use a hardware context switch (of course, they’re also much more featureful than Singularity).
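The "almost a factor of 4" figure falls straight out of the ABI call column. As a quick worked check, here is each system's ABI call cost relative to Singularity without hardware isolation (80 cycles); ratio is a hypothetical helper name, and the cycle counts are copied from the table.

```shell
#!/bin/sh
# Print CYCLES / BASE to one decimal place.
ratio() {
    awk -v c="$1" -v b="$2" 'BEGIN { printf "%.1f\n", c / b }'
}

base=80                  # Singularity SIP-Phys, ABI call
ratio 304 "$base"        # Singularity HIP-R3 -> 3.8
ratio 878 "$base"        # FreeBSD            -> 11.0
ratio 437 "$base"        # Linux              -> 5.5
ratio 627 "$base"        # Windows            -> 7.8
```

So a bare system call costs somewhere between 5 and 11 times as much on the conventional operating systems as on Singularity with no hardware boundary.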

So that’s our problem. To try and solve it, we look to the techniques pioneered by VMWare for total machine virtualization.

When running an operating system under virtualization, we need some way to simulate what would otherwise be privileged operations on raw hardware. There are a lot of approaches to solving this problem, but VMWare primarily uses just-in-time binary translation (or BT). With BT, VMWare’s Virtual Machine Monitor (VMM) examines instructions just before they’re executed. If there are any unsafe instructions, they’re replaced with calls into functions in the VMM that emulate those instructions.

That on its own doesn’t make anything fast, but VMWare takes this a step further. In order to minimize the overhead of this emulation, VMWare’s VMM runs the translated code within the kernel (ring 0). It turns out that, because of this, VMWare’s VMM has an average slowdown of only 4% (see A Comparison of Software and Hardware Techniques for x86 Virtualization for detailed analysis).

Here’s the question: can we take the binary translation techniques from VMWare’s VMM and adapt them to run otherwise unmodified processes instead of operating systems within the kernel? And if we do, what is the performance impact?

If we can bypass the context switch expense measured by the Singularity team, it could easily more than compensate for the relatively small overhead of running applications under binary translation. I would go so far as to say that I expect syscall-heavy applications to run faster.

Putting the Singularity and VMWare papers right next to each other, this is a pretty obvious next step. But as far as I know, nobody’s done it yet. Does anybody else know of an implementation of this idea for a real operating system? Maybe a Linux kernel module that lets you run certain apps in-kernel? If it’s out there, I haven’t found it yet.

Continuing the Search for Appliance Builders

December 18th, 2009 @ 12:33 am UTC

(Is there a non-ambiguous abbreviation for “appliance”? I don’t want to use “app builders”, because people would obviously get the wrong impression…)

I’m still looking for an appliance builder that has everything I want. Right now the three software packages on my list are Cobbler+koan, Thincrust, or maybe Kiwi.

I started looking into Kiwi a while back, but backed off because they seem to have DRY problems. Not to mention it’s written in Perl.

Cobbler and Thincrust look a little more promising, at least on the surface, but it’s hard to get a good sense of the kind of flexibility I can get out of them. It certainly doesn’t look like either of them has the ability to install a Debian/Ubuntu system without being handed the 20 lines of required pre-seed, but I could be wrong.

Does anybody have experience with these? Does anybody know if they fit the 4 features from last time, or could be hammered into fitting them?