Complex systems do not always fail for complex reasons. They quite often fail for the absolutely dumbest possible reasons.
- nelhage
Like, say, when you’re debugging why one of your servers takes 7 seconds to do what the other server can do in less than 1.
    root@black-mesa:~# time lvcreate -n test -L 1G xenvg
      Logical volume "test" created

    real    0m0.309s
    user    0m0.000s
    sys     0m0.008s

    root@torchwood-institute:~# time lvcreate -n test -L 1G xenvg
      Logical volume "test" created

    real    0m7.282s
    user    0m6.396s
    sys     0m0.312s
Sometimes it turns out to just be that the directory it’s logging to takes 7 seconds to list:
    root@black-mesa:~# time ls -a /etc/lvm/archive/ >/dev/null

    real    0m0.005s
    user    0m0.000s
    sys     0m0.004s

    root@torchwood-institute:~# time ls -a /etc/lvm/archive/ >/dev/null

    real    0m7.007s
    user    0m6.644s
    sys     0m0.364s
And occasionally, that’s not just because your disk is failing or you’re running into caching issues. Occasionally, it’s just because that directory somehow has hundreds of thousands of files in it:
    root@torchwood-institute:~# ls -a /etc/lvm/archive | wc -l
    301369
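If you want to catch this kind of thing before it costs you a week, a rough sketch like the one below might help. The function names `count_entries` and `report_big_dirs` are made up for illustration, and the threshold is whatever you consider suspicious. It counts with `find` rather than `ls`, so nothing gets sorted along the way, and it skips `.` and `..`, so its numbers come out two lower than `ls -a | wc -l`.

```shell
#!/bin/sh
# Hypothetical helpers, not part of LVM or any standard tool.

# count_entries DIR
# Print the number of entries directly inside DIR (hidden files included,
# DIR itself excluded). A file name containing a newline would be
# counted more than once.
count_entries() {
    find "$1" -mindepth 1 -maxdepth 1 | wc -l | tr -d ' '
}

# report_big_dirs THRESHOLD DIR...
# Print "count<TAB>dir" for every DIR holding more than THRESHOLD entries.
report_big_dirs() {
    threshold=$1
    shift
    for dir in "$@"; do
        n=$(count_entries "$dir")
        if [ "$n" -gt "$threshold" ]; then
            printf '%s\t%s\n' "$n" "$dir"
        fi
    done
}
```

Something like `report_big_dirs 10000 /etc/lvm/archive /etc/lvm/backup` from a cron job would presumably have flagged the directory long before it reached six digits.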
And very, very rarely, if the gods are smiling on you, deleting those hundreds of thousands of files causes things to work again.
    root@torchwood-institute:~# find /etc/lvm/archive -name '.lvm_torchwood-institute.mit.edu_*' -delete
    root@torchwood-institute:~# time ls -a /etc/lvm/archive >/dev/null

    real    0m0.015s
    user    0m0.000s
    sys     0m0.012s

    root@torchwood-institute:~# time lvcreate -n test -L 1G xenvg
      Logical volume "test" created

    real    0m0.341s
    user    0m0.000s
    sys     0m0.016s

    root@torchwood-institute:~# time lvremove -f /dev/xenvg/test
      Logical volume "test" successfully removed

    real    0m0.226s
    user    0m0.004s
    sys     0m0.012s
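Running `find ... -delete` as root on a directory under `/etc` is the sort of thing worth previewing first. Here is a minimal sketch of a wrapper that only lists matches unless you explicitly ask it to delete; `prune_archive` and its `--delete` argument are names I'm inventing for the example, not anything LVM ships. It sticks with `find` instead of `rm` with a glob because a glob that expands to hundreds of thousands of names can run into the kernel's argument-list limit.

```shell
#!/bin/sh
# Hypothetical helper, not part of LVM or any standard tool.

# prune_archive DIR PATTERN [--delete]
# List regular files directly inside DIR whose names match PATTERN.
# They are only removed when --delete is given as the third argument.
prune_archive() {
    dir=$1
    pattern=$2
    mode=${3:-}
    if [ "$mode" = "--delete" ]; then
        find "$dir" -maxdepth 1 -type f -name "$pattern" -delete
    else
        find "$dir" -maxdepth 1 -type f -name "$pattern" -print
    fi
}
```

Usage would be `prune_archive /etc/lvm/archive '.lvm_torchwood-institute.mit.edu_*'` to look at what would go, then the same command with `--delete` appended once the list looks right.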
It kind of sucks to spend a week on and off, talking with developers, trying to figure out what’s going on. But it’s also really nice when it turns out to be something I can just fix myself.