software reliability and kernels

OpenBSD 5.1 preorders are now up on the site. And minix 3.2 was also released recently. Here are some reflections on why microkernels don’t matter. I’m not saying microkernels are worse, or slow, or anything like that; they just aren’t that much better. There’s a somewhat long two-part argument coming up. Part one is that microkernel reliability is overstated; part two is that the value of reliability is itself overrated.

The general idea is that a microkernel provides protection by moving all the hard-to-get-right driver code to userland, leaving only a tiny, easily audited and verified kernel running with full system access. Actually, it’s easier to have Tanenbaum explain why he likes minix. It’s also worth reading the cited Stanford study because I’ll be referring to it.

bugs in drivers

No argument that the majority of kernel bugs are found in drivers. The question is whether we care. The minix argument makes several assumptions, chiefly that errors in code result in observable runtime misbehavior and that such misbehavior can be detected and corrected by a microkernel.

Here’s an example: *p = x; if (p == NULL) .... The bug, of course, is that we are checking for a null pointer after having already dereferenced it. This can only affect the user, however, if the pointer can be null at runtime. There may be other constraints or preconditions that eliminate that possibility, in which case the check is the bug, not the dereference. This error occurs very frequently in drivers because they’re written by all sorts of people who don’t really know what’s possible and what’s not, and who then make a misguided attempt at defensive programming. Not every null pointer bug is of this type; many are legit crashable errors. But it’s worth remembering that the excessive check bugs are a big part of the cited study.
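To make the two readings of that snippet concrete, here’s a minimal sketch of what each fix might look like. The function names store_checked and store_asserted are made up for illustration; they don’t come from any real driver or from the study.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Reading 1: callers really can pass NULL, so the dereference is the bug.
 * The fix is to check before touching the pointer.
 */
int store_checked(int *p, int x)
{
    if (p == NULL)
        return -1;      /* report the error instead of crashing */
    *p = x;
    return 0;
}

/*
 * Reading 2: p can never be NULL by contract, so the late check was the
 * bug. The fix is to delete it and, optionally, state the precondition.
 */
void store_asserted(int *p, int x)
{
    assert(p != NULL);  /* documents the contract; compiled out with NDEBUG */
    *p = x;
}
```

Which of the two applies depends entirely on what the callers can do, and that’s exactly the knowledge the person who wrote the late check didn’t have.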

Another bug is calling a blocking or sleeping function when you aren’t supposed to because it can lead to a delay or slowdown. If you’ve ever had the interface for an application stop responding to input for a few seconds while something happened in the background, you’ve experienced this bug. (The bug was more likely in the app than the kernel; don’t think a perfect kernel will suddenly make the programs you actually care about work better.) There’s not much a microkernel is going to do here. If a driver fails to respond to events in a timely manner, you can’t force it to do so.

Yet another bug is a memory leak. In a driver, this shows up fairly frequently if an error occurs during initialization. A few remarks about that. Initialization “never” fails. The amount of memory leaked is typically small, and definitely finite if initialization is a one time operation. A microkernel can try reloading the driver, but the initial error (almost certainly a mismatch between hardware and software revisions) is just going to recur, and you’ll just keep leaking memory.
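For the curious, the init-path leak and its usual fix look roughly like this. It’s a sketch under stated assumptions: struct softc, probe_hw, attach, the ring sizes and the “supported” revision number are all invented for the example, and malloc stands in for whatever allocator a real kernel would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct softc {                  /* hypothetical per-device state */
    unsigned char *rx_ring;
    unsigned char *tx_ring;
};

/* Stand-in for a hardware probe; pretend only revision 2 is supported. */
static int probe_hw(int hw_rev)
{
    return hw_rev == 2 ? 0 : -1;
}

/* Returns 0 on success, -1 on failure. On failure nothing stays allocated. */
int attach(struct softc *sc, int hw_rev)
{
    assert(sc != NULL);
    sc->rx_ring = NULL;
    sc->tx_ring = NULL;

    sc->rx_ring = malloc(4096);
    if (sc->rx_ring == NULL)
        goto fail;
    sc->tx_ring = malloc(4096);
    if (sc->tx_ring == NULL)
        goto fail;
    if (probe_hw(hw_rev) != 0)
        goto fail;              /* the leaky version just returns -1 here */
    return 0;

fail:
    free(sc->tx_ring);          /* free(NULL) is a no-op */
    free(sc->rx_ring);
    sc->tx_ring = NULL;
    sc->rx_ring = NULL;
    return -1;
}
```

Note that the fix lives in the driver. Restarting a leaky attach against the same unsupported hardware just runs the same failing path again.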

Let’s talk about another type of bug that’s particularly pernicious. A network driver, through some logic error, fails to program the ethernet chip correctly such that it’s not possible to receive new packets. This is not a crash-causing bug. It happens, then things go silent, and then a while later you wonder if something is wrong. What’s a microkernel going to do? ifconfig down; ifconfig up to reset the interface? Basically, yes, but I can do that today.
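The “go silent, then reset” logic is simple enough to sketch, and nothing about it needs a microkernel. The struct rx_watchdog type and rx_watchdog_tick function below are hypothetical, not taken from any real kernel; the caller is assumed to supply the interface’s receive counter on a periodic tick and to do the actual reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct rx_watchdog {            /* hypothetical, for illustration only */
    uint64_t last_rx_count;     /* receive counter seen at the previous tick */
    unsigned idle_ticks;        /* consecutive ticks with no new packets */
    unsigned limit;             /* ticks of silence tolerated before a reset */
};

/*
 * Call once per timer tick with the interface's current receive counter.
 * Returns true when the caller should reset the interface.
 */
bool rx_watchdog_tick(struct rx_watchdog *wd, uint64_t rx_count)
{
    assert(wd->limit > 0);
    if (rx_count != wd->last_rx_count) {
        wd->last_rx_count = rx_count;   /* traffic arrived; start over */
        wd->idle_ticks = 0;
        return false;
    }
    if (++wd->idle_ticks < wd->limit)
        return false;
    wd->idle_ticks = 0;                 /* count again after the reset */
    return true;
}
```

The catch is that silence is ambiguous: a quiet network looks exactly like a wedged chip, so all a watchdog like this can do is guess and reset, which is the same thing I’d do by hand.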

Further along these lines, there’s an assumption that bugs are distributed across drivers evenly. They aren’t: popular drivers are tested more and have fewer bugs. Speaking as an OpenBSD developer, there are certain models of laptop (Thinkpads) that work pretty flawlessly, and other models that work less well. This has something to do with hardware choices, but also a lot to do with the fact that practically every developer owns a Thinkpad. Bugs are found, bugs are fixed.

That’s part one. I’ve split part two into a separate post because it’s not strictly about kernel architecture.

Posted 06 Apr 2012 23:43 by tedu Updated: 09 Mar 2013 18:34
Tagged: software thoughts