
software reliability and utility

Following up on the previous post about software reliability, let’s consider the concept of utility functions. There are a few definitions of a utility function, but I’ll say it’s something along the lines of a function that maps the raw measure of something to its value to the user. A common example given for money is the natural log function, which explains why a dollar to a millionaire is worth less than a dollar to a poor person, even though the raw measure ($1) remains the same: log(11) - log(10) > log(1e6 + 1) - log(1e6). Law of diminishing returns.
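The inequality above is easy to check numerically. This is a minimal sketch, assuming natural-log utility and the same two wealth levels ($10 and $1e6); the function name `marginal_utility` is made up for illustration.

```python
import math

def marginal_utility(wealth, gain=1.0):
    """Extra log-utility from adding `gain` dollars to `wealth` dollars."""
    return math.log(wealth + gain) - math.log(wealth)

# The two cases from the text: someone with $10 and someone with $1e6.
poor = marginal_utility(10)
rich = marginal_utility(1e6)

print(f"one more dollar at $10:  {poor:.6f}")
print(f"one more dollar at $1e6: {rich:.6f}")
print("poor > rich:", poor > rich)
```

Under this model the same dollar is worth roughly 0.095 units of utility at $10 and roughly 0.000001 at a million, which is the diminishing-returns gap the example describes.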

If we measure software reliability in terms of mean time between failures (MTBF), my proposal for a utility function is floor(log10(MTBF)). It’s not just diminishing returns; it’s also a step function. Anything less than a 10x increase just doesn’t matter. So the argument is not that a microkernel that is 2x or 4x more reliable is only a little more valuable, it’s that it literally has no additional value.
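To see the step behavior, here is a rough sketch of the proposed function applied to a few multiples of a baseline. The 100-hour baseline and the choice of hours as the unit are assumptions for illustration only; the name `reliability_utility` is likewise hypothetical.

```python
import math

def reliability_utility(mtbf_hours):
    """Proposed utility: floor(log10(MTBF)), with MTBF assumed to be in hours."""
    return math.floor(math.log10(mtbf_hours))

# Hypothetical baseline MTBF, chosen at a power of ten so that the
# next step sits exactly 10x away.
baseline_hours = 100.0

utilities = {}
for factor in (1, 2, 4, 10):
    utilities[factor] = reliability_utility(baseline_hours * factor)
    print(f"{factor:>2}x baseline -> utility {utilities[factor]}")
```

With this baseline, 1x, 2x, and 4x all land on the same step, and only 10x moves to the next one. A baseline sitting just below a power of ten would cross a step with a smaller multiple; that is a side effect of where floor() draws its boundaries, not a change to the argument.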

Of course, if a microkernel were 10x more reliable, that would make it significantly better than existing systems. I just don’t believe it’s 10x. I don’t think it’s possible for it to be 10x better, because existing operating systems are already close to the limit of hardware reliability. If you want uptime, you throw hardware at the problem. You build a cluster. You leave the world of having a computer behind.

floor(log10(MTBF))

I haven’t justified this function. I don’t have a lot of evidence, but it feels about right. Pretend you have some necessary app you need for your business to run, but it crashes and refuses to run on Tuesdays and Thursdays (sadly not as absurd an example as it sounds). Ignoring weekends, that’s an uptime of 60%. Five nines, it is not. The MTBF is about 36 hours: 72 hours of uptime a week, split across two failures. This is not great software, but you’re getting by.

Now a new version is released and it promises to run on Tuesdays. Woohoo, uptime is now 80%, a 33% improvement. MTBF is now 96 hours, a whopping 167% improvement. Do you care? Probably not. You still have to remember when the program doesn’t work and schedule around that. And by the utility function, nothing changed: floor(log10(36)) and floor(log10(96)) are both 1.
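The numbers in this example can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: a five-day week of 24-hour days, weekends ignored, each down day counted as exactly one failure, and MTBF taken as uptime hours divided by failures. The helper `weekly_stats` is hypothetical.

```python
import math

HOURS_PER_DAY = 24
WORKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri"]  # weekends ignored, as in the text

def weekly_stats(down_days):
    """Return (uptime fraction, MTBF in hours) for one work week.

    Assumes each down day is a single failure and down days are not adjacent.
    """
    up_days = [d for d in WORKDAYS if d not in down_days]
    uptime_fraction = len(up_days) / len(WORKDAYS)
    mtbf_hours = len(up_days) * HOURS_PER_DAY / len(down_days)
    return uptime_fraction, mtbf_hours

old_uptime, old_mtbf = weekly_stats({"Tue", "Thu"})
new_uptime, new_mtbf = weekly_stats({"Thu"})

uptime_gain = new_uptime / old_uptime - 1  # relative improvement in uptime
mtbf_gain = new_mtbf / old_mtbf - 1        # relative improvement in MTBF

old_utility = math.floor(math.log10(old_mtbf))
new_utility = math.floor(math.log10(new_mtbf))

print(f"uptime: {old_uptime:.0%} -> {new_uptime:.0%} ({uptime_gain:+.0%})")
print(f"MTBF:   {old_mtbf:.0f}h -> {new_mtbf:.0f}h ({mtbf_gain:+.0%})")
print(f"floor(log10(MTBF)): {old_utility} -> {new_utility}")
```

Uptime and MTBF both improve noticeably, yet the step utility stays at 1 before and after, which matches the "do you care? probably not" conclusion.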

Totally contrived example, but I think it demonstrates the main point that software’s reliability and your own reliance on the software are correlated. Until the reliability improves to the point where you can rely on it in a fundamentally different way, it doesn’t matter. Reliance, not reliability, is the true measure of software’s value.

Posted 06 Apr 2012 23:45 by tedu Updated: 09 Mar 2013 18:19
Tagged: software thoughts