
against testing

I really dislike writing tests. There’s some amount of discomfort I’d be willing to sustain if I felt they were beneficial, but I also find they’re rarely worth the bother. Some reasons why. Most of this is probably applicable specifically to unit testing, but some bits apply to integration testing as well.

In order to be effective, a test needs to exist for some condition not handled by the code. However, typically when the same mind writes the code and the tests, the coverage overlaps. Errors arise from unexpected conditions, but due to their unexpected nature, these are the conditions which also go untested.

One way I like to consider this is to look back after a bug is found, and ask how many tests would have been needed to detect it. The minimal answer is of course one, but we can’t reasonably believe that this exact test would be the next one added. Instead, ask if we had doubled the testing, would we have found this bug? Doubled again? How many doublings of the testsuite are required to find each undetected bug?

If tests are not to preemptively detect bugs, but to prevent regressions, we have two cases. Preventing the first occurrence of a regression, during a rewrite or refactor, is mostly the same as the previous scenario: we must anticipate the ways in which the refactor may go wrong, and we suffer the same lack of coverage.

We also wish to prevent the recurrence of a fixed bug. The first occurrence of a bug is proof that it can happen, not just theoretically, and we should aim to keep it from recurring. Really embarrassing when a customer reports the same problem twice. In practice, I find this is a fairly small percentage of all bugs, and also indicative of other process problems, but it’s reasonable to test for. Still, I’d want to ask why we made the same mistake twice, not just why we failed to catch it.

Even tests which deliberately aim for edge conditions have a tendency to miss. An off by one test will fail to properly detect an off by one in the code. Testing x < limit and y < limit will have 100% coverage without testing x + y < limit.
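A concrete sketch of the kind of miss I mean, using a made up inbounds function and LIMIT: every individual boundary gets a test, and the combination doesn’t.

#include <assert.h>

#define LIMIT 100

/* intended contract: x, y, and x + y must all stay below LIMIT */
int
inbounds(int x, int y)
{
	return x < LIMIT && y < LIMIT;	/* the x + y check was forgotten */
}

int
main(void)
{
	/* each input is tested against the limit on its own; all pass */
	assert(inbounds(99, 0));
	assert(inbounds(0, 99));
	assert(!inbounds(100, 0));
	assert(!inbounds(0, 100));
	/* the case nobody wrote: inbounds(60, 60) should be rejected, but isn't */
	return 0;
}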

Tests are very brittle, breaking due to entirely innocuous changes in the code because they inadvertently embed very specific implementation dependencies. Test for a race condition? Actually a test of the scheduler’s round robin algorithm.
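A hedged sketch of what I mean, with a made up worker function: the assertion claims to check for a race, but what it really encodes is one particular scheduling order.

#include <assert.h>
#include <pthread.h>

static int order[2];
static int slot;

static void *
worker(void *arg)
{
	/* record the order in which the threads actually ran */
	order[__atomic_fetch_add(&slot, 1, __ATOMIC_SEQ_CST)] = *(int *)arg;
	return NULL;
}

int
main(void)
{
	pthread_t a, b;
	int one = 1, two = 2;

	pthread_create(&a, NULL, worker, &one);
	pthread_create(&b, NULL, worker, &two);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	/* meant to prove the updates don't race, but what it really asserts
	   is that the scheduler ran thread a before thread b */
	assert(order[0] == 1 && order[1] == 2);
	return 0;
}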

Every test breakage then requires an investigation to determine what’s really being tested, over and above whatever the test claims to check. This is equal in effort to determining what function a block of code actually performs, except the understanding gained is not useful for future development.

And then there’s the discussion of whether the implementation dependent test isn’t in fact helpful, because what if external users of the code have also come to depend on this behavior, despite its undocumented nature? If it’s important enough to test, it should be documented. But then that idea gets pushed back, because we need to keep it undocumented in case we want to change it. Except changing it is exactly what we’re trying to do, if only the test weren’t interfering.

In the other direction, the test very reliably checks that this one function works correctly, but not that it is used correctly. Returning a null result for a null input is easy to test, but the problem is the surrounding code calling the function, not the function under test.
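A minimal sketch, with a hypothetical finduser: the function’s null handling is tested and fine, the caller that forgets to check is not.

#include <assert.h>
#include <string.h>

struct user { const char *name; };

/* returns NULL for a NULL or unknown name; easy to test, and tested */
static struct user *
finduser(const char *name)
{
	static struct user root = { "root" };

	if (name == NULL || strcmp(name, "root") != 0)
		return NULL;
	return &root;
}

/* the part nothing tests: a caller that never checks for NULL */
static size_t
greeting_len(const char *name)
{
	/* "hello, " plus the user's name; crashes when finduser returns NULL */
	return strlen(finduser(name)->name) + 7;
}

int
main(void)
{
	/* the unit test for the function under test passes */
	assert(finduser(NULL) == NULL);
	/* greeting_len(NULL) would crash, but no test ever calls it that way */
	assert(greeting_len("root") == 11);
	return 0;
}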

Tests are also dependent on the testing platform, in a good way this time, but that means they have to actually be run on the affected platforms. A test that checks a function is endian neutral will fail to catch mistakes if it’s only ever run on little endian systems.
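A hedged sketch, with a made up decode_le32: the memcpy happens to produce the right answer on the machines the tests run on, so the assertion never complains.

#include <assert.h>
#include <stdint.h>
#include <string.h>

/* decodes a 32 bit little-endian value from the wire */
uint32_t
decode_le32(const unsigned char *p)
{
	uint32_t v;

	/* bug: copies host byte order, which is only correct on
	   little-endian machines; the portable version would assemble
	   the bytes: p[0] | p[1] << 8 | ... */
	memcpy(&v, p, sizeof(v));
	return v;
}

int
main(void)
{
	const unsigned char wire[4] = { 0x78, 0x56, 0x34, 0x12 };

	/* passes on every little-endian box the tests run on;
	   would fail on a big-endian one, but never runs there */
	assert(decode_le32(wire) == 0x12345678);
	return 0;
}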

Other tests grow obsolete because the target platform is retired, but not the test. There’s even less pressure to remove stale tests than useless code. Watch as the workaround for Windows XP is removed from the code, but not the test checking it still works.

Not all code is immediately amenable to testing, so making it testable requires changes to the code, possibly introducing several additional problems. The changes themselves may break the code. The added abstraction, replacing functional components with testing components, obfuscates the code. And finally, what happens when the testing-only RNG with a static seed, or the login bypass with a magic password, finds its way into the shipping product?
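A hedged sketch of the sort of scaffolding I mean, with made up names; the worry is precisely that the TESTING path or the bypass survives into a release build.

#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

void
seed_rng(void)
{
#ifdef TESTING
	srandom(1);			/* deterministic, so tests are reproducible */
#else
	srandom(time(NULL) ^ getpid());
#endif
}

int
check_password(const char *user, const char *pw)
{
	/* the "temporary" bypass for the test harness */
	if (strcmp(pw, "open-sesame-test") == 0)
		return 1;
	/* real verification elided */
	(void)user;
	return 0;
}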

Certainly, one approach is to just write good tests that don’t have these problems. If you like writing tests and are very good at it, you may continue to do so.

I’ll note that my poor experiences with testing were not the result of writing deliberately bad tests and being unhappy with the result. The tests mostly came from other developers who thought they were helping. People who see a lot of theoretical value in tests seem to have difficulty assessing the practical value of the tests they’ve actually delivered. This is true of many endeavors, but I think useless tests collect a great deal of inertia. What’s the harm in running a useless test or million?

I thought Dan Luu’s reaction to the reactions was interesting. I don’t know precisely what he thinks of this post, though I imagine he’d disagree (and he’s probably right), but I’m fascinated that people who like testing don’t agree what it’s for. This undoubtedly colors my perception of not liking it.

Finally, I should mention I quite like fuzzing, which does seem to uncover unexpected bugs.

case study

This is not meant to be exhaustive, but since I was dealing with a pair of fairly dumb bugs in about the same timeframe as this article was written, I thought they might be illustrative.

My mail client was sending emails with invalid Message-ID headers, ones without a domain component. This seems trivial to test for; however, the root cause was that the config was missing the value for the domain name. The setup function always (currently) sets this value when creating the database, and so such a test would always pass in a test environment. Unfortunately, the database I’m actually using has a more tortured origin story, with some ad hoc hand conversions from another format.
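Something like this hedged reconstruction (not the real code; msgid and the domain handling are stand-ins) shows why the obvious test proves nothing.

#include <assert.h>
#include <stdio.h>
#include <string.h>

/* hypothetical: in the real client, domain comes out of the config database */
static void
msgid(char *buf, size_t len, long serial, const char *domain)
{
	/* an empty domain quietly produces "<1234@>" */
	snprintf(buf, len, "<%ld@%s>", serial, domain ? domain : "");
}

int
main(void)
{
	char buf[128];

	/* the test fixture always creates a fresh config with a domain,
	   so this always passes; the hand converted database had none */
	msgid(buf, sizeof(buf), 1234, "example.org");
	assert(strchr(buf, '@') != NULL && strchr(buf, '@')[1] != '>');
	return 0;
}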

The second bug also dealt with the Message-ID header. In some circumstances, re-editing a draft could duplicate the header, due to some unforeseen interactions between the code for serialization and the code preparing a draft for editing. What test would have found this? The serialization code worked. The deserialization code worked. The code to prepare a draft worked. They even all worked together. Just not twice. Without a priori knowledge that it could fail in this particular way, would I have tested it? Not likely.

The true root cause, imo, for both bugs was that I was switching servers in a hurry, had a prototype for a new design I’d been noodling around with, and didn’t want to recreate my old working email setup just to replace it. So the prototype was pressed into service a few months before I would have switched had it not been for the server migration. That was a much more serious process failure than missing some tests.

related

The story of the one line fix. I think the intended moral is actually rather different, but I liked it.

Posted 07 Jul 2020 00:36 by tedu Updated: 22 Dec 2020 18:22
Tagged: programming thoughts