two and a half bad bits
It started with a simple feature addition. It always does. And then the murders began. I don’t think I’ve ever introduced so many bugs by changing so few bits.
the good bit
We want to add one new feature, to allow the admin to specify the rtable in login.conf. It’s not a very big change, barely more than three lines, to add this feature to setusercontext, the function used by most login code.
In addition to the code, we add one new bit, LOGIN_SETRTABLE, to the mask of all available bits, LOGIN_SETALL, which will automatically upgrade programs which call this function. So simple, so clean.
the bad bit
Our first complication is that doas does not use LOGIN_SETALL, preferring an artisanally curated selection of bits. But maybe it doesn’t need to. If we can convert doas to use all the bits, we won’t have to worry about compatibility with old or new header files. New headers, new bits; old headers, old bits. A quick review of the setusercontext manual reveals that this will add the LOGIN_SETENV bit, which won’t do much because doas has its own environment setting code, but it seems harmless. Even makes the code shorter!
LOGIN_SETENV seemed harmless enough, but it doesn’t take long for the first problem to arise. The doas regress fails, because it runs chrooted without a login.conf file. Why is this a problem when it’s never had a login.conf file before? Because the LOGIN_SETENV code fails in that case, instead of continuing quietly. This code was never exercised because none of the programs using this bit have ever been run in a chroot before now, leaving the latent bug lurking in libc. This is an easy fix.
the very bad bit
Every change to existing code is going to surface a hidden bug, that’s almost to be expected, but once we fix it we should be all set. Except no, now people are noticing that doas is calling setlogin, for which the manual contains a number of dire warnings about precautions you must take beforehand. doas doesn’t do any of those things, so it certainly should not be calling setlogin.
Alas, I totally failed to see the magic LOGIN_SETLOGIN bit in the setusercontext manual. This was just dumb. Maybe I should learn to count.
the half bad bit
Time to revisit the new bit we started with, LOGIN_SETRTABLE. The good news is it works as intended. The bad news is it doesn’t work as desired. If you run su or doas after switching rtable, the default behavior is now to switch back to rtable 0. This caused some collateral damage, making admin life more difficult. The fix is to only set the rtable if specified by login.conf, not by default.
This caught me by surprise, because it’s unlike the behavior with LOGIN_SETPRIORITY. In my mind, setrtable and setpriority seem like similar operations, but the desired semantics here are that they be inherited differently across logins.
Never change anything. You can only make it worse.
Most of the trouble came from my decision to switch doas to LOGIN_SETALL instead of simply adding the one bit I really wanted. I was seduced by the allure of the “elegance” of code that would be forwards and backwards compatible, but really, what practical gain was there to be had? It’s dumb in hindsight, but even at the time, I couldn’t have named a tangible benefit. Should have just stuck to simple, working code without trying to be clever.
I would say LOGIN_SETENV failing without login.conf is your run of the mill drinking age library bug, but that’s only true in Europe.
My thanks to all the OpenBSD developers who mopped up the mess.
Rewrite setusercontext in a language that doesn’t allow setting unwanted bits.