
rough code and working consensus

On their better days, standards groups follow a principle of rough consensus and working code. Somebody builds something, announces it to some friends and maybe a few competitors, and says, hey, if you build something similar, it’s possible for our implementations to interoperate. Everyone’s a winner. Sometimes the design isn’t perfect, but the fact that at least one person/group has built an implementation is an existence proof that it can be built. Valuable knowledge to have.

On their lesser days, standards groups follow a process that looks more like a political pork swap, trading favors and votes for pet features until the end result is a congealed mass of hopes and dreams. Then the committee reconvenes five years later to standardize whatever ended up getting built, trying to salvage the bits and pieces into a cohesive whole.

Coming into the u2k15 hackathon, some developers agreed we should avoid a full wchar_t conversion, while others were less certain. A perfect opportunity for a rough code and working consensus process. Throw out some code, convince others of its merits (or, in turn, be convinced of its demerits), and refine our plan.

wchar_t is, of course, the official way for C to handle unicode data, though I have my doubts as to whether it was standardized on one of the better days. A few objections. Wholesale wchar_t conversions can easily rototill half the code in a program, rendering future commands like cvs annotate nearly useless. Mindless conversions don’t address most of the problems, and worse, if the result of wcslen flows into malloc without adjustment, the results can be catastrophic.

The plan was to ignore everything one is supposed to know about the problem, start with some very simple patches, then build them up in response to objections. One test subject was ls, which has to somehow turn byte sequences on disk into printable representations of filenames. There was a fair bit of wheel reinvention, although some of the early drafts were admittedly square shaped. Along the way, people would inevitably ask “Why don’t we just use wchar_t again?” and then I’d have to go find a bug in one of the wchar_t functions in libc.
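To make the problem concrete, here's a sketch of the job ls faces. This is not the actual ls code, and the function name is made up; it's just the shape of the standard wchar_t-flavored answer people kept suggesting, using mbtowc and iswprint to decide what gets a question mark.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical sketch: copy a filename into a printable form,
 * substituting '?' for anything the locale says is unprintable
 * or undecodable. The caller provides an output buffer of at
 * least strlen(name) + 1 bytes. */
void
printable_name(const char *name, char *out)
{
	size_t n = strlen(name);
	size_t i = 0, o = 0;

	mbtowc(NULL, NULL, 0);			/* reset conversion state */
	while (i < n) {
		wchar_t wc;
		int len = mbtowc(&wc, name + i, n - i);
		if (len <= 0) {
			out[o++] = '?';		/* undecodable byte */
			i++;
			mbtowc(NULL, NULL, 0);	/* clear error state */
		} else if (!iswprint((wint_t)wc)) {
			out[o++] = '?';		/* control character, etc. */
			i += len;
		} else {
			memcpy(out + o, name + i, len);
			o += len;
			i += len;
		}
	}
	out[o] = '\0';
}
```

Note the state reset after every decode error; without it, one bad byte can ruin the rest of the name, which is exactly the sort of bug that kept turning up.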

For example, mblen has the interesting property that invalid inputs can “break” it, such that it remains “stuck” and will return errors even for future valid inputs until reset with a null string. The fastidious reader will note that the manual comes remarkably close to actually mentioning this possibility. A review of mblen usage in the wild, however, suggests that there aren’t very many fastidious readers. Probably not actually a problem with common encodings, but if one is going to use the all powerful all encompassing standard functions, one should be prepared for all the possible errors as well.

Speaking of errors, one of the questions we wrestled with in ls was how to handle sequences that aren’t valid utf-8. Do we print an error message and abort? Probably not. We print one question mark for each unprintable (control, etc.) character, but should an overlong sequence, such as using four bytes to encode a simple letter like ‘a’, print four question marks or one? There are a dozen such questions and wchar_t answers none of them. It doesn’t even allow asking some of them, since EILSEQ reveals so little about what went wrong. Intelligent processing requires dirty hands and an awareness of what’s happening at the byte level.
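What dirty hands buy you is legible failure. Here's a hypothetical hand-rolled decoder sketch that, unlike EILSEQ, can at least say what went wrong: malformed and overlong sequences get different return values, so a caller can choose a policy for each. (Illustration only; it skips surrogate and out-of-range checks.)

```c
#include <stddef.h>

/* Decode one utf-8 character from s (n bytes available).
 * Returns the code point and stores the byte length in *lenp,
 * or returns -1 for a malformed sequence, -2 for an overlong
 * one. */
long
utf8_decode(const unsigned char *s, size_t n, int *lenp)
{
	long cp;

	if (n >= 1 && s[0] < 0x80) {
		*lenp = 1;
		return s[0];
	}
	if (n >= 2 && (s[0] & 0xe0) == 0xc0 && (s[1] & 0xc0) == 0x80) {
		*lenp = 2;
		cp = ((long)(s[0] & 0x1f) << 6) | (s[1] & 0x3f);
		return cp < 0x80 ? -2 : cp;
	}
	if (n >= 3 && (s[0] & 0xf0) == 0xe0 && (s[1] & 0xc0) == 0x80 &&
	    (s[2] & 0xc0) == 0x80) {
		*lenp = 3;
		cp = ((long)(s[0] & 0x0f) << 12) |
		    ((long)(s[1] & 0x3f) << 6) | (s[2] & 0x3f);
		return cp < 0x800 ? -2 : cp;
	}
	if (n >= 4 && (s[0] & 0xf8) == 0xf0 && (s[1] & 0xc0) == 0x80 &&
	    (s[2] & 0xc0) == 0x80 && (s[3] & 0xc0) == 0x80) {
		*lenp = 4;
		cp = ((long)(s[0] & 0x07) << 18) |
		    ((long)(s[1] & 0x3f) << 12) |
		    ((long)(s[2] & 0x3f) << 6) | (s[3] & 0x3f);
		return cp < 0x10000 ? -2 : cp;
	}
	return -1;
}
```

With the -2 in hand, the four-question-marks-or-one debate at least becomes decidable per program, instead of being foreclosed by the API.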

Returning to the topic of consensus code, getting a dozen people in a room together is a great way to share ideas, but also leads to some strange spontaneous mutations. It seemed like as soon as someone would ask what came next after ls, someone would start speculating about printing punycode decoded names in ntpctl. Sometimes joking, sometimes in earnest. We had a bit of a dining philosophers’ situation with multiple developers waiting around for somebody to write the “good” utf-8 utility functions, but all assuming that the API would be to their liking. Throwing out some incomplete patches would allow us to see the API taking shape, and then backfill the implementation with correct decoding.

We’re not going to use any of the code I wrote, which I’m not at all sorry about. It was only a few hours effort, and perhaps helped break the ice. People asked questions which they probably would not have had we followed standard practice by rote. And I must admit, I was slowly swayed towards taking a more principled approach. Trying to peek at the bytes without decoding anything just isn’t viable in many cases.

One argument I probably lost was regarding the end to end principle, but I believe it illustrates that the C library functions were standardized too soon, or at least are a mismatch for our purposes. ls wants to avoid corrupting the terminal, so if root (or any user, really) accidentally types ls in a naughty user’s directory, they don’t get dancing octopi in their title bar or whatever else it is that control sequences do. To do this, ls uses some shifts and ors and masks to turn a byte stream into characters, inspects them, then shifts and ors and masks them back into bytes and finally punches the output into a fake typewriter. Then tmux reads some bytes, shifts and ors and masks, does its own evaluation of the worth of each character, shifts and ors and masks, second fake typewriter. xterm reads bytes, shifts and ors and masks, scans for dancing octopi instructions, and finally puts pixels on the screen. Seems ridiculous to have so many programs do so much redundant work. There’s also the assumption that if isatty returns true, I want the output mangled in some way. But the other end of my fake typewriter may not be an xterm; it could be some other process recording my session and maybe I want those raw bytes just as they are. What does iswprint know about what I want?

What does iswprint even know at all? Consider my favorite character for my favorite food, the burrito (🌯). I have a little diary in a text file, great-🌯s.txt. What happens when I run ls in an xterm? The old 7 bit classic edition ls prints it like this:

great-????s.txt

With a new and improved unicode ls, it prints like this:

great-s.txt

xterm doesn’t know how to print burritos. It prints nothing. iswprint lied.
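The lie is easy to reproduce. The function below (name invented for illustration) asks the question directly; on a typical system with a utf-8 locale, iswprint consults the locale's character tables, not the terminal's fonts, and will happily call the burrito printable regardless of what xterm draws. It assumes wchar_t is wide enough to hold U+1F32F, as on most unix-like systems.

```c
#include <locale.h>
#include <wctype.h>

/* Ask the C library whether the burrito (U+1F32F) is printable.
 * The answer comes from locale tables and says nothing about
 * whether the terminal can actually render it. */
int
burrito_printable(void)
{
	if (setlocale(LC_CTYPE, "C.UTF-8") == NULL)
		setlocale(LC_CTYPE, "en_US.UTF-8");
	return iswprint((wint_t)0x1f32f) != 0;
}
```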

Unfortunately, I don’t have an answer for what should happen, but it’s a demonstration that incorrect results still happen even when following the correct procedure. Unicode and wide characters may be an improvement over the dark ages of the 8 bit code pages, but the rough consensus seems to have arrived before the working code. It could also just be a bug in xterm. Either way, I remain unconvinced that utf-8 validation in every program is the right path.

The wide char and multibyte functions landed in the C standard because at some point there was consensus they would work. The interfaces are theoretically general enough (in some cases, crazy generalized) to allow solving all problems, even if that means it’s quite a bit of work to solve any particular problem. And yet the answers to some questions, like “dude, where’s my burrito?”, remain frustratingly out of reach.

Thinking about this a little more, I might characterize the problem as specific answers to vague questions and vague answers to specific questions. Vaguely, is this multibyte string printable? Specific answer, yes. Specifically, will this utf-8 encoding of a burrito display correctly or should I perhaps transliterate it as {BURRITO}? Vague answer, depends.

Lots more on utf-8 and unicode collected in utf-achtung.

Posted 17 Nov 2015 14:48 by tedu Updated: 19 Nov 2015 06:31
Tagged: openbsd programming thoughts