The previous post on rough code had some notes on a few of the issues we faced at ü2k15. I also collected some notes and links about utf-8 and unicode that weren’t directly OpenBSD related.
This post by Solar Designer covers the history of control codes and introduces some of the challenges posed by utf-8 support. There’s also a lot of detail in the followup email by Rich Felker, and the rest of the thread as well. Some more info about ANSI escape sequences, unicode terminals, and control sequences.
I could post about a dozen links solely devoted to unicode normalization, but not to worry, RFC 3454 contains some encouraging text:
Unicode normalization requires fairly large tables and somewhat complicated character reordering logic. The size and complexity should not be considered daunting except in the most restricted of environments.
See? It’s not so bad.
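For a feel of what those tables and that reordering logic buy you, here is a small sketch using Python's standard unicodedata module (Python is my choice for illustration, nothing from the RFC). The helper name same_text is made up. It shows one visible character having two encodings, and combining marks being sorted into a canonical order.

```python
import unicodedata

# Two spellings of the same visible character "ü".
composed = "\u00fc"      # single code point: U+00FC
decomposed = "u\u0308"   # "u" followed by U+0308 COMBINING DIAERESIS


def same_text(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The raw strings differ, in code points and in utf-8 byte length.
print(composed == decomposed)                                      # False
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))  # 2 3

# After normalization they compare equal.
print(same_text(composed, decomposed))                             # True

# The "character reordering logic": marks with different combining
# classes (dot above, dot below) are sorted into one canonical order.
dot_above_first = "a\u0307\u0323"
dot_below_first = "a\u0323\u0307"
print(unicodedata.normalize("NFD", dot_above_first)
      == unicodedata.normalize("NFD", dot_below_first))            # True
```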
Falsehoods programmers believe about names never gets old.
A personal story from the abyss. Sometimes Mojibake can be exploited for awesome. MySQL had (has?) a limit of 255 bytes per index, which causes trouble when trying to index over two VARCHAR(100) fields when the table encoding is set to utf-8, because the reserved space exceeds the limit. We want to store utf-8 so that Germans and Japanese can use the same software, but shrinking the size of the text field would mean comically long German names don’t fit. Japanese names, despite using more bytes per character, are usually much shorter. So what we really want is a utf-8 encoded field 100 bytes in size.
Unfortunately, this can be hard to come by because some people have really bought into the characters-only ideology, and working in bytes is wrongthink. The hack solution is to configure the field to an 8-bit encoding like latin-1 and pour utf-8 bytes into it (and hope/assume there’s minimal validation). So that’s what we did.
But then databases get dumped and restored and connection settings get upgraded and dozens of other things change, and suddenly the database starts helpfully translating latin-1 characters into utf-8, leading to double-encoded text. Fortunately, utf-8 is easily distinguishable from latin-1, and the process can be reversed.
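A minimal sketch of that reversal, in Python. This is not the code we actually ran; the function names double_encode and repair are invented for illustration. It also assumes the extra layer really is ISO 8859-1. MySQL's latin1 is closer to Windows-1252, so a real cleanup might need "cp1252" instead, plus some care with the bytes that encoding leaves undefined.

```python
def double_encode(text: str) -> str:
    """Simulate the accident: utf-8 bytes get read back as latin-1 characters."""
    return text.encode("utf-8").decode("latin-1")


def repair(text: str) -> str:
    """Undo one layer of double encoding.

    If the text cannot be mapped back to latin-1 bytes, or those bytes
    are not valid utf-8, assume it was never double encoded and return
    it unchanged. Plain ASCII passes through untouched either way.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


broken = double_encode("Müller")
print(broken)                         # MÃ¼ller
print(repair(broken))                 # Müller
print(repair("Müller"))               # Müller (already fine, left alone)
print(repair(double_encode("山田")))  # 山田
```

The check works because genuine latin-1 text with accented letters almost never forms valid utf-8 byte sequences, so the decode step fails loudly on text that was fine to begin with. It only peels off one layer; text that went through the wringer twice needs a second pass.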