
utf-achtung

The previous post on rough code had some notes on a few of the issues we faced at ü2k15. I also collected some notes and links about utf-8 and unicode that weren’t directly OpenBSD related.

This post by Solar Designer covers the history of control codes and introduces some of the challenges posed by utf-8 support. There’s a lot more detail in the followup email by Rich Felker, and in the rest of the thread. Some more info about ANSI escape sequences, unicode terminals, and control sequences.

Strings in Swift has some nice examples of utf-8 slicing and dicing. The suckless conference featured a presentation on utf-8 and unicode.

UTF-8 Everywhere.

I could post about a dozen links solely devoted to unicode normalization, but not to worry, RFC 3454 contains some encouraging text:

Unicode normalization requires fairly large tables
and somewhat complicated character reordering logic.
The size and complexity should not be considered
daunting except in the most restricted of environments.

See? It’s not so bad.
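For those of us outside the most restricted of environments, the fairly large tables and reordering logic come prepackaged. A minimal sketch using Python’s standard unicodedata module, showing why normalization matters when comparing strings:

    import unicodedata

    # "é" as one composed code point vs. "e" plus a combining accent
    composed = "\u00e9"      # é (U+00E9)
    decomposed = "e\u0301"   # e + U+0301 COMBINING ACUTE ACCENT

    print(composed == decomposed)    # False: different code point sequences
    print(unicodedata.normalize("NFC", composed) ==
          unicodedata.normalize("NFC", decomposed))    # True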

Falsehoods programmers believe about names never gets old.

Assorted issues. Dark corners of unicode. When monospaced fonts aren’t. Unicode mimicry. JavaScript has a unicode problem. There are some interesting nuggets in Unicode: good, bad, & ugly, but it requires navigating S5 slides.

A personal story from the abyss. Sometimes Mojibake can be exploited for awesome.

MySQL had (has?) a limit of 255 bytes per index, which causes trouble when trying to index over two VARCHAR(100) fields with the table encoding set to utf-8, because the reserved space exceeds the limit. We want to store utf-8 so that Germans and Japanese can use the same software, but shrinking the text field would mean comically long German names don’t fit. Japanese names, despite using more bytes per character, are usually much shorter. So what we really want is a utf-8 encoded field 100 bytes in size. Unfortunately, this can be hard to come by, because some people have really bought into the characters-only ideology, and working in bytes is wrongthink. The hack solution is to configure the field with an 8 bit encoding like latin-1 and pour utf-8 bytes into it (and hope/assume there’s minimal validation). So that’s what we did. But then databases get dumped and restored, connection settings get upgraded, and dozens of other things change, and suddenly the database starts helpfully translating latin-1 characters into utf-8, leading to double-encoded text. Fortunately, utf-8 is easily distinguishable from latin-1, and the process can be reversed.
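A minimal sketch of that reversal in Python, assuming exactly one round of latin-1 misinterpretation (undo_double_encoding is an illustrative name, not a library function):

    def undo_double_encoding(text):
        # "Müller" stored as utf-8 (0xc3 0xbc for the ü) and then
        # decoded as latin-1 comes back as "MÃ¼ller"; re-encoding as
        # latin-1 recovers the original bytes, which decode as utf-8
        try:
            return text.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            return text    # not (reversibly) double encoded

    print(undo_double_encoding("MÃ¼ller"))    # Müller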

Sometimes the above happens by accident. There’s an entire post devoted to it, fixing common unicode mistakes, which gave rise to the ftfy Python package.
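If memory serves, fix_text is ftfy’s main entry point; unlike the sketch above, it guesses the encoding damage heuristically rather than assuming a single latin-1 round:

    import ftfy    # pip install ftfy

    print(ftfy.fix_text("MÃ¼ller"))    # Müller, damage detected automatically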

Some utf-8 coding exercises. Building a fast utf8_strlen. Decoding utf-8 with a DFA.
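The naive version of the first exercise, before any of the speed tricks: utf-8 continuation bytes all match the bit pattern 10xxxxxx, so counting code points means counting the bytes that don’t. A Python sketch (mine, not the linked implementation):

    def utf8_strlen(data):
        # continuation bytes look like 0b10xxxxxx;
        # every other byte starts a new code point
        return sum(1 for b in data if (b & 0xc0) != 0x80)

    print(utf8_strlen("Müller".encode("utf-8")))    # 6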

Posted 19 Nov 2015 06:30 by tedu Updated: 11 May 2016 00:30
Tagged: software