Computers process numbers using arithmetic and logic (which amount to the same thing). Processing text, however, requires at least two levels of abstraction. Firstly, a definition of a textual atom — informally a character. Secondly, a definition of a textual unit — typically called a string. A string is an ordered list of characters.

Therein lies a whole field of computer science. From a practical point of view, implementing text has become much easier with Unicode. Different character sets were one of the more awful aspects of dealing with text. (Remember CP-1252?)

While text processing really does boil down to strings made of characters, of course it’s a lot more complicated than that. As I said, it’s a whole field of CS. I’m not going to get into that at all in this post. Here I just want to talk about the atoms — the characters. They’re plenty complicated enough all on their own.

In particular I want to talk about Unicode, a universal standard for representing the characters that make up strings that applications use to process text. It’s an open-ended clever approach to an old computer problem. (If only we could do something equally clever with dates and times.)

§

The basic idea is a map between the characters we want to represent and the numbers the computer uses. This raises immediate physical limits — computers have a limit on how large individual numbers can be. Worse, transmission of computer numbers often involves comparatively small individual numbers (typically just binary octets).

Part of the challenge of the character-number map is that text atoms (characters) and computer atoms (numbers) don’t always match as nicely as we might hope.

In the abstract, we map A=1, B=2, C=3,… and so on for all the characters we need. That scheme worked fine with Western languages. An alphabet of 26 two-case characters, ten digits, a bunch of punctuation symbols, some control codes for tabs and line endings and such, and that’s only 100 mappings or so. An octet has 256 possible values (0 through 255), which leaves room for fancy characters for drawing.

When the early dust settled, most systems in these parts used ASCII (“ass-key”), the American Standard Code for Information Interchange. Note that leading term: American. It totally was. But European languages needed slightly different mappings for ácçentéd characters and other aspects literally foreign to USAnians.

So many variations sprang up. Computers that had to handle text created under different mappings often used a protocol called a code page. This mainly mattered when displaying characters — we want the numbers to map to the correct glyphs. However some languages have special rules or special character handling, so code pages can also apply to character handling.

The situation became more complicated with Eastern languages, which had more characters than octets could count. Tens of thousands are typical; some have many more. Early schemes used shift codes and/or multi-byte formulations to fit such large character sets.

It was, per the old curse, an interesting time for developers and administrators, even for users. It was often impossible to display a document from a different country.

§

And then along came Unicode. Change is hard, and there were growing pains, but at this point code page pain is pretty much a thing of the past. Unicode has been so successful it’s become a worldwide default.

The basic idea is simple and can be stated in less than 25 words:

Unicode seeks to map all glyphs of all languages to unique integer Codepoints along with various 32-bit, 16-bit and 8-bit physical encodings.

That’s 22 words from a white paper The Company asked me to write as our IT department was in growing pains over Unicode. The symptoms of their pain were appearances like this:

â€śÂˇHola! â€” ÂżQue pasa?â€¦â€ť

That ought to look like this:

“¡Hola! — ¿Que pasa?…”

The problem is copy-pasting text between apps that don’t treat the UTF-8 Unicode string like a UTF-8 Unicode string. That’s what that first version is — not line noise, but special sequences you’re not supposed to see. They are supposed to be interpreted as the (only slightly) fancy characters in the second version.
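Here is a minimal Python sketch of how that kind of damage arises. It assumes the receiving app decoded the UTF-8 bytes as Windows-1250 (Python's `cp1250` codec); that is my guess from the characters in the garbled sample, not something I can confirm about the apps involved.

```python
# Assumption: the receiving app treated UTF-8 bytes as Windows-1250 (cp1250).
original = "“¡Hola! — ¿Que pasa?…”"

utf8_bytes = original.encode("utf-8")   # the bytes that were actually copied
garbled = utf8_bytes.decode("cp1250")   # what the wrong decoder displays

print(garbled)

# The damage is reversible as long as no bytes were lost along the way:
# turn the wrong characters back into bytes, then decode them correctly.
repaired = garbled.encode("cp1250").decode("utf-8")
print(repaired == original)
```

Every multi-byte UTF-8 sequence shows up as two or three unrelated characters, which is why the garbled version is longer than the original.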

When that happens to text from Eastern languages, it often looks entirely like line noise:

U¨µVv¨Õ¯aY×rWq®WØþU¾hVºÆU¬»Õ¯¤

But should look like (if you don’t see Chinese characters, check your display and browser settings):

Which according to Google Translate is “Hello? What’s Up?” in Chinese (Simplified).

The trick is understanding what’s happening and why. That requires knowing a bit about Unicode.

§ §

So Unicode seeks to bring all the characters of the world under its umbrella. All means all, as inclusive as possible. It’s a lot of different characters. Unicode handles this by defining four layers of abstraction.

Firstly, the Abstract Character Repertoire (ACR). This consists of the unordered set of characters included in the standard. As the name says, this layer is abstract. The characters are described in words rather than drawn. For instance, when new emojis are added to the standard, they’re first added here as descriptions (“stop button” or “sailboat”).

Secondly, the Coded Character Set (CCS). This layer assigns a unique non-negative integer value (called a code point) to each character in the ACR. Note this is also an abstract layer. It’s just a map of characters to numbers with no reference to the size of those numbers. The only requirement is they begin with zero. Gaps are allowed; the numbering doesn’t have to be contiguous.

Thirdly, the Character Encoding Form (CEF). This layer maps the abstract CCS numbers to physical machine widths. This layer defines how Unicode is represented in different machine forms. When discussing how Unicode is stored in a Database or in application code, the discussion usually involves an Encoding Form.

Fourthly, the Character Encoding Scheme (CES) maps the values produced by an Encoding Form onto 8-bit byte streams. Schemes typically matter when considering file storage and network transport, both of which are typically byte-oriented.

To illustrate the difference between CE Forms and CE Schemes, compare UTF-16 to UTF-16LE and UTF-16BE. The first one is a Form that maps codepoints to 16-bit values. The second two are Schemes that involve the same map, but due to hardware requirements, must specify “little endian” or “big endian” byte order respectively. A similar difference exists between UTF-32 and UTF-32LE and UTF-32BE. The Scheme matters when it comes to computer architecture, especially memory, but it matters in anything that treats larger numbers in chunks of octets.
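A quick way to see the byte-order difference is to encode one character with both Schemes. This sketch uses Python's built-in `utf-16-be` and `utf-16-le` codecs; the choice of U+2019 as the sample character is mine.

```python
ch = "\u2019"  # RIGHT SINGLE QUOTATION MARK, code point U+2019

big = ch.encode("utf-16-be")     # most significant byte first
little = ch.encode("utf-16-le")  # least significant byte first

print(big.hex(" "))     # 20 19
print(little.hex(" "))  # 19 20
```

Same 16-bit value, same map; only the order of the two octets differs.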

§ §

There aren’t too many 16-bit systems any more (we’re well on our way to 64-bit systems these days); 32-bit systems are fairly standard. UTF-32 is a natural fit for 32-bit systems. Unicode requires 21 bits to encode the whole CCS, so it fits easily into UTF-32; it’s the “native” encoding Form for Unicode. The UTF-32 codes are just the Unicode code points. These days only über-geeks need to deal with how the machine treats Unicode.
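As a small check of the claim that UTF-32 values are just the code points, here is a Python sketch. The helper name `utf32_value` is made up for illustration.

```python
def utf32_value(ch: str) -> int:
    """Return the single 32-bit UTF-32 value for one character."""
    return int.from_bytes(ch.encode("utf-32-be"), "big")

for ch in ["A", "\u2019", "\U0001F600"]:
    # ord() gives the code point; the UTF-32 value should match it exactly.
    print(f"U+{ord(ch):04X}", utf32_value(ch) == ord(ch))

# The highest code point, U+10FFFF, needs 21 bits.
print((0x10FFFF).bit_length())
```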

But communication between systems is generally based on octets; the internet is based on octets. That makes UTF-8 the most common text format in town. It’s well worth any programmer taking the time to understand how it works. It uses a very clever way of stuffing 21-bit characters into 8-bit chunks.

§

Unicode does one very American thing: it preserves the original 128 7-bit ASCII codes. These map directly to their code points. For instance, an ASCII “A” has a bit pattern that evaluates to 65, and its Unicode code point is also 65. This means old ASCII files can be read as if they were Unicode files.
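In Python terms, the compatibility looks roughly like this (a sketch; the sample text is arbitrary):

```python
text = "A plain ASCII line.\n"

print(ord("A"))  # code point of "A"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For pure 7-bit text the two encodings produce identical bytes,
# so an old ASCII file decodes cleanly as UTF-8.
print(ascii_bytes == utf8_bytes)
print(ascii_bytes.decode("utf-8") == text)
```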

The gotcha is that most systems use 256-code 8-bit ASCII, which doubled the characters, but defined the extra ones ad hoc. Those sleek looking quote characters, for instance, are from the 8-bit set, not the 7-bit set. Text with 8-bit ASCII can sometimes be a problem. If you’ve ever had the apostrophes in contractions change to something weird, that’s why.
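To make that concrete, here is a sketch that assumes the 8-bit set in question is Windows-1252 (CP-1252), where the curly apostrophe is the single byte 0x92. Other 8-bit sets put different characters there.

```python
# Assumption: the legacy text was saved as Windows-1252 (cp1252).
legacy = b"didn\x92t"

print(legacy.decode("cp1252"))  # the curly apostrophe comes back

# The same bytes are not valid UTF-8: 0x92 has its eighth bit set
# but is not part of a properly formed sequence.
try:
    legacy.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(valid_utf8)
```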

But the first seven bits map directly to UTF-8 bytes with the same 128 7-bit codes. The key here is that the eighth bit is zero. When that bit is one, things change. Here’s the basic breakdown (bits shown on left, hexadecimal byte values in middle, description on right):

```
0bbb.bbbb - 00-7F - ASCII 7-bit codes
10xx.xxxx - 80-BF - UTF-8 sequence continuation bytes
11{n}0{x} - C0-FD - UTF-8 sequence begin byte
```

The last one takes the most explaining. The second one will be more obvious once you understand the last one. The key to both is the eighth bit set to one; that indicates they’re part of a sequence of two or more (up to six) bytes containing the bits of a Unicode character spread across them (up to 21 bits).

The key difference is in the seventh bit. The continuation bytes have their seventh bit set to zero. In them, the remaining six bits are payload bits — the bits of the Unicode character encoded in the sequence.

If the seventh bit is set to one, it indicates the beginning of a sequence. The {n}0 means there are zero-or-more additional bits set to one followed by a bit set to zero. The {x} indicates that any remaining bits are available for payload. Here’s the trick: the number of leading one-bits is the number of bytes in the sequence.

Since a sequence can run from two to six bytes, there are five possible starting-byte patterns:

```
110x.xxxx - C0-DF - two-byte sequence
1110.xxxx - E0-EF - three-byte sequence
1111.0xxx - F0-F7 - four-byte sequence
1111.10xx - F8-FB - five-byte sequence
1111.110x - FC-FD - six-byte sequence
```

Each is followed by the appropriate number of continuation bytes. That number is determined by how large the code point is: leading zeros of the 21-bit value are stripped, and the most significant one-bit determines the number of bits needed.
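The leading-ones rule is easy to sketch in code. The function names `leading_ones` and `describe` are made up for illustration, and the sketch follows the original six-byte design described here rather than checking which byte values today's UTF-8 actually permits.

```python
def leading_ones(byte: int) -> int:
    """Count the one-bits at the top of an 8-bit value."""
    count = 0
    mask = 0x80
    while byte & mask:
        count += 1
        mask >>= 1
    return count

def describe(byte: int) -> str:
    """Classify a byte by its leading one-bits."""
    n = leading_ones(byte)
    if n == 0:
        return "ASCII"
    if n == 1:
        return "continuation"
    if n <= 6:
        return f"start of {n}-byte sequence"
    return "invalid"

for b in (0x41, 0xE2, 0x80, 0x99):
    print(f"{b:02X}", describe(b))
```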

Code points beyond seven-bit ASCII, the vast bulk of the Unicode character set, map to UTF-8 like this (code points are in hexadecimal):

```
from 0000.0080 to 0000.07FF - two-byte sequence
from 0000.0800 to 0000.FFFF - three-byte sequence
from 0001.0000 to 001F.FFFF - four-byte sequence
from 0020.0000 to 03FF.FFFF - five-byte sequence
from 0400.0000 to 7FFF.FFFF - six-byte sequence
```

So it can handle codes up to 31 bits, which gives it plenty of room for expansion as Unicode grows. At its current size, 21 bits max, the longest possible UTF-8 sequence is four bytes.
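The range table translates directly into a lookup. This is a sketch with a made-up function name, `utf8_length`, checked against Python's own encoder for a few sample characters of my choosing.

```python
def utf8_length(code_point: int) -> int:
    """Number of UTF-8 bytes for a code point, per the ranges above."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    if code_point < 0x200000:
        return 4
    if code_point < 0x4000000:
        return 5
    return 6

for ch in ["A", "\u00E9", "\u2019", "\U0001F600"]:
    # Compare the table's answer with what Python's encoder produces.
    print(f"U+{ord(ch):04X}", utf8_length(ord(ch)), len(ch.encode("utf-8")))
```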

§

We can put this all together with a simple example. Let’s take the case where “I didn’t imagine that!” mysteriously turns into “I didnâ€™t imagine that!” What happened to that slick-looking apostrophe, and what are those several weird characters that replaced it?

It’s because that slick apostrophe isn’t an ASCII character, it’s a Unicode character, so when using UTF-8 it has to be encoded. We start with the Unicode code point for RIGHT SINGLE QUOTATION MARK, which is U+2019 (in hexadecimal; note the special formatting for Unicode code points) — the decimal numeric value is 8217. More importantly, the binary value is 10.0000.0001.1001, and those 14 bits are what we need to spread across multiple UTF-8 sequence bytes.

From the chart above, 2019 falls between 0800 and FFFF, so this will be a three-byte sequence. So the first byte has the form 1110.xxxx and the two following bytes have the form 10xx.xxxx. The resulting 16 “x” bits are populated with the codepoint bits starting at the least significant bit (far right). The unused two “x” bits are set to zero. The result is (bits on left; hexadecimal bytes on right):

`1110.0010 1000.0000 1001.1001 - E2 80 99`

Those three bytes — interpreted as some form of extended ASCII — appear as the strange “â€™” characters that replace the nice apostrophe.
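The same bit-shuffling can be sketched by hand in Python. The function name `encode_three_bytes` is made up, and it only covers the three-byte case. I'm also assuming the "extended ASCII" doing the misreading is Windows-1252.

```python
def encode_three_bytes(code_point: int) -> bytes:
    """Hand-pack a code point from 0800 to FFFF into three UTF-8 bytes."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("not in the three-byte range")
    b1 = 0b1110_0000 | (code_point >> 12)            # top 4 payload bits
    b2 = 0b1000_0000 | ((code_point >> 6) & 0x3F)    # middle 6 payload bits
    b3 = 0b1000_0000 | (code_point & 0x3F)           # bottom 6 payload bits
    return bytes([b1, b2, b3])

encoded = encode_three_bytes(0x2019)
print(encoded.hex(" ").upper())             # E2 80 99
print(encoded == "\u2019".encode("utf-8"))  # agrees with Python's encoder

# Assumption: the misreading app used Windows-1252 (cp1252).
print(encoded.decode("cp1252"))
```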

You can of course reverse the process to recover the original Unicode character from a sequence of UTF-8 bytes.
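Going the other way is just masking off the marker bits and gluing the payload back together. Again a sketch for the three-byte case only, with a made-up function name, `decode_three_bytes`.

```python
def decode_three_bytes(data: bytes) -> int:
    """Recover the code point from one three-byte UTF-8 sequence."""
    b1, b2, b3 = data
    if b1 & 0xF0 != 0xE0 or b2 & 0xC0 != 0x80 or b3 & 0xC0 != 0x80:
        raise ValueError("not a three-byte UTF-8 sequence")
    # Keep 4 payload bits from the start byte, 6 from each continuation byte.
    return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)

code_point = decode_three_bytes(bytes.fromhex("E28099"))
print(f"U+{code_point:04X}")       # U+2019
print(chr(code_point) == "\u2019")
```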

§ §

And that’s pretty much what UTF-8 is all about. There’s a detail about byte order marks (BOM) that I won’t get into here. Maybe another time.

Be thankful for Unicode. It made at least one aspect of using computers a whole lot easier and vastly more inclusive for all. A very nice win!

Now if we only had Unidate…