Episode 3.09 – UTF-8 Encoding and Unicode Code Points

Welcome to the Geek Author series on Computer Organization and Design Fundamentals. I’m David Tarnoff, and in this series we are working our way through the topics of Computer Organization, Computer Architecture, Digital Design, and Embedded System Design. If you’re interested in the inner workings of a computer, then you’re in the right place. The only background you’ll need for this series is an understanding of integer math, and if possible, a little experience with a programming language such as Java. And one more thing. Our topics involve a bit of figuring, so it might help to keep a pencil and paper handy.

Have you ever wondered why your word processor or web application placed an odd-looking character, an empty box, or a numeric code where a character should be in your document? Since computers are only capable of working with numbers, that document was delivered as a sequence of binary codes, and sometimes, the encoding scheme used by the document’s creator doesn’t map properly to the encoding scheme used to read the document.

In episode 3.8, we discussed how computers store symbols such as letters and punctuation by assigning a unique number to each symbol. We also presented one character encoding scheme called ASCII that used a 7-bit pattern to represent thirty-two punctuation marks and mathematical symbols, the space, the ten decimal digits, the twenty-six uppercase and twenty-six lowercase letters of the Latin alphabet, and thirty-three unprintable control codes defining things like a carriage return, a delete, an escape, and an audible bell.

Since ASCII was standardized, the character encoding requirements for computing applications have grown well beyond what ASCII could support. The internationalization of computing products and services required encoding schemes to represent characters from languages other than those written using the Latin alphabet. As other forms of digital exchange such as text messaging began to appear, encoding schemes also needed to include non-alphanumeric characters such as those tiny digital images we call emojis. By the time the need for expanded character encoding schemes was recognized, ASCII was well-entrenched, and a conversion to a more robust encoding method had the potential to break a lot of existing code.1 We will see later how the fact that ASCII began as a 7-bit encoding method was a blessing in a world that typically stores data elements in 8-bit memory locations.

In March of 2019, The Unicode Consortium released Unicode Version 12.0. To give you an idea of the extensiveness of this standard, it builds upon earlier versions by adding character sets, including four new scripts used to write languages such as historic Aramaic and Sanskrit, and 61 new emoji characters including a yawning face, a person using sign language, a guide dog, a skunk, and a safety vest.2 Unicode, by the way, is not itself an encoding scheme. Unicode is an enormous set of characters, control codes, and symbols that have been assigned identifiers. These identifiers are referred to as code points. In the Unicode codespace, the range of integers representing code points goes from 0 to 0x10FFFF, allowing for over a million code points.3 The conventional notation to represent a Unicode value is a capital U followed by a plus sign followed by the integer value in hexadecimal. Typically, this hexadecimal value is padded with leading zeros to at least four digits.
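For those who like to follow along in code, here is a minimal sketch in Java, the language we suggested as background for this series. The class name and the choice of example character are just for illustration; it simply prints the top of the Unicode codespace and the U+ notation for one code point.

    public class CodePoints {
        public static void main(String[] args) {
            // Java exposes the top of the Unicode codespace as a constant.
            System.out.printf("Largest code point: U+%X%n", Character.MAX_CODE_POINT); // U+10FFFF

            // The conventional notation: a capital U, a plus sign, and the code
            // point in hexadecimal, padded to at least four digits.
            String s = "\u2265";                              // greater than or equal to
            System.out.printf("U+%04X%n", s.codePointAt(0));  // prints U+2265
        }
    }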

Let’s get back to the encoding. There are two ways to create an encoding scheme to address this increase in the number of code points: either expand the number of bits used to represent a code point so that all code points have the same fixed size or develop a scheme that supports a variable number of bytes to represent a code point. Each has its benefits, but one of the most commonly used schemes is variable length because of a concept called backward compatibility. A system or scheme is backwards compatible if it is capable of exchanging information with older legacy systems. For any expanded encoding scheme to be backwards compatible with ASCII, it must be capable of interpreting ASCII code points and getting identical results. This means that a backwards compatible scheme must use a variable number of bytes so that it can support the original ASCII code points stored as single bytes along with the vast array of new code points required by today’s software.

One of the most popular modern encoding schemes is a variable-width encoding called UTF-8. UTF-8, which is only one of the Unicode Transformation Formats, maps the more than one million possible code points defined by Unicode to binary patterns of one, two, three, or four bytes that can then be stored or transmitted. The first bit of a UTF-8 pattern acts as a flag. If this flag is a zero, then the remaining seven bits of that single byte hold a code point using the original 7-bit ASCII encoding. By matching the first 128 codes to the patterns defined by 7-bit ASCII, UTF-8 correctly interprets any data encoded in 7-bit ASCII, and hence, is backwards compatible. As a side note, HTML files are typically stored in plain text using the Latin alphabet. That means that a web page encoded in UTF-8 will transmit in about half the time of the same file encoded with UTF-16 (a two- or four-byte variable-length encoding) and about a quarter of the time of the same file encoded with UTF-32 (a four-byte fixed-width encoding).1
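Here is a quick sketch of that backwards compatibility in Java. For text limited to 7-bit ASCII, the UTF-8 bytes come out identical to the ASCII bytes; the class name and sample string are just for illustration.

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class AsciiCompatibility {
        public static void main(String[] args) {
            // For characters in the 7-bit ASCII range, UTF-8 produces the
            // exact same single-byte patterns that ASCII does.
            String text = "plain ASCII text";
            byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
            byte[] utf8  = text.getBytes(StandardCharsets.UTF_8);
            System.out.println(Arrays.equals(ascii, utf8)); // prints true
        }
    }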

If the first bit of the pattern is not a zero, then both the first and second bit of the pattern will be one. We will describe why both bits must be one a little later. If the UTF-8 pattern is two bytes long, then the pattern will start with two ones followed by a zero. A three-byte UTF-8 pattern starts with three ones followed by a zero. A four-byte UTF-8 pattern starts with four ones followed by a zero. The bits in the first byte that follow the zero are available to store the most significant bits of the code point. Every byte of the pattern that follows the starting byte will begin with a binary one zero. This is to distinguish it from any of the possible first byte patterns. This means that the second, third, or fourth byte of a UTF-8 pattern each has six bits available to store a code point.
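To make those leading-bit rules concrete, here is a small sketch of how a decoder might classify a byte by its upper bits and report the length of the sequence that the byte starts. The class and method names are just for illustration.

    public class LeadByte {
        // Returns the total number of bytes in the UTF-8 pattern that begins
        // with this byte, or -1 if this is a continuation byte (10xxxxxx).
        static int sequenceLength(int b) {
            if ((b & 0b1000_0000) == 0b0000_0000) return 1; // 0xxxxxxx: one byte (ASCII)
            if ((b & 0b1110_0000) == 0b1100_0000) return 2; // 110xxxxx: two bytes
            if ((b & 0b1111_0000) == 0b1110_0000) return 3; // 1110xxxx: three bytes
            if ((b & 0b1111_1000) == 0b1111_0000) return 4; // 11110xxx: four bytes
            return -1;                                      // 10xxxxxx: continuation byte
        }

        public static void main(String[] args) {
            System.out.println(sequenceLength(0x41)); // 1, an ASCII 'A'
            System.out.println(sequenceLength(0xE2)); // 3
            System.out.println(sequenceLength(0x89)); // -1, a continuation byte
        }
    }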

For example, the two-byte UTF-8 pattern always has a first byte that starts with 110 followed by five code point bits and a second byte that starts with 10 followed by six code point bits. This gives us five plus six or eleven bits for the code point. 2^11 equals 2,048, which means that there are 2,048 possible patterns of ones and zeros that can be used with two-byte encoding. Since single-byte encoding takes care of the code points from 0 to 127, two-byte encoding takes care of the code points from 128 to 2,047, or 0x80 to 0x7FF.
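Here is a sketch of that two-byte packing in Java, using the code point 0x3C0, which falls in the two-byte range and which we will meet again in the decoding example later in the episode. The variable names are just for illustration.

    public class TwoByteUtf8 {
        public static void main(String[] args) {
            int codePoint = 0x3C0;                          // falls in the 0x80 to 0x7FF range
            int byte1 = 0b1100_0000 | (codePoint >> 6);     // 110 plus the top five bits
            int byte2 = 0b1000_0000 | (codePoint & 0x3F);   // 10 plus the bottom six bits
            System.out.printf("%02X %02X%n", byte1, byte2); // prints CF 80
        }
    }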

The three-byte UTF-8 pattern always has a first byte that starts with 1110 followed by four code point bits. The next two bytes both start with 10 followed by six code point bits each. This leaves four plus six plus six or sixteen bits for the actual encoding. 2^16 equals 65,536, which means that there are 65,536 possible patterns of ones and zeros that can be used with three-byte encoding. Since one- and two-byte encodings take care of the code points from 0 to 2,047, three-byte encoding takes care of the code points from 2,048 to 65,535, or 0x800 to 0xFFFF.

The four-byte UTF-8 pattern always has a first byte that starts with 11110 followed by three code point bits. Each of the next three bytes start with 10 followed by six code point bits. This leaves three plus six plus six plus six or twenty-one bits for four-byte encoding. 2^21 equals 2,097,152, which means that there are 2,097,152 possible patterns of ones and zeros that can be used for the encodings. Since one-, two-, and three-byte encodings take care of the code points from 0 to 65,535, four-byte encoding takes care of the code points beyond that. In hexadecimal, Unicode places the range of four-byte encodings from 0x10000 to 0x10FFFF.
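Putting all four ranges together, here is a sketch of a complete encoder in Java; it uses the code point 0x2265, which we will work through by hand in just a moment. This is only a sketch under the ranges described above, not a production implementation, and the class and method names are just for illustration. A real encoder would also reject the surrogate code points that Unicode reserves for UTF-16.

    public class Utf8Encoder {
        // Packs a Unicode code point into one, two, three, or four UTF-8 bytes
        // using the prefixes and ranges described above. Surrogate code points
        // (0xD800 to 0xDFFF) are not checked here to keep the sketch short.
        static byte[] encode(int cp) {
            if (cp <= 0x7F) {
                return new byte[] { (byte) cp };
            } else if (cp <= 0x7FF) {
                return new byte[] { (byte) (0xC0 | (cp >> 6)),
                                    (byte) (0x80 | (cp & 0x3F)) };
            } else if (cp <= 0xFFFF) {
                return new byte[] { (byte) (0xE0 | (cp >> 12)),
                                    (byte) (0x80 | ((cp >> 6) & 0x3F)),
                                    (byte) (0x80 | (cp & 0x3F)) };
            } else if (cp <= 0x10FFFF) {
                return new byte[] { (byte) (0xF0 | (cp >> 18)),
                                    (byte) (0x80 | ((cp >> 12) & 0x3F)),
                                    (byte) (0x80 | ((cp >> 6) & 0x3F)),
                                    (byte) (0x80 | (cp & 0x3F)) };
            }
            throw new IllegalArgumentException("outside the Unicode codespace");
        }

        public static void main(String[] args) {
            for (byte b : encode(0x2265)) {
                System.out.printf("%02X ", b & 0xFF); // prints E2 89 A5
            }
            System.out.println();
        }
    }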

Note that for two-, three-, and four-byte UTF-8 patterns, every byte coming after the initial byte begins with 10. Since one-byte UTF-8 always starts with 0 and two-, three-, and four-byte UTF-8 patterns always start with 11, it is impossible to mistake one of the subsequent bytes for the first byte of a pattern. This allows us to quickly resynchronize with a byte stream within a single code point in case we’ve lost our position.4 It also means that we can receive the bytes in reverse order and still decode the message by assembling the bytes starting with 10 until we receive the byte starting with 11, which will tell us how to interpret the data. We just need to know ahead of time that the bytes will be sent in reverse order.
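Here is a small sketch of that resynchronization idea: starting from an arbitrary position in a UTF-8 byte stream, skip forward past any bytes matching 10xxxxxx until we land on the first byte of a code point. The method name and sample bytes (the three bytes of the greater than or equal to symbol followed by an ASCII equals sign) are just for illustration.

    public class Resync {
        // Scans forward from pos until it finds a byte that is not a
        // continuation byte, that is, a byte that does not match 10xxxxxx.
        static int nextLeadByte(byte[] data, int pos) {
            while (pos < data.length && (data[pos] & 0xC0) == 0x80) {
                pos++; // continuation byte, keep scanning
            }
            return pos;
        }

        public static void main(String[] args) {
            byte[] utf8 = { (byte) 0xE2, (byte) 0x89, (byte) 0xA5, 0x3D }; // ">=" symbol, then '='
            System.out.println(nextLeadByte(utf8, 1)); // prints 3, the start of '='
        }
    }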

It’s time for some examples. First, let’s see what the mathematical symbol “greater than or equal to” looks like in UTF-8. We begin by determining its Unicode code point. If we go to the mathematical operators Unicode code chart, we see that the hexadecimal code for greater than or equal to is 0x2265.5 By examining the ranges for one-, two-, three-, and four-byte UTF-8 patterns, we see that 0x2265 falls in the three-byte range, which is from 0x800 to 0xFFFF. That means we need to fill in the sixteen available bit positions of the three-byte pattern with the sixteen least significant bits of the code point 0x2265, which are 0010 0010 0110 0101.

The rightmost byte of the three-byte pattern will take the six least significant bits of the code point, which in this case are 100101. That means that the rightmost byte of the pattern will start with 10 and be followed by the bits 100101, giving us the byte 0xA5. Next, we look at the middle byte of the pattern. This UTF-8 byte will also start with 10. The 10 is then followed by the next six least significant bits of the code point, which are 001001. This gives us the value 0x89 for the middle byte. The most significant UTF-8 byte starts with 1110, which is followed by the four most significant bits of the code point: 0010. This gives us the value 0xE2 for the most significant byte of the UTF-8 pattern. Therefore, the UTF-8 pattern for the Unicode code point 0x2265 is 0xE289A5.
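We can double-check that hand calculation with Java’s built-in UTF-8 encoder; the class name is just for illustration.

    import java.nio.charset.StandardCharsets;

    public class CheckGreaterOrEqual {
        public static void main(String[] args) {
            // U+2265, the greater than or equal to symbol, should encode to
            // the three bytes E2 89 A5.
            byte[] utf8 = "\u2265".getBytes(StandardCharsets.UTF_8);
            for (byte b : utf8) {
                System.out.printf("%02X ", b & 0xFF); // prints E2 89 A5
            }
            System.out.println();
        }
    }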

As a second example, let’s examine the UTF-8 encoding 0x32CF8072 and decode it back to its Unicode characters. The binary pattern of the first byte, 0x32, is 00110010, which starts with a zero. That means that this is a one-byte pattern and maps to the 7-bit ASCII code points. In ASCII, 0x32 is the character ‘2’. The second byte is the first byte of our next UTF-8 encoded character. 0xCF is 11001111 in binary. Since we know that any UTF-8 pattern starting with 110 is a two-byte pattern, we know that the code point represented here is an eleven-bit pattern between 0x80 and 0x7FF. The first five bits of this eleven-bit pattern are the five bits following 110 in 0xCF, which are 01111. The last six bits of the eleven-bit pattern are the last six bits of the next byte, 0x80, which are 000000. That means that the binary of the Unicode code point represented by the UTF-8 pattern 0xCF80 is 01111000000, which is 0x3C0. A quick search of the Unicode code charts reveals that this is the code point for the lowercase Greek letter pi. The last byte, 0x72, in binary is 01110010. Since the first bit is zero, we know that once again, this is 7-bit ASCII. In ASCII, 0x72 is the lowercase letter ‘r’. That means that the four bytes of our example represent 2πr in UTF-8.
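Once again, Java’s built-in decoder confirms the hand decoding: handing it the four bytes 32 CF 80 72 gives back the string 2πr. The class name is just for illustration.

    import java.nio.charset.StandardCharsets;

    public class CheckDecoding {
        public static void main(String[] args) {
            byte[] utf8 = { 0x32, (byte) 0xCF, (byte) 0x80, 0x72 };
            System.out.println(new String(utf8, StandardCharsets.UTF_8)); // prints 2πr
        }
    }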

By the way, UTF-8 does have its drawbacks. In many cases, the variable-length nature of UTF-8 requires the characters of a string to be decoded one after another before performing operations such as searches or comparisons. And although UTF-8 allows for efficient encoding of the Latin alphabet, for Asian scripts such as Japanese kanji, the values of the code points require three-byte UTF-8 patterns where UTF-16 would only require two bytes. Typically, though, that savings is not worth giving up the near-universal acceptance of UTF-8.
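Here is a quick sketch of that first drawback: the number of UTF-8 bytes no longer matches the number of characters, so finding the nth character of a string means walking its bytes from the beginning. The class name is just for illustration.

    import java.nio.charset.StandardCharsets;

    public class VariableLength {
        public static void main(String[] args) {
            String s = "2\u03C0r";                                         // the string 2πr from our example
            System.out.println(s.length());                                // 3 characters
            System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 4 UTF-8 bytes
        }
    }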

In our next episode, we are going to look at a very different kind of encoding: line encoding. Line encoding defines how transmission media represent the ones and zeros of our data. For transcripts, links, or other podcast notes, please check us out at intermation.com where you will also find links to our Instagram, Twitter, Facebook, and Pinterest pages. Until the next episode, remember that while the scope of what makes a computer is immense, it’s all just ones and zeros.

References:

  1. Difference Between Unicode and UTF-8: http://www.differencebetween.net/technology/difference-between-unicode-and-utf-8/
  2. The Unicode Standard Version 12.0 – Core Specification: https://www.unicode.org/versions/Unicode12.0.0/UnicodeStandard-12.0.pdf
  3. Glossary of Unicode Terms: http://unicode.org/glossary/#code_point
  4. Rob Pike’s UTF-8 history: https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
  5. Unicode 12.1 Character Code Charts – Mathematical Operators: https://www.unicode.org/charts/PDF/U2200.pdf