by majestic8 3399 days ago
Can someone help me understand why prefixes used in UTF-8 jump from "0" to "110", "1110", "11110" and so on? Why is "10" missing?
4 comments

https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt:

"Below are the guidelines that were used in defining the UCS transformation format: [...] 6) It should be possible to find the start of a character efficiently starting from an arbitrary location in a byte stream."

If they had used "10" as the marker for "this is the start of a two-byte sequence", it could not also have been used to mean "this is a byte in a multi-byte sequence, but not the first one".
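To make guideline 6 concrete, here is a minimal sketch of how you could find the start of a character from an arbitrary byte offset. It assumes the input is valid UTF-8, and the function name `char_start` is just something I made up for illustration:

```python
def char_start(data: bytes, i: int) -> int:
    """Return the index of the first byte of the character that contains data[i].

    Assumes `data` is valid UTF-8; this is an illustrative helper, not a library API.
    """
    # Continuation bytes have the form 10xxxxxx, so the top two bits are "10".
    while i > 0 and (data[i] & 0b1100_0000) == 0b1000_0000:
        i -= 1  # step back until we hit a byte that is not a continuation byte
    return i


text = "aé€".encode("utf-8")  # 1-byte, 2-byte and 3-byte characters
print(char_start(text, 5))    # offset 5 is the last byte of the euro sign
```

Because only continuation bytes start with "10", the loop never has to step back more than three bytes in valid UTF-8.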

"10" is used as the prefix for every byte after the first one. This is what makes UTF-8 self-synchronizing: a decoder that somehow ends up in the middle of a sequence can tell that it is not at a character boundary. See the first table in this Wikipedia link: https://en.wikipedia.org/wiki/UTF-8
Additionally, 110xxxxx tells you that the character is two bytes, 1110xxxx three bytes, and 11110xxx four bytes, i.e., the number of bytes in the sequence equals the number of leading 1 bits in the first byte.
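A rough sketch of that rule, if it helps. The names `leading_ones` and `sequence_length` are my own, and the sketch only looks at the prefix bits, so it does not reject lead bytes that are invalid for other reasons (such as overlong encodings):

```python
def leading_ones(byte: int) -> int:
    """Count the 1 bits at the top of a byte before the first 0 bit."""
    n = 0
    while n < 8 and byte & (0x80 >> n):
        n += 1
    return n


def sequence_length(byte: int) -> int:
    """Return the sequence length a byte announces, or 0 for a continuation byte."""
    ones = leading_ones(byte)
    if ones == 0:
        return 1      # 0xxxxxxx: a single-byte (ASCII) character
    if ones == 1:
        return 0      # 10xxxxxx: continuation byte, not the start of a character
    if ones <= 4:
        return ones   # 110xxxxx, 1110xxxx, 11110xxx: 2, 3 or 4 bytes
    raise ValueError("not a valid UTF-8 prefix")  # modern UTF-8 stops at 4 bytes


for ch in "aé€":
    first = ch.encode("utf-8")[0]
    print(f"{first:08b}", sequence_length(first))
```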
It's called a prefix code. It's a fundamental idea in coding theory.

https://en.wikipedia.org/wiki/Prefix_code

https://en.wikipedia.org/wiki/Coding_theory

The goal is that by reading any byte you can tell if you are at the start of a character sequence, so we have to start each byte with some prefix – otherwise continuation bytes might sometimes look like start bytes. If we did as you suggest, we'd have to prefix continuation bytes with "111110", leaving only two bits of data in each!