HN Mirror



	by majestic8 3399 days ago
	Can someone help me understand why the prefixes used in UTF-8 jump from "0" to "110", "1110", "11110" and so on? Why is "10" missing?

4 comments

Someone 3399 days ago

https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt:

"Below are the guidelines that were used in defining the UCS transformation format: [...] 6) It should be possible to find the start of a character efficiently starting from an arbitrary location in a byte stream."

If they had used "10" as the marker for "this is the start of a two-byte sequence", it could not also have been used for "this is a byte in a multi-byte sequence, but not the first one".


pkaye 3399 days ago

"10" is used as a prefix for the bytes after the first. This gives it the self-synchronization property if it somehow ends up in the middle of a sequence. See the first table in this Wikipedia link: https://en.wikipedia.org/wiki/UTF-8
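To make the self-synchronization point concrete, here is a minimal sketch in Python. The function name `char_start` is made up for illustration, and it assumes the input is well-formed UTF-8. It backs up from an arbitrary byte offset until it reaches a byte that does not start with "10":

```python
def char_start(data: bytes, pos: int) -> int:
    """Return the index of the first byte of the character containing data[pos].

    Hypothetical helper for illustration; assumes data is well-formed UTF-8.
    """
    # Continuation bytes have the form 10xxxxxx, i.e. (byte & 0xC0) == 0x80.
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


data = "a€b".encode("utf-8")  # bytes: 61 E2 82 AC 62
print(char_start(data, 3))    # offset 3 is the last byte of "€"
```

Because a well-formed sequence has at most three continuation bytes, the loop never has to step back more than three positions.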


jcranmer 3399 days ago

Additionally, 110xxxxx tells you that the character is two bytes, 1110xxxx three bytes, and 11110xxx four bytes, i.e., the number of bytes in the sequence equals the number of leading 1 bits in the first byte.
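As a rough sketch of that rule in Python (the name `sequence_length` is hypothetical, not from any library), the length can be read off the first byte with a few bit masks:

```python
def sequence_length(lead: int) -> int:
    """Return how many bytes a UTF-8 sequence has, given its first byte.

    Hypothetical helper for illustration; it only inspects the prefix bits
    and does not validate the rest of the sequence.
    """
    if (lead & 0x80) == 0x00:  # 0xxxxxxx: single-byte (ASCII)
        return 1
    if (lead & 0xE0) == 0xC0:  # 110xxxxx: two bytes
        return 2
    if (lead & 0xF0) == 0xE0:  # 1110xxxx: three bytes
        return 3
    if (lead & 0xF8) == 0xF0:  # 11110xxx: four bytes
        return 4
    # 10xxxxxx is a continuation byte; anything else is not a valid lead byte.
    raise ValueError(f"not a valid UTF-8 lead byte: {lead:#04x}")


for ch in ("a", "é", "€", "😀"):
    print(ch, sequence_length(ch.encode("utf-8")[0]))
```

Note that a byte starting with "10" falls through to the error case, which is exactly why that prefix cannot also mark the start of a sequence.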


nemoniac 3399 days ago

It's called a prefix code. It's a fundamental idea in coding theory.

https://en.wikipedia.org/wiki/Prefix_code

https://en.wikipedia.org/wiki/Coding_theory


nicwolff 3399 days ago

The goal is that by reading any byte you can tell if you are at the start of a character sequence, so we have to start each byte with some prefix – otherwise continuation bytes might sometimes look like start bytes. If we did as you suggest, we'd have to prefix continuation bytes with "111110", leaving only two bits of data in each!
