HN Mirror


by cryptonector 2867 days ago

Can I assume that's just subtle humor on your part?

If not:

UTF-16 was born of UCS-2 being a very poor codeset: UCS-2 was limited to the Unicode BMP, which means 2^16 codepoints, but Unicode has many more codepoints than that, so users couldn't have the pile-of-poo emoji. Something had to be done, and that something was to create a variable-length (in terms of 16-bit code units) encoding using a block of then-unassigned codepoints in the BMP as surrogates. The result yields only a sad, pathetic, measly 1,114,112 codepoints (U+0000 through U+10FFFF, a bit over 2^20), and that's just not that much. Moreover, while many codesets play well with ASCII, UTF-16 doesn't.

Also, decomposed forms of Unicode characters necessarily involve multiple codepoints, thus multiple code units... Many programmers hate variable-length text encodings because they can't do simple array indexing to find the nth character in a string, but with UTF-8, UTF-16, and just plain decomposition, that's a fact of life anyway. If you're going to have a variable-length encoding, you might as well use UTF-8 and get all its plays-nice-with-ASCII benefits. For Latin-mostly text UTF-8 is also more efficient than UTF-16, so there is a slight benefit there.
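To make the points above concrete, here is a rough Python sketch. The helper names (`utf16_units`, `codec_units`) are made up for illustration; only the standard library's codecs and `unicodedata` are assumed. It shows how a codepoint outside the BMP becomes a surrogate pair, how decomposition adds codepoints regardless of encoding, and how the byte counts compare for ASCII text.

```python
import unicodedata


def utf16_units(cp: int) -> list[int]:
    """Encode one codepoint as a list of 16-bit UTF-16 code units (illustrative helper)."""
    if not (0 <= cp <= 0x10FFFF) or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode codepoint")
    if cp < 0x10000:
        return [cp]                      # BMP: a single code unit, same as UCS-2
    v = cp - 0x10000                     # 20 bits left to spread over two units
    high = 0xD800 | (v >> 10)            # high (lead) surrogate carries the top 10 bits
    low = 0xDC00 | (v & 0x3FF)           # low (trail) surrogate carries the bottom 10 bits
    return [high, low]


def codec_units(s: str) -> list[int]:
    """Same result via Python's own UTF-16 codec, used here as a cross-check."""
    b = s.encode("utf-16-be")            # explicit byte order, so no BOM is emitted
    return [int.from_bytes(b[i:i + 2], "big") for i in range(0, len(b), 2)]


poo = "\U0001F4A9"                       # PILE OF POO, outside the BMP
assert utf16_units(ord(poo)) == codec_units(poo) == [0xD83D, 0xDCA9]

# Decomposition: one user-visible character, two codepoints, in any encoding.
decomposed = unicodedata.normalize("NFD", "\u00e9")   # e with acute accent
assert [hex(ord(c)) for c in decomposed] == ["0x65", "0x301"]

# ASCII text: UTF-8 bytes are the ASCII bytes; UTF-16 doubles the size.
assert "hello".encode("utf-8") == b"hello"
assert len("hello".encode("utf-16-le")) == 10

# Everything reachable through UTF-16: 17 planes of 2^16 codepoints each.
assert 0x10FFFF + 1 == 17 * 2**16 == 1_114_112
```

Note that indexing by code unit breaks on `poo` in UTF-16 just as indexing by byte breaks in UTF-8, and indexing by codepoint breaks on the decomposed string, which is why fixed-width indexing is not a reason to prefer UTF-16.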

Much of the rest of the non-Windows, non-ECMAScript world has settled on UTF-8, and that's a very very good thing.

1 comment

swozey 2867 days ago

No humor whatsoever. Thank you for the explanation! I'm an ops person who knows Python/Go to a dangerous extent and have never gone out of my way to understand the reasoning behind the UTF encodings. Your post intrigued me and made me want to ask why you felt that way. This will make me sound horrendously ignorant to someone such as yourself, but I'm here to learn.
