by flohofwoe 2394 days ago
IMHO the article should mention that UTF-16 was (more or less) a hack to fix Windows and some other systems which didn't see the light and use UTF-8 from the start. UTF-16 has all the disadvantages of UTF-8 (variable length) and UTF-32 (endianness), but none of the advantages (being an endian-agnostic, 7-bit-ASCII-compatible byte stream like UTF-8, or a fixed-width encoding like UTF-32). UTF-16 should really be considered a hack to talk to (mainly) Windows APIs.

Also, obligatory link to: https://utf8everywhere.org/
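
To make the "variable length plus endianness" point concrete, here is a rough Python sketch. The helper name `encoded_sizes` is made up for illustration; it only uses the standard `str.encode` codecs.

```python
# Illustrative only: byte counts per code point in each encoding.
def encoded_sizes(ch: str) -> dict:
    """Return the encoded length in bytes of one character, per encoding."""
    return {name: len(ch.encode(name)) for name in ("utf-8", "utf-16-le", "utf-32-le")}

# 'A', e-acute, euro sign, and an emoji outside the 16-bit range.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# UTF-16 is variable length: 2 bytes for 'A', 4 bytes for the emoji.
# It is also byte-order dependent, unlike UTF-8:
print("A".encode("utf-16-le"))  # b'A\x00'
print("A".encode("utf-16-be"))  # b'\x00A'
print("A".encode("utf-8"))      # b'A'
```

So UTF-16 needs both a length check per character and a byte-order decision, while UTF-8 only needs the former and UTF-32 only the latter.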


Windows and many other operating systems and languages (Java) got on board with Unicode back when the character set would fit in 16 bits. The encoding originally used was UCS-2 (not UTF-16). UTF-16 came next, to extend Unicode beyond 65,536 code points.
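
For anyone curious how UTF-16 gets past the 16-bit limit: code points above U+FFFF are split into two 16-bit "surrogate" units. A minimal sketch of that arithmetic follows; the function name `to_surrogate_pair` is my own, not from any library.

```python
from typing import Tuple

def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded with surrogates")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")  # D83D DE00
```

Anything at or below U+FFFF is stored as a single unit, which is why UCS-2 text is also valid UTF-16.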

UTF-8 wasn't even invented until well after all these operating systems and languages deployed Unicode.

They didn't "see the light" and use UTF-8 because they didn't have a time machine to make that possible.

I actually checked a while ago when UTF-8 was created, and it was right around the time Windows NT was being developed with its "early" 16-bit Unicode support. UTF-8 was created in September 1992 [1], and Windows NT came out in mid-1993, but I guess by then it was too late for Windows to change to UTF-8 (and I guess the advantages of UTF-8 weren't as clear back then).

But IMHO there's no excuse to not use UTF-8 after around 1995 ;)

[1] https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

Also, UTF-16 was only published in July 1996 (although the need for more than 16 bits was probably apparent a bit earlier). So before that, Unicode was only a 16-bit encoding, and UCS-2 was enough. UTF-8 was initially just a nice trick to keep using ASCII characters for things like directory separators (/) and single-byte NUL terminators. By 1995 its superiority certainly wasn't apparent yet.
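
A quick way to see the "nice trick": in UTF-8, bytes below 0x80 only ever mean ASCII, so a `/` or NUL byte can't show up inside a multi-byte character. A small sketch (the helper name `has_ascii_byte` and the choice of U+2F00 as the sample are mine, purely for illustration):

```python
def has_ascii_byte(encoded: bytes) -> bool:
    """True if any byte is in the 7-bit ASCII range (0x00..0x7F)."""
    return any(b < 0x80 for b in encoded)

sample = "\u2f00"  # a non-ASCII character whose code point contains the byte 0x2F

print(sample.encode("utf-8"))      # b'\xe2\xbc\x80'
print(sample.encode("utf-16-le"))  # b'\x00/'

# UTF-8 never produces ASCII-range bytes for a non-ASCII character.
print(has_ascii_byte(sample.encode("utf-8")))      # False
# 16-bit encodings can produce a stray '/' (0x2F) or NUL (0x00) byte.
print(has_ascii_byte(sample.encode("utf-16-le")))  # True
```

That is why byte-oriented code that splits paths on `/` or stops at NUL keeps working on UTF-8, but not on UCS-2 or UTF-16.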

Also, Windows internals were completely 16-bit-character based, including e.g. the NTFS disk format, so by 1992 that was already quite hard to change.

That said, it is crazy that NT didn't have full UTF-8 support, including in console windows, by about 2000.

The main point that should be emphasised is that an encoding with fixed-size Unicode code points is mostly unnecessary, because you mostly don't care about the code points but about what the resulting glyphs, or even glyph runs, look like.
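
A small illustration of why fixed-width code points don't buy you "one unit per visible character". This is only a sketch using Python's standard library, which counts code points and has no built-in grapheme segmentation; the helper name `utf32_units` is made up.

```python
import unicodedata

def utf32_units(text: str) -> int:
    """Number of 32-bit units the text occupies in UTF-32."""
    return len(text.encode("utf-32-le")) // 4

decomposed = "e\u0301"         # 'e' + COMBINING ACUTE ACCENT, drawn as one glyph
flag = "\U0001F1E9\U0001F1EA"  # two regional indicators, usually drawn as one flag

print(utf32_units(decomposed))  # 2
print(utf32_units(flag))        # 2

# Normalization can merge the first case into one code point, but not the flag.
print(len(unicodedata.normalize("NFC", decomposed)))  # 1
print(len(unicodedata.normalize("NFC", flag)))        # 2
```

So even in UTF-32, indexing by unit can land in the middle of something the user sees as a single character.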

My experience is that if you want to implement an efficient Unicode-aware text editor, the right data structure is a list of lines, and you simply have to forget about gap buffers, ropes and whatnot (unless you really care about lines/paragraphs of 32k+, which is when a rope-style representation starts to make sense, as long as the breaks match Unicode semantics).
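
For what it's worth, a list-of-lines buffer can be sketched in a few lines of Python. This is a toy under loud assumptions: the class name `LineBuffer` and its methods are invented here, and columns are counted in code points rather than graphemes, which a real editor would have to handle.

```python
class LineBuffer:
    """Toy text buffer stored as a plain list of lines (no trailing newlines)."""

    def __init__(self, text: str = "") -> None:
        self.lines = text.split("\n")

    def insert(self, row: int, col: int, text: str) -> None:
        """Insert text (which may contain newlines) at (row, col)."""
        line = self.lines[row]
        head, tail = line[:col], line[col:]
        pieces = text.split("\n")
        pieces[0] = head + pieces[0]      # first piece continues the current line
        pieces[-1] = pieces[-1] + tail    # last piece keeps the rest of the line
        self.lines[row:row + 1] = pieces  # replace one line with one or more lines

    def delete(self, row: int, col: int, count: int) -> None:
        """Delete `count` code points inside a single line."""
        line = self.lines[row]
        self.lines[row] = line[:col] + line[col + count:]

    def text(self) -> str:
        return "\n".join(self.lines)


buf = LineBuffer("hello\nworld")
buf.insert(0, 5, ", there\nnew")
print(buf.lines)  # ['hello, there', 'new', 'world']
```

Each edit only rebuilds the lines it touches, which is cheap as long as individual lines stay reasonably short.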