Hacker News new | ask | show | jobs
by gfody 98 days ago
the talk page behind the utf-16 wiki is actually quite interesting. it seems the manifesto guys tried to push their agenda there, and the allusions to "real text" with missing citations are a remnant of that. obv there's no such thing as "real text" and the statements about it containing many spaces and punctuation are nonsense (many languages do not delimit words with spaces, plenty of text is not mostly markup, and so on..)

despite the frothing hoard of web developers desperate to consider utf-16 harmful, it's still a fact that the consortium optimized unicode for 16-bits (https://www.unicode.org/notes/tn12) and their initial guidance to use utf-8 for compatibility and portability (like on the web) and utf-16 for efficiency and processing (like in a database, or in memory) is still sound.

1 comments

Interesting link! It shows its age though (22 years), as it makes the point that utf-16 is already the "most dominant processing format", but if that would be the deciding factor, then utf-8 would be today's recommendation, as utf-8 is the default for online data exchange and storage nowadays, all my software assumes utf-8 as the default as well. But I can't speak for people living and trading in places like S.Asia, like you.

If one develops for clients requiring a varying set of textual scripts, one could sidestep an ideological discussion and just make an educated guess about the ratio of utf-8 vs utf-16 penalties. That should not be complicated; sometimes utf-8 would require one more byte than utf-16 would, sometimes it's the other way around.