
by est 4707 days ago

> nothing would stop you from designing a custom internal encoding while enjoying the many technical advantages ...

That's exactly how those UTF-X and UCS-Y encodings were invented, right?

The point is, this beast is called Unicode. How ironic.

1 comments

acdha 4707 days ago

The difference is internal vs. external: if you use UTF-8, things simply work better any time you exchange data. If you had lots of high code points (>U+8000), you would get better results in most cases from a full compression system than from playing with 2-byte encoding hacks, which is really all you're talking about in both cases.


est 4707 days ago

I totally agree that UTF-8 is pretty good for exchanging data, because UTF-8 is better overall than UCS-* and the other UTF-* encodings, and because everyone is (and should be) using it.

But you know, there are other cases besides exchange. Like I said, if your text data is mainly Latin you are good, but not so good if you are stuck with non-Latin BMP characters.
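
To make the size difference concrete, here is a minimal Python sketch. The sample strings and the helper name `encoded_sizes` are illustrative assumptions, not anything from the thread. It counts bytes for the same text in UTF-8 and in UTF-16 without a BOM:

```python
# Illustrative samples only: one Latin string, one non-Latin BMP string.
samples = {
    "latin": "hello world",
    "cjk": "你好世界",
}

def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, encoded without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

for name, text in samples.items():
    u8, u16 = encoded_sizes(text)
    print(f"{name}: {len(text)} code points, UTF-8={u8} bytes, UTF-16={u16} bytes")
```

ASCII text takes 1 byte per character in UTF-8 and 2 in UTF-16, while BMP characters at or above U+0800 take 3 bytes in UTF-8 and 2 in UTF-16, which is the 2/3-byte expansion being discussed.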


acdha 4706 days ago

I'm defining “exchange” as not just the outside world but also among components in your own architecture. I've flushed out enough problems with the UCS-2 / UTF-16 breakage that I've become somewhat sold on the unambiguous nature of UTF-8.

For the classic double-byte languages, what's your total data size after compression? For example, in the case of full-text search, enabling compression has been enough of a win that the 2/3-byte expansion hasn't been a challenge, particularly since the biggest UTF-8 drawback (inability to predict total byte string length) isn't an issue when working with a data structure which records the length of each record.
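
One way to check this against your own data is a rough sketch like the following. It assumes zlib as a stand-in for whatever compression the storage layer actually uses, and the sample text and the `size_report` helper are hypothetical. The repeated sample compresses far better than real text would, so treat the numbers as a way to run the comparison, not as a result:

```python
import zlib

def size_report(text, level=6):
    """Raw and zlib-compressed byte counts for text in UTF-8 and UTF-16 (no BOM)."""
    raw8 = text.encode("utf-8")
    raw16 = text.encode("utf-16-le")
    return {
        "utf8_raw": len(raw8),
        "utf16_raw": len(raw16),
        "utf8_zlib": len(zlib.compress(raw8, level)),
        "utf16_zlib": len(zlib.compress(raw16, level)),
    }

# Hypothetical sample: a short CJK sentence repeated many times.
# Substitute a representative slice of your own corpus here.
sample = "統一碼聯盟制定的字符編碼標準。" * 200

report = size_report(sample)
for key, value in report.items():
    print(f"{key}: {value} bytes")
```

Comparing `utf8_zlib` with `utf16_zlib` on a real corpus shows how much of the raw 3:2 gap survives compression for that data.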
