Hacker News new | ask | show | jobs
by est 4659 days ago
> nothing would stop you from designing a custom internal encoding while enjoying the many technical advantages ...

That's exactly how those UTF-X and UCS-Y encodings were invented, right?

The point is, this beast is called unicode, how ironic.

1 comments

The difference is internal vs. external: if you use UTF-8 things simply work better any time you exchange data. If you had lots of high code points (>\U+8000) you would get better results in most cases using a full compression system rather than playing with 2-byte encoding hacks, which is really all you're talking about in both cases.
I totally agree that UTF-8 is pretty good at exchanging data because UTF8 is better than UCS-* and other UTF-* overall, and because everyone is (and should be) using it.

But you know, there are other cases besides exchanging. Like I said, if your text data is mainly latin you are good, but not so good if you are stuck with non-latin BMPs.

I'm defining “exchange” as not just the outside world but also among components in your own architecture. I've flushed out enough problems with the UCS-2 / UTF-16 breakage that I've become somewhat sold on the unambiguous nature of UTF-8.

For the classic double-byte languages, what's your total data size after compression? e.g. in the case of full-text search, enabling compression has been enough of a win that the 2/3-byte expansion hasn't been a challenge, particularly since the biggest UTF-8 drawback (inability to predict total byte string length) isn't an issue when working with a data structure which records the length of each record.