Hacker News
by acdha 4659 days ago
You'd need rather unusual data for UTF-8 to come anywhere near 50% larger on real text: even if you only use Chinese, any punctuation, Arabic numerals, quotes, URLs, HTML, etc. pull the average down more than you might think. You'd also need a completely incompetent search engine design for that to remotely approach 50% more time to query or index.
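To make the averaging argument concrete, here is a minimal sketch that compares encoded sizes. The sample strings and the helper names `encoded_sizes` and `utf8_overhead` are made up for illustration, and UTF-16 is measured without a BOM.

```python
def encoded_sizes(text: str) -> tuple:
    """Return (utf8_bytes, utf16_bytes) for text, with no BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


def utf8_overhead(text: str) -> float:
    """Relative size of UTF-8 versus UTF-16: 0.5 means 50% larger."""
    utf8, utf16 = encoded_sizes(text)
    return utf8 / utf16 - 1.0


# Hypothetical samples: pure BMP CJK text, and the same kind of text
# wrapped in ASCII markup, a URL and digits.
pure_cjk = "中文文本的最坏情况"
marked_up = '<p><a href="http://example.com/2013/">中文文本</a>,共 42 条。</p>'

for label, sample in (("pure CJK", pure_cjk), ("with markup", marked_up)):
    utf8, utf16 = encoded_sizes(sample)
    print(f"{label}: utf-8={utf8} utf-16={utf16} overhead={utf8_overhead(sample):+.0%}")
```

Only the pure CJK string hits the full 50% penalty; once ASCII markup dominates, UTF-8 ends up smaller than UTF-16 for the same text.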

If you were assigned the task of indexing the UTF-8 worst-case corpus, nothing would stop you from designing a custom internal encoding while enjoying the many technical advantages UTF-8 gives you in every other area. Internal details like compression are much easier to change than external interfaces, which must be coordinated (this is why JavaScript still has such painful Unicode support even though browsers handle almost everything well in markup).


> nothing would stop you from designing a custom internal encoding while enjoying the many technical advantages ...

That's exactly how those UTF-X and UCS-Y encodings were invented, right?

The point is, this beast is called Unicode. How ironic.

The difference is internal vs. external: if you use UTF-8, things simply work better any time you exchange data. If you had lots of high code points (>U+8000), you would get better results in most cases from a full compression system than from playing with 2-byte encoding hacks, which is really all you're talking about in both cases.

I totally agree that UTF-8 is pretty good for exchanging data, both because it is better overall than UCS-* and the other UTF-* encodings and because everyone is (and should be) using it.

But you know, there are other cases besides exchange. Like I said, if your text data is mainly Latin you are fine, but not so much if you are stuck with non-Latin BMP text.

I'm defining “exchange” as not just the outside world but also among components in your own architecture. I've flushed out enough problems with the UCS-2 / UTF-16 breakage that I've become somewhat sold on the unambiguous nature of UTF-8.
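As an illustration of the kind of breakage meant here (a sketch, not the specific bugs referred to above): code that assumes one 16-bit unit per character fails on anything outside the BMP, and UTF-16 bytes also depend on byte order. The helper names `utf16_code_units` and `fits_in_ucs2` are hypothetical.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def fits_in_ucs2(text: str) -> bool:
    """True if every code point is in the BMP, i.e. representable in UCS-2."""
    return all(ord(ch) <= 0xFFFF for ch in text)


bmp_char = "\u4e2d"         # U+4E2D, inside the BMP
astral_char = "\U00020000"  # U+20000, outside the BMP

# One code unit versus a surrogate pair: "length" in UCS-2 terms is wrong
# for the second character.
print(utf16_code_units(bmp_char), utf16_code_units(astral_char))
print(fits_in_ucs2(bmp_char), fits_in_ucs2(astral_char))

# UTF-16 has two byte orders for the same text; UTF-8 has a single byte sequence.
print(bmp_char.encode("utf-16-le").hex(), bmp_char.encode("utf-16-be").hex())
print(bmp_char.encode("utf-8").hex())
```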

For the classic double-byte languages, what's your total data size after compression? In the case of full-text search, for example, enabling compression has been enough of a win that the expansion from 2 to 3 bytes per character hasn't been a challenge, particularly since the biggest UTF-8 drawback (the inability to predict the total byte length of a string) isn't an issue when working with a data structure which records the length of each record.
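One way to measure this on your own data is sketched below, assuming a simple store of length-prefixed records compressed with zlib. The record layout (a 4-byte big-endian length followed by the payload), the function names and the sample documents are all assumptions for illustration, not a description of any particular search engine.

```python
import struct
import zlib


def pack_records(texts, encoding):
    """Concatenate texts as records: 4-byte big-endian byte length, then payload."""
    out = bytearray()
    for text in texts:
        payload = text.encode(encoding)
        out += struct.pack(">I", len(payload))
        out += payload
    return bytes(out)


def unpack_records(blob, encoding):
    """Read the records back; the stored length removes any need to guess sizes."""
    texts, pos = [], 0
    while pos < len(blob):
        (length,) = struct.unpack_from(">I", blob, pos)
        pos += 4
        texts.append(blob[pos:pos + length].decode(encoding))
        pos += length
    return texts


def compressed_size(blob):
    """Size in bytes after zlib compression at the highest level."""
    return len(zlib.compress(blob, 9))


# Hypothetical, deliberately repetitive sample corpus; substitute real documents.
docs = ["全文检索系统通常会压缩存储的文本。", "每条记录都保存自己的字节长度。"] * 50

for encoding in ("utf-8", "utf-16-le"):
    raw = pack_records(docs, encoding)
    print(f"{encoding}: raw={len(raw)} compressed={compressed_size(raw)}")
```

The numbers depend entirely on the corpus, so the interesting comparison is the compressed column on real documents rather than on this toy sample.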