by sillysaurus2 4660 days ago
Storage space is cheap, and the price continues to fall. Storage of text is virtually nothing. Bandwidth to send text is almost nothing. Also, most text is compressed, which virtually eliminates that concern.

Programmer time is at least two orders of magnitude more expensive than storage space or bandwidth for text.

2 comments

> Storage space is cheap, and the price continues to fall.

At-rest storage is cheap. Memory is cheaper than it used to be, but CPU cache is not. At some point the text will have to cross the CPU where every byte still counts.

> Storage space is cheap

True, but

1. Time is precious. For example, a full-text indexing scan takes 50% more time because UTF-8 is longer.

2. Memory. If you can't hold the text in a single machine, you have bigger issues (e.g. clustering algorithms, persistence, redundancy, etc.).

3. Network transfer. If you can save 50% on a DB connection round trip, you save a lot.

It makes no sense to store BMP characters in 3 bytes anyway.
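
For reference, a quick Python sketch of what each code point range costs in UTF-8 versus UTF-16. The helper name `encoded_sizes` is made up for illustration; the encodings themselves are the standard Python codecs.

```python
def encoded_sizes(code_point: int) -> tuple[int, int]:
    """Return (UTF-8 bytes, UTF-16 bytes) needed for a single code point."""
    ch = chr(code_point)
    # "utf-16-le" is used so that no 2-byte BOM is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


# Boundary code points where the UTF-8 length changes.
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000):
    utf8_bytes, utf16_bytes = encoded_sizes(cp)
    print(f"U+{cp:04X}: UTF-8 {utf8_bytes} bytes, UTF-16 {utf16_bytes} bytes")
```

The 3-byte case covers U+0800 through U+FFFF, which is where most CJK text lives; everything below U+0800 is the same size or smaller in UTF-8.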

You'd have to have rather weird data for it to be anywhere near 50% larger for real text: even if you only use Chinese, once you have punctuation, Arabic numerals, quotes, URLs, HTML, etc., the averages cancel more than you might think. You'd also need a completely incompetent search engine design for that to remotely approach 50% more time to query or index.
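
To put rough numbers on the "averages cancel" point, here is a small Python sketch comparing encoded sizes. The sample strings and the helper name `utf8_vs_utf16` are invented for illustration, not measurements from a real corpus.

```python
def utf8_vs_utf16(text: str) -> float:
    """Return UTF-8 size divided by UTF-16 size (no BOM) for the given text."""
    utf8_len = len(text.encode("utf-8"))
    utf16_len = len(text.encode("utf-16-le"))  # -le variant omits the BOM
    return utf8_len / utf16_len


# Worst case for UTF-8: nothing but BMP characters at or above U+0800.
pure_cjk = "中文文本"

# Hypothetical "real" content: the same script wrapped in markup and a URL.
mixed = '<p>中文 2013 <a href="http://example.com/">链接</a></p>'

print(f"pure CJK: {utf8_vs_utf16(pure_cjk):.2f}x")
print(f"mixed:    {utf8_vs_utf16(mixed):.2f}x")
```

The pure CJK string hits the full 1.5x, but once ASCII markup dominates, the ratio drops below 1, meaning UTF-8 is the smaller encoding for that sample.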

If you were assigned the task of indexing the UTF-8 worst case corpus, nothing would stop you from designing a custom internal encoding while enjoying the many technical advantages UTF-8 gives you in every other area. Internal details like compression are much easier to change than external interfaces, which must be coordinated (this is why JavaScript still has such painful Unicode support even though browsers handle almost everything well in markup).

> nothing would stop you from designing a custom internal encoding while enjoying the many technical advantages ...

That's exactly how those UTF-X and UCS-Y encodings were invented, right?

The point is, this beast is called Unicode. How ironic.

The difference is internal vs. external: if you use UTF-8, things simply work better any time you exchange data. If you had lots of high code points (> U+8000), you would get better results in most cases using a full compression system rather than playing with 2-byte encoding hacks, which is really all you're talking about in both cases.

I totally agree that UTF-8 is pretty good at exchanging data, because UTF-8 is better than UCS-* and the other UTF-* encodings overall, and because everyone is (and should be) using it.

But you know, there are other cases besides exchange. Like I said, if your text data is mainly Latin you are good, but not so good if you are stuck with non-Latin BMP text.

I'm defining “exchange” as not just the outside world but also among components in your own architecture. I've flushed out enough problems with the UCS-2 / UTF-16 breakage that I've become somewhat sold on the unambiguous nature of UTF-8.
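
One kind of UCS-2 / UTF-16 breakage can be reproduced in a few lines of Python. This is only a sketch of the general failure mode (code that assumes one 16-bit unit per character), not a reproduction of any particular bug; the helper names are hypothetical.

```python
def is_utf8_continuation(byte: int) -> bool:
    """UTF-8 continuation bytes always have the bit pattern 10xxxxxx."""
    return byte & 0xC0 == 0x80


def utf8_boundaries(data: bytes) -> list[int]:
    """Byte offsets where a code point starts, found by inspecting bytes only."""
    return [i for i, b in enumerate(data) if not is_utf8_continuation(b)]


text = "a\U0001F600b"  # three code points, the middle one outside the BMP

# UTF-16: the non-BMP character needs a surrogate pair, so there are
# 4 code units for 3 code points. UCS-2 style code miscounts here.
utf16 = text.encode("utf-16-le")
utf16_units = len(utf16) // 2

# Cutting after the second 16-bit unit splits the surrogate pair.
try:
    utf16[:4].decode("utf-16-le")
    split_pair_decodes = True
except UnicodeDecodeError:
    split_pair_decodes = False

# UTF-8: every byte says whether it starts a code point, so a safe
# cut position can be found without decoding anything.
boundaries = utf8_boundaries(text.encode("utf-8"))

print(utf16_units, split_pair_decodes, boundaries)
```

UTF-8 can of course also be cut mid-character; the difference is that the byte itself shows it is a continuation byte, so the mistake is detectable locally.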

For the classic double-byte languages, what's your total data size after compression? For example, in the case of full-text search, enabling compression has been enough of a win that the 2/3-byte expansion hasn't been a challenge, particularly since the biggest UTF-8 drawback (inability to predict total byte string length) isn't an issue when working with a data structure which records the length of each record.
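
A minimal sketch of that combination, assuming a made-up record layout (4-byte little-endian length prefix followed by the UTF-8 payload) and zlib as the compressor. Real search engines use their own formats; the function names and sample strings here are purely illustrative.

```python
import struct
import zlib


def pack_records(texts: list[str]) -> bytes:
    """Serialize strings as [4-byte little-endian byte length][UTF-8 bytes]."""
    out = bytearray()
    for text in texts:
        payload = text.encode("utf-8")
        out += struct.pack("<I", len(payload))  # byte length, not character count
        out += payload
    return bytes(out)


def unpack_records(blob: bytes) -> list[str]:
    """Read records back; the stored length makes the byte size known up front."""
    texts = []
    offset = 0
    while offset < len(blob):
        (length,) = struct.unpack_from("<I", blob, offset)
        offset += 4
        texts.append(blob[offset:offset + length].decode("utf-8"))
        offset += length
    return texts


# Repetitive sample data, so it compresses far better than a real corpus would.
records = ["全文検索", "UTF-8 は可変長", "日本語のテキスト"] * 100

raw = pack_records(records)
compressed = zlib.compress(raw)

print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes")
```

Because each record carries its own byte length, the reader never has to guess how many bytes a string occupies, and the compressor works on whatever redundancy the encoded bytes contain.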