by est
4662 days ago
> Storage space is cheap

True, but:

1. Time is precious. For example, a full-text indexing scan takes about 50% longer on text that is mostly BMP characters above U+07FF (CJK, for instance), because UTF-8 needs 3 bytes for each of them where UTF-16 needs 2.

2. Memory. If you can't hold the text on a single machine, you have bigger issues (e.g. clustering algorithms, persistence, redundancy, etc.).

3. Network transfer. If you can avoid that overhead on every db connection round trip, you save a lot.

It makes no sense to store BMP characters in 3 bytes anyway.
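To put a number on the size argument, here is a minimal sketch that counts encoded bytes for two sample strings. The helper name `encoded_sizes` and the sample strings are made up for illustration, and `utf-16-le` is used so that no byte order mark is counted.

```python
def encoded_sizes(text):
    """Return the byte length of `text` in UTF-8 and in UTF-16 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


# 3000 CJK characters, all in the BMP above U+07FF:
# 3 bytes each in UTF-8, 2 bytes each in UTF-16.
cjk = "統一碼" * 1000

# 7000 ASCII characters: 1 byte each in UTF-8, 2 bytes each in UTF-16.
ascii_text = "unicode" * 1000

cjk_sizes = encoded_sizes(cjk)
ascii_sizes = encoded_sizes(ascii_text)

# How much larger UTF-8 is than UTF-16 for the CJK sample.
cjk_overhead = cjk_sizes["utf-8"] / cjk_sizes["utf-16"] - 1

print(cjk_sizes, ascii_sizes, cjk_overhead)
```

The 50% figure only applies to text dominated by those 3-byte characters. For ASCII-heavy text the ratio flips and UTF-16 is the one that doubles the size.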
If you were assigned the task of indexing a worst-case UTF-8 corpus, nothing would stop you from designing a custom internal encoding while still enjoying the many technical advantages UTF-8 gives you in every other area. Internal details like compression are much easier to change than external interfaces, which have to be coordinated with everyone else (this is why JavaScript still has such painful Unicode support even though browsers handle almost everything well in markup).
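A rough sketch of what that separation could look like, assuming a hypothetical in-memory store called `CompactStore`: callers only ever see UTF-8 bytes, while the storage format is a private detail (UTF-16 here, purely as a stand-in for whatever custom encoding or compression you would actually pick).

```python
class CompactStore:
    """Hypothetical store: UTF-8 at the interface, a private encoding inside."""

    # Internal detail; can be swapped without touching any caller.
    _INTERNAL = "utf-16-le"

    def __init__(self):
        self._docs = {}

    def put(self, doc_id, utf8_bytes):
        # Decode from the external encoding, re-encode for storage.
        text = utf8_bytes.decode("utf-8")
        self._docs[doc_id] = text.encode(self._INTERNAL)

    def get(self, doc_id):
        # Always hand UTF-8 back to the caller.
        return self._docs[doc_id].decode(self._INTERNAL).encode("utf-8")

    def stored_size(self, doc_id):
        return len(self._docs[doc_id])


store = CompactStore()
original = "日本語テキスト".encode("utf-8")  # 7 characters, 21 bytes in UTF-8
store.put("doc1", original)

round_trip = store.get("doc1")
internal_size = store.stored_size("doc1")  # 2 bytes per character internally

print(len(original), internal_size, round_trip == original)
```

Changing `_INTERNAL` (or replacing it with real compression) affects only this class, which is the point: the interface stays UTF-8 no matter what happens inside.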