| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by jcranmer 3653 days ago

If your goal is to eliminate ICU, there's not any change you can realistically make. Unicode has problems, but the most obvious things to fix (CJK unification, precomposed versus combining characters, different semantic characters with completely identical graphs (Angstrom sign versus A-with-circle-above, e.g.)) do not eliminate the need for ICU.

Languages are horribly complicated. The Turkish ı/İ issue makes capitalization a locale-dependent thing, and things like German ß/ẞ/ss/SS make case conversion in general mind-boggling. The treatment of diacritics in Latin script for collation purposes differs very heavily between major European languages, so sorting and searching are again locale-dependent. And by the time you're dealing with the locale mess of languages, handling locale-specific number, date, and time representations is pretty much trivial.

The need for giant Unicode character tables and CLDR tables, or tables that capture similar information, is quite frankly necessary to handle internationalization to any substantial degree.

1 comments

Someone 3653 days ago

"so sorting and searching are again locale-dependent"

It's worse. Sorting is dependent on the task at hand. http://userguide.icu-project.org/collation: "For example, in German dictionaries, "öf" would come before "of". In phone books the situation is the exact opposite."

That page has lots more 'interesting' cases, for example:

"Some French dictionary ordering traditions sort accents in backwards order, from the end of the string. For example, the word "côte" sorts before "coté" because the acute accent on the final "e" is more significant than the circumflex on the "o"."

That means that, given two strings s and t such that s sorts before t, you can append characters to t to get u which sorts before s. EDIT (after reading the reply of kelnage): _for some strings s and t_

kelnage 3653 days ago

No, I don't think that example does imply that. I interpret it as meaning that for the variants of the same "base word" (i.e. all characters are unaccented) the ordering is defined by the positions of the accents rather than their respective orderings. It says nothing about two words that have different lengths or bases.