This may be off-topic, but is anyone planning a redesign of Unicode for the far future? Or is there a better way to handle characters, so that we don't need a giant library like ICU?
If your goal is to eliminate ICU, there is no change you can realistically make. Unicode has problems, but fixing the most obvious ones (CJK unification, precomposed versus combining characters, semantically distinct characters with completely identical glyphs, such as the Angstrom sign versus A-with-ring-above) would not eliminate the need for ICU.
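To make the precomposed/combining and Angstrom-sign point concrete, here is a minimal sketch using Python's standard `unicodedata` module. The constant names and the `same_text` helper are my own, introduced only for illustration. It shows three different code point sequences that render identically and compare equal only after normalization.

```python
import unicodedata

# Three encodings of what a reader sees as the same character.
ANGSTROM_SIGN = "\u212b"          # ANGSTROM SIGN
A_WITH_RING = "\u00c5"            # LATIN CAPITAL LETTER A WITH RING ABOVE
A_PLUS_COMBINING_RING = "A\u030a"  # "A" followed by COMBINING RING ABOVE

def same_text(a, b):
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

forms = [ANGSTROM_SIGN, A_WITH_RING, A_PLUS_COMBINING_RING]

# As raw code point sequences, all three are different strings.
raw_distinct = len(set(forms))

# After NFC normalization they collapse to a single string.
normalized_distinct = len({unicodedata.normalize("NFC", f) for f in forms})
```

Note that the normalization step itself is driven by character tables, so even this small cleanup relies on the kind of data ICU ships.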
Languages are horribly complicated. The Turkish ı/İ issue makes capitalization a locale-dependent thing, and things like German ß/ẞ/ss/SS make case conversion in general mind-boggling. The treatment of diacritics in Latin script for collation purposes differs very heavily between major European languages, so sorting and searching are again locale-dependent. And by the time you're dealing with the locale mess of languages, handling locale-specific number, date, and time representations is pretty much trivial.
The need for giant Unicode character tables and CLDR tables, or tables that capture similar information, is quite frankly necessary to handle internationalization to any substantial degree.
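As a rough illustration of the casing issues, here is a sketch in Python. The built-in `str.upper()` and `str.lower()` apply the default, locale-independent Unicode mappings. The `turkish_upper` and `turkish_lower` functions are hypothetical helpers I made up to show what a locale-aware version has to do differently; they only handle precomposed input and are nowhere near what a real library does.

```python
DOTTED_CAPITAL_I = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"   # LATIN SMALL LETTER DOTLESS I

def turkish_upper(text):
    """Hypothetical sketch: uppercase with Turkish i/I rules applied first."""
    text = text.replace("i", DOTTED_CAPITAL_I).replace(DOTLESS_SMALL_I, "I")
    return text.upper()

def turkish_lower(text):
    """Hypothetical sketch: lowercase with Turkish i/I rules applied first."""
    text = text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I)
    return text.lower()

# Default mappings: fine for English, wrong for Turkish.
default_upper = "istanbul".upper()       # plain "I", no dot
turkish_result = turkish_upper("istanbul")  # starts with dotted capital I

# German: uppercasing sharp s changes the string length, and the
# round trip through upper() and lower() does not restore the original.
sharp_s_upper = "stra\u00dfe".upper()
round_trip = sharp_s_upper.lower()
```

The same input gives different correct answers depending on the language, which is why the locale has to be an explicit input to case conversion.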
"so sorting and searching are again locale-dependent"
It's worse. Sorting is dependent on the task at hand. http://userguide.icu-project.org/collation: "For example, in German dictionaries, "öf" would come before "of". In phone books the situation is the exact opposite."
That page has lots more 'interesting' cases, for example:
"Some French dictionary ordering traditions sort accents in backwards order, from the end of the string. For example, the word "côte" sorts before "coté" because the acute accent on the final "e" is more significant than the circumflex on the "o"."
That means that, given two strings s and t such that s sorts before t, you can append characters to t to get u which sorts before s. EDIT (after reading the reply of kelnage): _for some strings s and t_
No, I don't think that example does imply that. I interpret it as meaning that for the variants of the same "base word" (i.e. all characters are unaccented) the ordering is defined by the positions of the accents rather than their respective orderings. It says nothing about two words that have different lengths or bases.
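The interpretation above can be sketched as a two-level sort key in Python: base letters are compared first, and accents only break ties, read either left to right or from the end of the word. This is a toy model under my own assumptions, not ICU's algorithm. The function names are hypothetical, and accents are ranked simply by code point, with "no accent" sorting before any accent.

```python
import unicodedata

def split_accents(word):
    """Return (base_letters, accents); accents[i] holds the marks on letter i."""
    bases, accents = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and bases:
            accents[-1] += (ord(ch),)  # attach the mark to the previous letter
        else:
            bases.append(ch)
            accents.append(())
    return "".join(bases), tuple(accents)

def sort_key(word, backward_accents=False):
    """Primary level: base letters. Secondary level: accents, optionally reversed."""
    base, accents = split_accents(word)
    if backward_accents:
        accents = accents[::-1]  # the last accent becomes the most significant
    return (base, accents)

words = ["côté", "coté", "côte", "cote"]

# Accents compared from the start of the word.
forward = sorted(words, key=sort_key)

# Accents compared from the end of the word, as in the French tradition quoted above.
backward = sorted(words, key=lambda w: sort_key(w, backward_accents=True))
```

With this key, `forward` comes out as cote, coté, côte, côté, and `backward` as cote, côte, coté, côté. Words with different base letters, such as "cote" and "cotes", are already decided at the first level, so the accent direction never comes into play for them.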
What would you do differently? Unicode isn't complex because people like things that are hard to understand, it's complex because it took on an exceedingly difficult problem.
Given that OSes and websites ship more and more custom fonts, maybe with some new APIs we wouldn't need to specify everything in the Unicode standard. We could design a new font format, or just a separate data file, to store that locale-specific information. The Unicode code points would then become parking slots for different fonts (with locale info to be registered). And we could use a standard/default data file to preserve the information in the current Unicode standard (say Unicode 8.0).
This is just my first thought. It seems that the job of ICU would be transferred to the OS or web browser.
I think the Unicode standard should not limit the use of fonts. Instead, let the font or the additional locale data file tell us how to deal with those locale issues.
If you want to handle characters by anything much simpler than current Unicode, you need to simplify the reality that Unicode describes, changing or eliminating a bunch of major human languages. Not all of them, and not even most of them, but still hundreds of millions of people would need to change how they use their language.
It could actually happen in a century or two: we are seeing some language trends that do favor internationalization and simplification over localization and keeping with linguistic tradition.
Simplification (caused by internationalization) and diversification (caused by localization) are two ends of a spectrum, but languages, in both their spoken and written forms, have bounced between those ends throughout history. In a century or two, by the time simplification has succeeded on Earth, the settlers on Titan will rebel with their own graphical symbols for displaying language.
> we are seeing some language trends that do favor internationalization and simplification over localization and keeping with linguistic tradition.
I know you're not necessarily advocating it, but if our cultures change to adapt to our technological limitations, that's the reverse of what I think should be happening - there's a problem with the tech.