| HN Mirror



	by voaie 3649 days ago
	Maybe off-topic, but I wonder if anyone is planning a redesign of Unicode for the far future? Or is there a better way to handle characters, so we don't require a giant library like ICU?

4 comments

jcranmer 3649 days ago

If your goal is to eliminate ICU, there isn't any change you can realistically make. Unicode has problems, but the most obvious things to fix (CJK unification, precomposed versus combining characters, and semantically different characters with completely identical glyphs, e.g. Angstrom sign versus A-with-ring-above) do not eliminate the need for ICU.
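
To make the identical-glyph and precomposed-versus-combining cases concrete, here is a minimal sketch using Python's standard `unicodedata` module. Python is just a convenient choice for illustration, and the helper name `same_after_nfc` is made up for this example.

```python
import unicodedata

ANGSTROM_SIGN = "\u212B"      # ANGSTROM SIGN
A_WITH_RING = "\u00C5"        # LATIN CAPITAL LETTER A WITH RING ABOVE (precomposed)
A_PLUS_COMBINING = "A\u030A"  # 'A' followed by COMBINING RING ABOVE

# Three different code point sequences that normally render identically.
raw_all_distinct = len({ANGSTROM_SIGN, A_WITH_RING, A_PLUS_COMBINING}) == 3


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print(raw_all_distinct)
    print(same_after_nfc(ANGSTROM_SIGN, A_WITH_RING))
    print(same_after_nfc(A_PLUS_COMBINING, A_WITH_RING))
```

Even this small comparison depends on the normalization data in the Unicode character tables, so cleaning up these cases would not make the tables go away.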

Languages are horribly complicated. The Turkish ı/İ issue makes capitalization a locale-dependent thing, and things like German ß/ẞ/ss/SS make case conversion in general mind-boggling. The treatment of diacritics in Latin script for collation purposes differs very heavily between major European languages, so sorting and searching are again locale-dependent. And by the time you're dealing with the locale mess of languages, handling locale-specific number, date, and time representations is pretty much trivial.
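
A small Python sketch of the case-conversion problem, for illustration only. Python's built-in `str.upper()` and `str.lower()` apply the default, locale-independent Unicode mappings; the `turkish_upper` and `turkish_lower` helpers below are hypothetical stand-ins for what a locale-aware library such as ICU would do, and they only handle the dotted/dotless i pair.

```python
def turkish_upper(text: str) -> str:
    """Hypothetical helper: uppercase with Turkish dotted/dotless i rules."""
    # Map i -> İ and ı -> I first, then let the default mapping do the rest.
    return text.replace("i", "İ").replace("ı", "I").upper()


def turkish_lower(text: str) -> str:
    """Hypothetical helper: lowercase with Turkish dotted/dotless i rules."""
    # Map İ -> i and I -> ı first, then let the default mapping do the rest.
    return text.replace("İ", "i").replace("I", "ı").lower()


if __name__ == "__main__":
    # The default mapping loses the Turkish distinction.
    print("istanbul".upper(), turkish_upper("istanbul"))
    print("ISPARTA".lower(), turkish_lower("ISPARTA"))

    # German: uppercasing ß expands to two letters and does not round-trip.
    print("straße".upper(), "straße".upper().lower())
```

The same input string needs different output depending on the language of the text, which is information the string itself does not carry.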

Giant Unicode character tables and CLDR tables, or tables that capture similar information, are quite frankly necessary to handle internationalization to any substantial degree.

Someone 3649 days ago

"so sorting and searching are again locale-dependent"

It's worse. Sorting is dependent on the task at hand. http://userguide.icu-project.org/collation: "For example, in German dictionaries, "öf" would come before "of". In phone books the situation is the exact opposite."

That page has lots more 'interesting' cases, for example:

"Some French dictionary ordering traditions sort accents in backwards order, from the end of the string. For example, the word "côte" sorts before "coté" because the acute accent on the final "e" is more significant than the circumflex on the "o"."

That means that, given two strings s and t such that s sorts before t, you can append characters to t to get u which sorts before s. EDIT (after reading the reply of kelnage): _for some strings s and t_

kelnage 3649 days ago

No, I don't think that example does imply that. I interpret it as meaning that for the variants of the same "base word" (i.e. all characters are unaccented) the ordering is defined by the positions of the accents rather than their respective orderings. It says nothing about two words that have different lengths or bases.
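
A minimal Python sketch of that interpretation, assuming a simplified two-level sort key: the unaccented base word is compared first, and accents only break ties. The function names are made up for illustration, and comparing accents by combining-mark code point is an assumption of this sketch, not ICU's actual algorithm.

```python
import unicodedata


def split_base_and_accents(word: str):
    """Return (base letters, tuple of accent marks, one entry per letter)."""
    base, accents = [], []
    for ch in unicodedata.normalize("NFC", word):
        decomposed = unicodedata.normalize("NFD", ch)
        base.append(decomposed[0])      # the unaccented letter
        accents.append(decomposed[1:])  # its combining marks, "" if none
    return "".join(base), tuple(accents)


def forward_accent_key(word: str):
    """Accents compared left to right."""
    base, accents = split_base_and_accents(word)
    return (base, accents)


def backward_accent_key(word: str):
    """Accents compared from the end of the word, as in the French tradition."""
    base, accents = split_base_and_accents(word)
    return (base, accents[::-1])


if __name__ == "__main__":
    words = ["côté", "coté", "côte", "cote"]
    print(sorted(words, key=forward_accent_key))
    print(sorted(words, key=backward_accent_key))
```

With the backward key, "côte" sorts before "coté", matching the quoted example. Because the base word is compared first in this sketch, the accent direction only matters between words that share the same base letters.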

jandrese 3649 days ago

What would you do differently? Unicode isn't complex because people like things that are hard to understand; it's complex because it took on an exceedingly difficult problem.

voaie 3649 days ago

Given more and more custom fonts in OSes and websites, maybe by using some new APIs, we don't need to specify everything in the Unicode standard. We can design a new font format, or just a separate data file, to store that locale-specific information. The Unicode code points then become parking slots for different fonts (with locale info to be registered). And we can use a standard/default data file to keep the old info from the current Unicode standard (say Unicode 8.0).

This is just my first thought. It seems the job of ICU would be transferred to the OS or web browser.

voaie 3649 days ago

I think the Unicode standard should not limit the use of fonts. Instead, let the font or the additional locale data file tell us how to deal with those locale issues.

PeterisP 3649 days ago

If you want to handle characters by anything much simpler than current Unicode, you need to simplify the reality that Unicode describes, changing or eliminating a bunch of major human languages. Not all of them, and not even most of them, but still hundreds of millions of people would need to change how they use their language.

It could happen in a century or two, actually; we are seeing some language trends that favor internationalization and simplification over localization and keeping with linguistic tradition.

vorg 3649 days ago

Simplification (caused by internationalization) and diversification (caused by localization) are two ends of a spectrum, and languages, in both their spoken and written forms, have bounced between those ends throughout history. In a century or two, by the time simplification has succeeded on Earth, the settlers on Titan will rebel with their own graphical symbols for displaying language.

hackuser 3649 days ago

> we are seeing some language trends that do favor internationalization and simplification over localization and keeping with linguistic tradition.

I know you're not necessarily advocating it, but if our cultures change to adapt to our technological limitations, that's the reverse of what I think should be happening - there's a problem with the tech.

voaie 3649 days ago

Right, there will be fewer languages in common use. The faded ones could be kept in the digital world by using special fonts.

damienkatz 3649 days ago

ICU would still be necessary for collation and case conversion.
