

by nabla9 2386 days ago

Unicode has two really great features.

* It names and defines things and sets a standard. This seems trivial but is incredibly useful.

* Unicode encodings, mainly UTF-8, are a good storage format for text (less so as an in-memory data structure for editing text, if you want to be universal).

Unicode has one really horrible failing.

The 'user-perceived character' (Unicode terminology) is arguably the most important unit in text. Unicode approximates user-perceived characters with a set of general rules that define grapheme clusters. A grapheme cluster is a sequence of adjacent code points that applications should treat as a unit. Unfortunately, the ruleset and the definition are inadequate. Sometimes you need two grapheme clusters to define one unit.
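To make the gap between bytes, code points, and clusters concrete, here is a rough Python sketch. It is not the real Unicode segmentation algorithm: `approx_clusters` is a hypothetical helper that only attaches combining marks to the preceding code point, and it ignores everything else in the actual rules (emoji ZWJ sequences, Hangul jamo, regional indicators, and so on).

```python
import unicodedata


def approx_clusters(text: str) -> list[str]:
    """Group each code point with the combining marks that follow it.

    Simplified assumption: a cluster is one base code point plus any
    following code points with a non-zero canonical combining class.
    The real grapheme cluster rules cover many more cases.
    """
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # attach the mark to the previous cluster
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


s = "e\u0301a"  # 'e' + COMBINING ACUTE ACCENT, then 'a'
print(len(s.encode("utf-8")))   # 4 UTF-8 bytes
print(len(s))                   # 3 code points
print(len(approx_clusters(s)))  # 2 approximate clusters
```

Three different lengths for the same short string, and only the last one is close to what a user would count.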

If you get a UTF-8 encoded, normalized string from some unspecified source and era, and you don't know which application wrote it, which version of the Unicode standard it used, or what the locale was, you may lose some information.
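One small illustration of this, in Python. It is my own example and not necessarily the kind of loss meant above: normalization can erase which code point was originally written, and the string itself carries no record of the Unicode version behind it. `to_nfc` is just a hypothetical wrapper name.

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return the NFC-normalized form of text."""
    return unicodedata.normalize("NFC", text)


angstrom = "\u212b"  # ANGSTROM SIGN
a_ring = "\u00c5"    # LATIN CAPITAL LETTER A WITH RING ABOVE

# After NFC the two are the same string, so the original choice is gone.
print(to_nfc(angstrom) == a_ring)

# The Unicode version is a property of the runtime, not of the string.
print(unicodedata.unidata_version)
```

Anyone receiving the normalized bytes has no way to tell which of the two characters the writer typed, or which version of the rules the writer's software applied.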

Unicode should have added an explicit encoding for user-perceived character boundaries (either a fixed grapheme cluster encoding or a completely different encoding). Let the writing software define the boundaries explicitly. That would have been future-proof (new software in the future can understand old strings) and past-proof (ancient software can understand and edit strings written in the future).