by tomp 2386 days ago
Can you give examples of locale-dependent things, or issues with extended grapheme clusters?
2 comments

Hangul normalization and collation are broken in Unicode, albeit for slightly different reasons. The Unicode Collation Algorithm explicitly devotes two sections to Hangul; the first, on "trailing weights" [1], is recommended for its detailed explanation.

The Unicode Text Segmentation standard [2] explicitly mentions that Indic aksaras [3] require tailoring of grapheme clusters. Depending on your point of view, orthographic digraphs also count as examples (Dutch "ij", for instance, is sometimes considered a single character).

[1] https://www.unicode.org/reports/tr10/#Trailing_Weights

[2] https://unicode.org/reports/tr29/

[3] https://en.wikipedia.org/wiki/Aksara#Grammatical_tradition

For example, the text "ch" (U+0063 U+0068) is two grapheme clusters in English contexts, but one grapheme cluster in Czech contexts, collated between "h" and "i". [1]

According to Unicode, the text "Chemie" is written exactly the same whether it's the German or the Czech word. However, a German will say it has six letters and a Czech will say it has five.
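
To make the "Chemie" point concrete, here is a minimal Python sketch of locale-tailored letter counting. The `DIGRAPHS` table and the `letters` function are hypothetical names invented for this illustration, not part of any Unicode library, and the code assumes plain input with no combining marks (a real implementation would start from default grapheme clusters and use proper tailoring data, e.g. from CLDR).

```python
# Hypothetical tailoring table: digraphs that count as one letter per locale.
# Only the entries needed for this example are listed.
DIGRAPHS = {
    "cs": ("ch",),  # Czech treats "ch" as a single letter
    "de": (),       # German has no such tailoring for "ch"
}


def letters(text, locale):
    """Split text into user-perceived letters for the given locale.

    Assumes simple input without combining marks; this is a sketch,
    not an implementation of UAX #29.
    """
    digraphs = DIGRAPHS.get(locale, ())
    result = []
    i = 0
    while i < len(text):
        for d in digraphs:
            # Case-insensitive match so that "Ch" at a word start is caught.
            if text[i:i + len(d)].lower() == d:
                result.append(text[i:i + len(d)])
                i += len(d)
                break
        else:
            # No digraph matched here: take a single code point.
            result.append(text[i])
            i += 1
    return result


print(len(letters("Chemie", "de")))  # 6
print(len(letters("Chemie", "cs")))  # 5
```

The code points are identical in both calls; only the locale argument, which lives outside the text, changes the answer.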

Unicode provided a unified way to express international characters within the same text, but the context (i.e. locale) external to the text is still required to sensibly collate and manipulate it according to human sensibilities.
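
As a rough illustration of collation needing locale context, the Python sketch below sorts the same words by plain code point order and by a simplified Czech order in which "ch" sits between "h" and "i". The `czech_key` function is a hypothetical helper written for this comment; it handles only unaccented a-z plus "ch" at the primary level, so it is nowhere near a full Unicode Collation Algorithm tailoring.

```python
import string

# Simplified Czech primary order: a..h, ch, i..z.
# Assumption: accented letters are left out to keep the sketch short.
_ORDER = list(string.ascii_lowercase)
_ORDER.insert(_ORDER.index("h") + 1, "ch")
_RANK = {letter: n for n, letter in enumerate(_ORDER)}


def czech_key(word):
    """Return a sort key that treats "ch" as one letter after "h"."""
    w = word.lower()
    key = []
    i = 0
    while i < len(w):
        if w.startswith("ch", i):
            key.append(_RANK["ch"])
            i += 2
        else:
            # Unknown characters sort after all known letters.
            key.append(_RANK.get(w[i], len(_ORDER)))
            i += 1
    return key


words = ["chata", "cihla", "hrad", "internet"]
print(sorted(words))                 # ['chata', 'cihla', 'hrad', 'internet']
print(sorted(words, key=czech_key))  # ['cihla', 'hrad', 'chata', 'internet']
```

In practice you would reach for a locale-aware collator (ICU or similar) rather than hand-rolling the key, but the effect is the same: the ordering depends on which locale you pick.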

The default definition of grapheme clusters is simply a compromise for a global, locale-less understanding of collation/manipulation of Unicode characters.

> The Unicode definitions of grapheme clusters are defaults: not meant to exclude the use of more sophisticated definitions of tailored grapheme clusters where appropriate. Such definitions may more precisely match the user expectations within individual languages for given processes. For example, “ch” may be considered a grapheme cluster in Slovak, for processes such as collation. The default definitions are, however, designed to provide a much more accurate match to overall user expectations for what the user perceives of as characters than is provided by individual Unicode code points.

> Note: The default Unicode grapheme clusters were previously referred to as "locale-independent graphemes." The term cluster is used to emphasize that the term grapheme is used differently in linguistics. For simplicity and to align terminology with Unicode Technical Standard #10, “Unicode Collation Algorithm” [UTS10], the terms default and tailored are preferred over locale-independent and locale-dependent, respectively.

[1] https://en.wikipedia.org/wiki/Ch_(digraph)

[2] http://www.unicode.org/reports/tr29/