For example, the text "ch" (U+0063 U+0068) is two grapheme clusters in English contexts, but one grapheme cluster in Czech contexts, collated between "h" and "i". [1]

As far as Unicode is concerned, the text "Chemie" is encoded exactly the same whether it's the German or the Czech word. However, a German will say it has six letters and a Czech will say it has five.
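
To make the letter count concrete, here is a minimal Python sketch. The `TAILORED_DIGRAPHS` table and the `count_letters` function are hypothetical names made up for illustration, not part of any library, and the table is hand-written rather than taken from CLDR. It also walks code points rather than default grapheme clusters, which is only good enough for plain ASCII words like this one.

```python
# Hypothetical tailoring table: locale -> sequences that count as ONE letter.
# Hand-written for illustration; real data would come from CLDR.
TAILORED_DIGRAPHS = {
    "cs": ("ch",),  # Czech treats "ch" as a single letter
    "de": (),       # German has no such tailoring for "ch"
}


def count_letters(text: str, locale: str) -> int:
    """Count user-perceived letters, merging locale-specific digraphs."""
    digraphs = TAILORED_DIGRAPHS.get(locale, ())
    lowered = text.lower()
    i = 0
    count = 0
    while i < len(lowered):
        for digraph in digraphs:
            if lowered.startswith(digraph, i):
                i += len(digraph)  # consume the whole digraph as one letter
                break
        else:
            i += 1  # no digraph matched: consume a single code point
        count += 1
    return count


print(count_letters("Chemie", "de"))  # 6
print(count_letters("Chemie", "cs"))  # 5
```

Same code points in, different answers out: the only thing that changed is the locale argument.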

Unicode provides a unified way to express international characters within the same text, but context external to the text (i.e. the locale) is still required to collate and manipulate it the way readers of a given language expect.
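
The collation side can be sketched the same way. This is a simplified illustration, not real Czech collation: `CZECH_ORDER` and `czech_sort_key` are hypothetical names, the alphabet ignores diacritics (č, ř, š, ...) entirely, and in practice you would use a library implementing the Unicode Collation Algorithm with CLDR tailorings instead of rolling your own.

```python
# Simplified Czech alphabet without diacritics: "ch" sits between "h" and "i".
# Assumption for illustration only; real Czech collation has more rules.
CZECH_ORDER = list("abcdefgh") + ["ch"] + list("ijklmnopqrstuvwxyz")
RANK = {unit: position for position, unit in enumerate(CZECH_ORDER)}


def czech_sort_key(word: str) -> list[int]:
    """Split a word into collation units and map each to its alphabet rank."""
    lowered = word.lower()
    key = []
    i = 0
    while i < len(lowered):
        # Treat "ch" as one unit; everything else is a single code point.
        unit = "ch" if lowered.startswith("ch", i) else lowered[i]
        # Units outside the simplified alphabet sort after all known letters.
        key.append(RANK.get(unit, len(CZECH_ORDER)))
        i += len(unit)
    return key


words = ["chata", "hrad", "cena", "internet"]
print(sorted(words))                      # ['cena', 'chata', 'hrad', 'internet']
print(sorted(words, key=czech_sort_key))  # ['cena', 'hrad', 'chata', 'internet']
```

A plain code point sort puts "chata" under "c"; the tailored key moves it after "hrad", which is where a Czech dictionary would list it.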

The default definition of grapheme clusters is simply a compromise for a global, locale-less understanding of collation/manipulation of Unicode characters. From UAX #29 [2]:

> The Unicode definitions of grapheme clusters are defaults: not meant to exclude the use of more sophisticated definitions of tailored grapheme clusters where appropriate. Such definitions may more precisely match the user expectations within individual languages for given processes. For example, “ch” may be considered a grapheme cluster in Slovak, for processes such as collation. The default definitions are, however, designed to provide a much more accurate match to overall user expectations for what the user perceives of as characters than is provided by individual Unicode code points.

> Note: The default Unicode grapheme clusters were previously referred to as "locale-independent graphemes." The term cluster is used to emphasize that the term grapheme is used differently in linguistics. For simplicity and to align terminology with Unicode Technical Standard #10, “Unicode Collation Algorithm” [UTS10], the terms default and tailored are preferred over locale-independent and locale-dependent, respectively.

[1] https://en.wikipedia.org/wiki/Ch_(digraph)

[2] http://www.unicode.org/reports/tr29/