by lifthrasiir 2389 days ago

Hangul normalization and collation are both broken in Unicode, albeit for slightly different reasons. The Unicode Collation Algorithm explicitly devotes two sections to Hangul; the first, on "trailing weights" [1], is the one to read for a detailed explanation.
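For anyone who wants to poke at the normalization side, here is a minimal sketch using Python's standard `unicodedata` module. It only shows how a precomposed syllable relates to conjoining jamo and that compatibility jamo are a separate set of code points which NFC leaves alone; it does not reproduce any specific breakage, and the helper name `code_points` is just something made up for the illustration.

```python
import unicodedata


def code_points(text: str) -> list[str]:
    """Render each character as U+XXXX (hypothetical helper for display)."""
    return [f"U+{ord(ch):04X}" for ch in text]


# A precomposed Hangul syllable (U+D55C).
syllable = "\uD55C"

# NFD splits it into conjoining jamo: leading consonant, vowel, trailing consonant.
decomposed = unicodedata.normalize("NFD", syllable)

# NFC puts the conjoining jamo back together into the single syllable.
recomposed = unicodedata.normalize("NFC", decomposed)

# The same letters spelled with Hangul *compatibility* jamo (U+3131 block).
# These have no canonical relation to the syllable, so NFC leaves them as-is.
compat = "\u314E\u314F\u3134"
compat_nfc = unicodedata.normalize("NFC", compat)

print(code_points(syllable))
print(code_points(decomposed))
print(code_points(recomposed))
print(code_points(compat_nfc))
```

So three visually similar spellings of one syllable exist, and only two of them are canonically equivalent.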

The Unicode Text Segmentation standard [2] explicitly mentions that Indic aksaras [3] require tailoring of grapheme clusters. Depending on your point of view, orthographic digraphs also count as examples (Dutch "ij", for instance, is sometimes considered a single character).
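To make the aksara point concrete, here is a rough sketch of what such a tailoring changes. This is a toy segmenter, not an implementation of the rules in [2]: it assumes that a cluster is a base character plus any following marks, and the `join_after_virama` flag (a made-up name) additionally glues the next consonant onto a cluster that ends in a virama, which is roughly what an aksara-aware tailoring wants.

```python
import unicodedata

# Canonical combining class assigned to virama characters.
VIRAMA_CCC = 9


def clusters(text: str, join_after_virama: bool) -> list[str]:
    """Toy segmentation: base + marks, optionally joining across a virama."""
    out: list[str] = []
    for ch in text:
        # Combining marks (categories Mn, Mc, Me) attach to the previous cluster.
        is_mark = unicodedata.category(ch).startswith("M")
        # Does the previous cluster end in a virama?
        after_virama = bool(out) and unicodedata.combining(out[-1][-1]) == VIRAMA_CCC
        if out and (is_mark or (join_after_virama and after_virama)):
            out[-1] += ch
        else:
            out.append(ch)
    return out


# Devanagari KA + VIRAMA + SSA + VOWEL SIGN I: four code points, one aksara.
word = "\u0915\u094D\u0937\u093F"

print(len(word))
print(clusters(word, join_after_virama=False))
print(clusters(word, join_after_virama=True))
```

Without the flag the conjunct is split after the virama into two clusters; with it the whole aksara stays together, which is what a reader of the script would usually call one "character".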

[1] https://www.unicode.org/reports/tr10/#Trailing_Weights

[2] https://unicode.org/reports/tr29/

[3] https://en.wikipedia.org/wiki/Aksara#Grammatical_tradition