| HN Mirror


by millstone 4706 days ago

Code points are a better abstraction than code units, but they're still a piss-poor abstraction.

Consider the problem of producing a valid substring from a Unicode string. It's important that you not split surrogate pairs, and it's true that working with code points spares you from that particular problem. But it's also important that you not split combining marks, or zero-width joiner sequences, or Hangul syllables... (see http://www.unicode.org/reports/tr29/ for all the gory details).

An average programmer cannot correctly extract a substring from a Unicode string whether given the code units or the code points. These abstractions are inadequate: instead you want something like grapheme clusters.
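
To make that concrete, here is a rough Python sketch. The helper names (naive_clusters, naive_substring) are made up for illustration, and the grouping rule is a deliberate simplification: it only glues combining marks onto the preceding code point, so it is not a real TR29 implementation and will still get zero-width joiner sequences, Hangul jamo, and so on wrong.

```python
import unicodedata


def naive_clusters(text):
    """Group each code point with the combining marks that follow it.

    Hypothetical helper: a crude approximation of grapheme clusters,
    NOT the full segmentation rules from UAX #29 (TR29).
    """
    clusters = []
    for ch in text:
        # A nonzero canonical combining class means "attaches to the previous character".
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def naive_substring(text, start, end):
    """Slice by approximate clusters instead of by code points."""
    return "".join(naive_clusters(text)[start:end])


# "cafés" spelled with "e" followed by U+0301 COMBINING ACUTE ACCENT
word = "cafe\u0301s"

by_code_point = word[:4]                  # cuts between "e" and its accent
by_cluster = naive_substring(word, 0, 4)  # keeps the accent attached to "e"
```

Slicing by code point silently drops the accent, while slicing by cluster keeps it. For anything real you would want a library that implements the TR29 rules rather than a loop like this.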

1 comment

pyre 4706 days ago

This was my reaction too. It's Unicode all the way down... :)
