
by senknvd 1713 days ago

Funnily enough, the situation on FreeBSD is reversed: cut(1) indexes by Unicode codepoints and awk(1) indexes by bytes.

But, more importantly, indexing by codepoints isn't enough either. Grapheme clusters are the real unit of user-perceived characters [1][2]. So both technically and practically, both implementations of cut(1) are broken: replace the anime music with Indian classical music and you're sure to find artists whose names can't be split at codepoint boundaries without tearing a character apart.
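To make the distinction concrete, here is a rough Python sketch (not how either cut(1) is implemented). The sample string is my own pick: the Devanagari syllable "कि", which a reader sees as one character but which is two codepoints and six UTF-8 bytes. Python's standard library has no grapheme-cluster segmentation, so this only shows where byte and codepoint indexing go wrong, not a correct UAX #29 split.

```python
import unicodedata

# Assumed sample: one user-perceived character made of two codepoints,
# DEVANAGARI LETTER KA (U+0915) followed by DEVANAGARI VOWEL SIGN I (U+093F).
name = "\u0915\u093f"

# Byte indexing sees six units (each codepoint is three bytes in UTF-8).
byte_len = len(name.encode("utf-8"))

# Codepoint indexing sees two units, although a reader sees one character.
codepoint_len = len(name)

# "Cutting" after the first codepoint separates the vowel sign from its
# consonant, leaving a dangling combining mark in the remainder.
first_codepoint = name[:1]
rest = name[1:]

print("bytes:", byte_len, "codepoints:", codepoint_len)
for ch in name:
    # Category "Lo" is a letter; "Mc" is a spacing combining mark.
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
```

A grapheme-aware cut would have to keep `first_codepoint` and `rest` together as a single field character, which means segmenting by the rules in [2] rather than by fixed-size units.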

[1]: https://manishearth.github.io/blog/2017/01/14/stop-ascribing...

[2]: https://unicode.org/reports/tr29/