by senknvd
1667 days ago
Funnily enough, the situation on FreeBSD is reversed: cut(1) indexes by Unicode codepoints and awk(1) indexes by bytes. But, more importantly, just indexing by codepoints isn't enough. Grapheme clusters are the real unit of user-perceived characters [1][2]. Technically and practically (just replace the anime music with Indian classical music and you're sure to find some artists whose names can't be split at codepoint boundaries without breaking a character), both implementations of cut(1) are broken.

[1]: https://manishearth.github.io/blog/2017/01/14/stop-ascribing...
[2]: https://unicode.org/reports/tr29/
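
To make the byte / codepoint / grapheme distinction concrete, here is a rough Python sketch. The sample name and the helper `approx_clusters` are my own illustration, not anything cut(1) or awk(1) actually does, and the helper is only an approximation: it glues combining marks onto the preceding codepoint, which is far short of the full UAX #29 rules in [2] (no conjuncts, ZWJ sequences, Hangul, or emoji handling).

```python
import unicodedata


def approx_clusters(text):
    """Group each codepoint with the combining marks that follow it.

    Hypothetical helper for illustration only. It treats any codepoint
    whose general category starts with "M" (Mn, Mc, Me) as part of the
    previous cluster. This is NOT a full UAX #29 implementation.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# Sample name (an assumption for the demo): RA, VA, VOWEL SIGN I.
name = "\u0930\u0935\u093f"

print(len(name.encode("utf-8")))   # 9 bytes
print(len(name))                   # 3 codepoints
print(len(approx_clusters(name)))  # 2 user-perceived characters

# Cutting after two codepoints silently drops the vowel sign,
# so the second syllable changes.
print(name[:2])

# Cutting after four bytes lands in the middle of a codepoint.
try:
    name.encode("utf-8")[:4].decode("utf-8")
except UnicodeDecodeError as exc:
    print("invalid UTF-8:", exc.reason)
```

A byte-indexed cut can produce invalid UTF-8, and a codepoint-indexed cut can produce valid text that no longer says what the original said. Only cluster-aware indexing keeps the second syllable intact.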