by mcpherrinm 5184 days ago
The problem is that Unicode doesn't know about language. Unicode is just characters.

Language-aware bits are more gross, but then language often is. It's not nicely structured like most of the other things we encounter when transforming data.

1 comments

> The problem is that Unicode doesn't know about language. Unicode is just characters.

I won't blame you for this, since it is a common mistake, but Unicode goes far beyond merely mapping characters to integers. The Standard Annexes, Technical Reports and Technical Standards cover pretty much all things localization, from line breaking [UAX14] and regular expressions [UTS18] to date and time formatting [UTS35] and sorting [UTS10].

And as it turns out, both uppercasing and titlecasing are covered by [UAX44], which documents the SpecialCasing.txt file. That file provides lowercase, uppercase and titlecase mappings (along with optional conditions) for characters with non-trivial mappings; the trivial 1:1 mappings are covered in the base UnicodeData.txt file.
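To make that concrete, here is a small sketch in Python 3, whose built-in str methods apply the full case mappings. The sample characters and variable names are my own picks for illustration, not anything mandated by the annex, and the code points are written as escapes so nothing gets mangled in transit.

```python
# One-to-many mappings of the kind listed in SpecialCasing.txt
sharp_s_upper = "\u00df".upper()    # LATIN SMALL LETTER SHARP S -> "SS"
ligature_upper = "\ufb01".upper()   # LATIN SMALL LIGATURE FI -> "FI"

# Titlecase is not the same thing as uppercase: the digraph U+01C6
# has a distinct titlecase form (U+01C5) and uppercase form (U+01C4)
dz_title = "\u01c6".title()
dz_upper = "\u01c6".upper()

# A context condition (Final_Sigma): a capital sigma at the end of a word
# lowercases to U+03C2 rather than U+03C3
sigma_lower = "\u039f\u0394\u039f\u03a3".lower()

# Language-specific conditions (e.g. Turkish dotless i) are NOT applied
# by the plain str methods, which take no locale argument
plain_i = "I".lower()

print(sharp_s_upper, ligature_upper, dz_title, dz_upper, sigma_lower, plain_i)
```

Note the last case: the language-dependent entries are in the data file too, but you need a locale-aware library to get them applied.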

[UAX14] http://www.unicode.org/reports/tr14/

[UTS18] http://www.unicode.org/reports/tr18/

[UTS35] http://www.unicode.org/reports/tr35/

[UTS10] http://www.unicode.org/reports/tr10/

[UAX44] http://www.unicode.org/reports/tr44/