by chubot
1114 days ago
Thanks, yeah that's basically what I thought, but it's nice to know it was the same year! If only UTF-8 had been invented a little earlier, we could have avoided so much pain :-( The idea of global variables like LANG= and LC_CTYPE= in C is utterly incoherent. Python's notion of a "default file system encoding" is likewise incoherent. You can obviously process strings with two different encodings in the same program! Encodings are metadata, and metadata should be attached to data. Encodings shouldn't be global variables! Python 3 made things worse in many ways, largely due to adherence to Windows legacy, and then finally introduced UTF-8 mode: https://vstinner.github.io/painful-history-python-filesystem...
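To make the "metadata attached to data" point concrete, here is a minimal sketch in Python. The `EncodedText` type is hypothetical (it is not from any library); it only illustrates carrying the encoding alongside the bytes instead of relying on a process-wide default.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodedText:
    """Raw bytes plus the encoding they were written in (hypothetical type)."""
    data: bytes
    encoding: str

    def text(self) -> str:
        # Decode with the encoding carried by the value, not a global default.
        return self.data.decode(self.encoding)


# Two values with different encodings coexist in one program.
a = EncodedText("café".encode("utf-8"), "utf-8")
b = EncodedText("café".encode("latin-1"), "latin-1")

assert a.data != b.data      # the stored bytes differ
assert a.text() == b.text()  # the decoded text is the same
```

Nothing here consults LANG, LC_CTYPE, or the interpreter's default encoding; each value says how it should be decoded.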
So, you can't, because Unicode processing can be locale dependent (though I'm not sure how much of it is), and that metadata is NOT attached to the data. The Unicode Consortium has messed up non-Latin scripts multiple times, leading to hacks and new standards built on top of UTF-8. Han unification immediately comes to mind[1], but there are others, such as the Korean mess[2] and the Cambodian Khmer problem[3], to name a few. I don't quite understand why it always has to be like that.
1: Sets of characters from zh-Hans (zh-CN), zh-Hant (zh-TW), ko-KR, and ja-JP that were deemed the "same" were merged into the same code points, in an attempt to fit commonly used characters into a 16-bit (2-byte) code space
2: Korean Hangul characters were literally relocated between Unicode 1.1 and Unicode 2.0, so affected text written under 1.1 displays as unrelated characters
3: Reportedly the Consortium simply did not have a Cambodian linguist(???) (partly due to the unrest and genocide that took place during the 60s-80s)
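One small, well-known case of locale-dependent processing is Turkish casing, where dotted and dotless i are distinct letters. Python's `str.upper()` applies the default Unicode mapping and knows nothing about the language of the string, so the caller has to supply it. The sketch below assumes a hypothetical helper `upper_for` with a deliberately minimal Turkish/Azerbaijani rule; it is an illustration, not a complete implementation of language-specific casing.

```python
def upper_for(text: str, lang: str) -> str:
    """Uppercase text, applying a minimal Turkish/Azerbaijani tailoring.

    `lang` is a language tag supplied by the caller; the string itself
    carries no such information.
    """
    if lang in ("tr", "az"):
        # Dotted i uppercases to U+0130 (capital I with dot above);
        # dotless i (U+0131) uppercases to plain I.
        text = text.replace("i", "\u0130").replace("\u0131", "I")
    return text.upper()


default_result = upper_for("istanbul", "en")
turkish_result = upper_for("istanbul", "tr")
assert default_result != turkish_result  # same input, different language, different output
```

The same code points give different correct results depending on a language tag that lives outside the string, which is the "metadata is NOT attached to data" problem in miniature.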