

by numpad0 1113 days ago

> You can obviously process strings with two different encodings in the same program !!! Encodings are metadata, and metadata should be attached to data.

So, you can't, because Unicode processing can be (though I'm not sure how much it is) locale-dependent, and that metadata is NOT attached to the data. The Unicode Consortium has messed up non-Latin languages multiple times, causing hacks and new standards to be built on top of UTF-8. Han Unification immediately comes to mind[1], but there are others, such as the Korean Mess[2] and the Cambodian Khmer problem[3], to name a few. I don't quite understand why it always has to be like that.

1: Sets of characters from zh-Hans(zh-CN), zh-Hant(zh-TW), ko-KR, ja-JP that were deemed "same" were merged into the same code points, in an attempt to keep commonly used UTF-8 in nice 2 bytes

2: Korean Hangul characters were literally relocated between Unicode 1.1 and Unicode 2.0, causing affected characters written in 1.1 to display as unrelated characters

3: Reportedly the Consortium simply did not have a Cambodian linguist(???) (partly due to the unrest and genocide that took place during the 60s-80s)

1 comment

chubot 1112 days ago

Well, what I'm saying is: if you have 2 different web pages, with 2 different declared encodings

Then a decent library design would let you process those in different threads in the same program

A global variable like LANG= inhibits that

So if you have metadata, it should be attached to the DATA, and not the CODE
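To make that concrete, here is a minimal Python sketch of "metadata attached to the data". The `Page` class and `decode_page` function are hypothetical names made up for illustration, and the two sample pages are invented; the point is only that each blob of bytes carries its own declared encoding, so two threads can decode differently encoded pages in one process without consulting any global setting.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    """Raw bytes plus the encoding declared for them (hypothetical type)."""
    body: bytes
    encoding: str


def decode_page(page: Page) -> str:
    # The encoding comes from the data itself, not from LANG or any
    # other process-wide setting.
    return page.body.decode(page.encoding)


# Two made-up pages with different declared encodings.
pages = [
    Page("こんにちは".encode("shift_jis"), "shift_jis"),
    Page("héllo".encode("latin-1"), "latin-1"),
]

# Each page is decoded in its own worker thread, in the same program.
with ThreadPoolExecutor(max_workers=2) as pool:
    decoded = list(pool.map(decode_page, pages))

print(decoded)
```

Nothing here depends on the environment of the process, which is the property a global like LANG= takes away.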

---

Same thing with a file system. You can obviously have 2 different files on the same disk with different encodings. So Python's global FS encoding and global default encoding don't make any sense.

They are basically "punting" on the problem of where the metadata is, and the programmer often has NO WAY to solve that problem!
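A rough sketch of the file case, assuming the program learns each file's encoding from somewhere out of band. The `declared` dict below is a made-up stand-in for that knowledge, since the file system itself stores none of it. Passing `encoding=` explicitly means the process-wide default never gets a say:

```python
import tempfile
from pathlib import Path

TEXT = "naïve café"

# Hypothetical per-file metadata: file name -> encoding. The file system
# does not record this, so the program has to carry it alongside the data.
declared = {"a.txt": "utf-8", "b.txt": "utf-16"}

with tempfile.TemporaryDirectory() as tmp:
    # Write the same text into two files using two different encodings.
    for name, enc in declared.items():
        (Path(tmp) / name).write_text(TEXT, encoding=enc)

    # The bytes on disk differ from file to file...
    raw = {name: (Path(tmp) / name).read_bytes() for name in declared}

    # ...but each file decodes correctly when read with its own encoding.
    texts = {
        name: (Path(tmp) / name).read_text(encoding=enc)
        for name, enc in declared.items()
    }

print(texts)
```

If `declared` is missing or wrong, no global setting can recover the right answer for both files at once, which is the "punting" described above.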

---

The issues you mention are interesting but I think independent of what I'm saying