| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by gpderetta 1109 days ago

As a bit of history, Windows NT 3.1 was published in summer '93, and it was the first Windows version with Unicode support (UCS-2, not UTF-16 that didn't exist yet). Presumably development started well before that.

UTF-8 was publicly presented at USENIX at the beginning of '93. Not sure when Unicode incorporated it.

It is unlikely that Windows would have been changed at the last minute to use it, especially as the variable encoding of UTF-8 was significantly more complicated than the fixed size UCS-2.

1 comments

chubot 1108 days ago

Thanks, yeah that's basically what I thought, but it's nice to know it was the same year!

If only UTF-8 had been invented a little earlier, we could have avoided so much pain :-(

The idea of global varables like LANG= and LC_TYPE= in C is utterly incoherent.

Python's notion of "default file system encoding" is likewise incoherent.

You can obviously process strings with two different encodings in the same program !!! Encodings are metadata, and metadata should be attached to data. Encodings shouldn't be global variables!

Python 3 made things worse in many ways, largely due to adherence to Windows legacy, and then finally introduced UTF-8 mode:

https://vstinner.github.io/painful-history-python-filesystem...

link

numpad0 1108 days ago

> You can obviously process strings with two different encodings in the same program !!! Encodings are metadata, and metadata should be attached to data.

So, you can't, because Unicode processing can be (though I'm not sure how much is) locale dependent, and that that metadata is NOT attached to data. Unicode Consortium had been messing up non-Latin languages multiple times, causing hacks and new standards to build on top of UTF-8. Han Unification immediately comes to mind[1], but there are others as the Korean Mess[2], Cambodian Khmer problem[3], to name a few. I don't quite understand why it's always has to be like that.

1: Sets of characters from zh-Hans(zh-CN), zh-Hant(zh-TW), kr-KR, ja-JP that were deemed "same" were merqed lnto same code points, in an attempt to keep commonly used UTF-8 in nice 2 bytes

2: Korean Hangul characters were literally relocated between Unicode 1.1 to Unicode 2.0, causing affected characters written in 1.1 displayed in just unrelated characters

3: Reportedly the Consortium simply did not have a Cambodian linguist(???) (partly due to unrest and genocide that took place during 60s-80s)

link

chubot 1108 days ago

Well what I'm saying if you have 2 different web pages, with 2 different declared encodings

Then a decent library design would let you process those in different threads in the same program

A global variable like LANG= inhibits that

So if you have metadata, it should be attached to the DATA, and not the CODE

---

Same thing with a file system. You can obviously have 2 different files on the same disk with different encodings. So Python's global FS encoding and global encoding doesn't make any sense.

They are basically "punting" on the problem of where the metadata is, and the programmer often has NO WAY to solve that problem!

---

The issues you mention are interesting but I think independent of what I'm saying

link