by est 4995 days ago

UCS-2 is better than UTF-8 internally because counting Unicode characters is faster: every character is just 2 bytes, instead of 1, 2, 3, or even 4 bytes.

In Python 2:

    len(u'汉字') == 2
    len( '汉字') == 4 # or maybe 6, it varies based on console encoding and CPython options
    len(u'汉字'.encode('utf8')) == 6

5 comments

csense 4995 days ago

Issues like this are why I hate internationalization.

If it were as simple as making everything Unicode and having it Just Work, it would be possible. But the number of difficulties and problems I've seen has made me decide -- and tell everyone I know -- to avoid dealing with internationalization if you value your sanity.

Issues discussed here:

* Different incompatible variable-length encodings

* Broken implementations

* Character count != length

Issues discussed elsewhere:

* It's the major showstopper keeping people away from Python 3

* Right-to-left vs. left-to-right [1]

* BOM at the beginning of the stream
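
On the BOM item, a small sketch of how it behaves in Python 3's built-in codecs (I'm assuming CPython's standard `utf-16`, `utf-16-le`, and `utf-16-be` codecs here, nothing else). The BOM only exists in the encoded bytes: the generic codec writes one in the machine's byte order, the explicit-endian codecs write none, and decoding consumes it again.

```python
import codecs

text = "hi"

# The generic "utf-16" codec prepends a BOM in the machine's native byte order.
with_bom = text.encode("utf-16")
# The explicit-endian codecs write no BOM at all.
little = text.encode("utf-16-le")   # b'h\x00i\x00'
big = text.encode("utf-16-be")      # b'\x00h\x00i'

has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
# Decoding with "utf-16" reads the BOM to pick the byte order and drops it,
# so it never shows up as a character in the resulting str.
round_trip = with_bom.decode("utf-16")

print(has_bom, len(with_bom), round_trip)
```

Note that the UTF-16 bytes for plain ASCII text are full of zero bytes, which is part of why that encoding and zero-terminated C strings don't mix.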

Conceptual issues -- questions I honestly don't know the answer to when it comes to internationalization. I don't even know where to look to find answers to these:

* If I split() a string, does each piece get its own BOM?

* If I copy-paste text characters from A into B, what encoding does B save as? If B isn't a text editor, what happens?

* If a chr(0x20) is part of a multi-byte escape sequence, does it count as a space when I use .split()?

* When it encounters right-to-left, does the renderer have to scan the entire string to figure out how far to the right it goes? Wouldn't this mean someone could create a malicious length-n string that took O(n^2) time to process?

* What happens if I try to print() a very long line -- more than a screenful -- with a right-to-left escape in a terminal?

* If I have a custom stream object, and I write characters to it, how does it "know" when to write the BOM?

* Do operators like [] operate on characters, bytes, 16-bit words, or something else?

* Does getting the length of a string really require a custom for loop with a complicated bit-twiddling expression?

* Is it possible for a zero byte to be part of a multibyte sequence representing a character? How does this work with C APIs that expect zero-terminated strings?

* If I split() a string to extract words, how do the substrings know the BOM, right-to-left, and other states that apply to themselves? What if those strings are concatenated with other strings that have different values for those states?

* What exactly does "generating locales" do on Debian/Ubuntu and why aren't those files shipped with all the other binary parts of the distribution? All I know about locale generation is that it's some magic incantation you need to speak to keep apt-get from screaming bloody murder every time you run it on a newly debootstrapped chroot.

* Is there a MIME type for each historical, current, and future encoding? How do Web things know which encoding a document's in?

* How do other tools know what encoding a document uses? Is this something the user has to manually tell the tool -- should I be saying nano thing.txt --encoding=utf8? If the information about the format isn't stored anywhere, do you just guess until you get something that seems not to cause problems?

* If you're using UTF-16, what endianness is used? Is it the same as the machine endianness, or fixed? What operations cause endian conversion?

* Should my C programs handle the possibility that sizeof(char) != 1? Or at least check for this case and spit out a warning or error?

* What automated tools exist to remove BOMs or change accented characters into regular ones, if other automated tools don't accept Unicode? Once upon a time, I could not get javac to recognize source files I'd downloaded which had the author's name, which included an 'o' with two dots over it, in comments. That was the only non-ASCII character in the files, and I ended up removing them; syncing local patches with upstream would have been a nightmare. Do people in different countries run incompatible versions of programming languages that won't accept source files that are byte-for-byte identical? It sounds ridiculous, but this experience suggests it may be the case.
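
For the last question, one possible approach is a few lines of Python 3 using only the standard library. This is a sketch, not a recommendation for every case: the function name `strip_bom_and_accents` and the sample author name are made up, and it assumes the input is UTF-8. The `utf-8-sig` codec drops a leading BOM if present, and NFD normalization splits an accented letter into a base letter plus a combining mark that can then be filtered out.

```python
import unicodedata

def strip_bom_and_accents(raw: bytes) -> str:
    # "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present
    text = raw.decode("utf-8-sig")
    # NFD splits e.g. "ö" into "o" + U+0308 (combining diaeresis)
    decomposed = unicodedata.normalize("NFD", text)
    # drop the combining marks, keep the base characters
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Hypothetical source line: UTF-8 BOM, then a comment with an "ö" in it
sample = b"\xef\xbb\xbf// author: J\xc3\xb6rg\n"
print(strip_bom_and_accents(sample))   # // author: Jorg
```

This is lossy (letters like "ø" or "ß" have no decomposition and pass through unchanged), so for the javac case it is probably less invasive to leave the file alone and pass the compiler the file's encoding with its `-encoding` option.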