| HN Mirror

I have attempted to answer your questions, but as a general comment, most of them would be answered by having a basic understanding of unicode and popular encodings for it. You're a programmer, understanding this is part of your job. (If you aren't a programmer,then my my do you sure ask a lot of technical questions.) You could read this: http://www.joelonsoftware.com/articles/Unicode.html. Then read a bit more. Then forget everything. Then read a bunch more and then give a small talk on unicode. Then suddenly it's 2AM and you're answering asinine questions about unicode on HN. What I'm saying is, be careful.

One last general comment I have is that a lot of your questions relate to things that you shouldn't necessarily need to understand the exact details of to do a lot of things. Instead, you should use an off-the-shelf, battle-tested unicode library. As far as I know, they exist for every platform by now. Of course this doesn't free you from knowing stuff, but it means that instead of knowing exactly the range of the surrogate pairs, all you need is a mental model of what's going on. When you're surprised, you can begin to fill in the gaps.

1. Use UTF-8 if you're using byte strings, or your platforms unicode string type. If the latter, it will have its own internal encoding, and you'll work at the character level. In either case, as soon as a string comes in from elsewhere (network, disk, OS), sanitize it to a known state, i.e., a UTF-8-encoded byte string, or your platform's unicode string type. In case you can't do that, reject the string, log it, and figure out what's going on.

2. Use a non-broken UTF-8 implementation.

3. Yeah. Your UTF-8 implementation is handling this for you now.

4. Not familiar with this/don't use Python enough.

5. Haven't dealt with this, but it's definitely got some complications to it. I would guess more for layout than for programming, but I can't be sure.

6. The BOM tells your decoder to that the forthcoming stream is little endian or big endian. It itself is not actually part of the string. Admittedly, a lot of programs have trouble with BOMS still, which is why you're using UTF-8 (without BOMs, because you don't need it.)

7. No, when you .split, the BOM is no longer part of the string. You don't have a BOM again until you transmit the string or write it to disk, as it's not needed (your implementation uses whatever byte ordering it likes internally, or the one you specified if using a byte string.)

8. The string is probably transmitted in whatever the internal encoding of your OS is. That means UTF-16 on Windows and UTF-8 on Linux, AFAIK. If you're writing a desktop app, your paste event should treat this string as a pile of unsafe garbage until you've put it into a known state (i.e., a unicode object or byte string in a known encoding.) When you save it, it saves in whatever encoding you specify. You should specify UTF-8.

9. I'm not sure exactly what we're talking about here. The byte 0x20 is a valid UTF-8 character (the space), or part of a 2-byte or 4-byte character in UTF-16. However, as long as you're working with a unicode string type, your .split function operators on the logical character level, not the byte level. If you're using a byte string (e.g., python's string type), then yes, the byte 0x20 is a space, because your split method assumes ASCII. If you try to append a byte string containing 0x20 to a unicode string, you should get an exception, unless you also specify an encoder which takes your byte string and turns it into a unicode string. Your unicode string implementation may have a default encoder, in which case the byte string would be interpreted as that encoding, and an exception would only be thrown if it's invalid (which means if the default encoding was UTF-16, this would throw an exception, because 0x20 is not a valid UTF-16 character.) This answer is long, and HN's formatting options are lacking, so let me know, and I'll try to be clearer.

10. Again I haven't yet dealt with RTL, but the characters are in the same order internally regardless of whether they're to be displayed RTL or LTR. It's a sequence of bytes or characters, the encoder and decoder do not care what those characters actually are. So if I have "[RTL Char]ABC ", that's the order it will be in memory, even though it will display as " CBA." In UTF-8, this string would be 7 bytes long, in UTF-16, 10. In both cases, the character length of the string is 5.

11. I'm not sure why this would be a problem, provided your terminal can handle unicode, which most can in my experience (there's some fiddling you have to do in windows.) It should wrap or break the line the same as with RTL. I believe the unicode standard includes rules for how to break text, but not positive.

12. I'm not really sure what you mean. Your object will write whatever bytes it writes. If you're using a UTF-16 encoder, usually you can specify the Endianness and whether to write a BOM or not.

13. If you're using a unicode type in a language like python, [] operates on logical characters. If you're using a byte string type (python's string, a PHP or C string, for ex), [] operates on bytes.

14. If you're using a unicode type, split returns unicode objects, which have their own internal encoding. Again, right-to-left characters look exactly like any other character to the encoder and decoder. If you're using byte strings, you need to use a unicode-aware split function, and tell it what encoding the string is in. It will return to you strings in the same encoding (and endianness.)

16. Not familiar with this.

17. MIME types are separate from encoding. I can have an HTML page that is in UTF-8 or UTF-16, both have the MIME type txt/html, same with txt/plain and so on. MIME and encoding operate at different levels. Web things knowing the encoding is actually fairly complicated. The correct thing to do is to send an HTTP header that specifies the encoding, and add a <meta charset="bla"> attribute in the HEAD of the page. If you don't do this, I think browsers implement various heuristics to attempt to detect the encoding. Having a type for every future encoding is an unreasonable demand, because the future is notoriously difficult to predict. If you have a crystal ball, I'm sure the standards committees would love to hear from you.

18. Also somewhat complicated! I think there are tools which can guess for you, using similar heuristics that I mentioned that browsers use. You should specify the encoding using your text editor, as you showed. It is not too much to ask to tell people to set their text editors up correctly. Your projects should have a convention that all files use, just like with code formatting. If someone tries to check in something that's not valid UTF-8, you can have a hook that rejects the commit if you want. Then they can fix it. The format is not stored anywhere, which is why you should have the convention and yell at people who mess up (not literally, be nice.) If you don't know what a file is, you can use aforementioned tools, or try a few encodings and see what works. Yes, it's a hassle, which is why you should set a convention.

19. You can specify LE/BE when you say what encoding something is. As in, you say, hey encode this here text as UTF-16 LE, and it says right-o, here we go!

20. A C char is not aware of the encoding, so that wouldn't have any effect! Some other guy said that a char is always a byte, so answer: no.

21. These are two wildly different things. A BOM is always FFEF or FEFF, depending on endianness, so you can look for those, and chop off the two bytes, I guess. For dealing with accents, look into unicode normal forms, they define a specific way to compose and decompose accents. I'm not sure about your javac woes, there ought to be some way to tell javac what encoding to expect, like you can do with python. It may be the case that javac guesses the encoding based on your locale, and his was different than yours.