> Never assume that the data you’re dealing with is UTF-8 — ASCII appears identical unless you view the hex to see if each character is taking one byte (ASCII) or three (UTF-8).
Um, what? This is just wrong. ASCII-equivalent characters take only one byte in UTF-8. Other characters may take two, three, or four bytes.
If the author actually viewed ASCII text that, when in UTF-8, took three bytes per character... I don't know what they were looking at, but it wasn't UTF-8.
Also, if the data is ASCII, and includes only legal 7-bit ASCII characters -- it is simultaneously ALSO valid and legal UTF-8. UTF-8 is a superset of ASCII.
I'm not sure this guy understands what he's talking about.
The concluding statement is a bit weird: "ASCII appears identical unless you view the hex to see if each character is taking one byte (ASCII) or three (UTF-8)"
That isn't accurate. ASCII text would appear identical even if 'you view the hex', because it is byte-for-byte identical in UTF-8; that's the whole point of UTF-8. You'd have to look at non-ASCII characters to see how they're encoded.
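For anyone who wants to check this themselves, here is a quick Python sketch (the sample string and characters are just ones I picked for illustration). It compares the bytes of an ASCII string under both encodings and then shows how many bytes a few non-ASCII characters take in UTF-8.

```python
# ASCII-only text produces exactly the same bytes under ASCII and UTF-8.
text = "Hello, world"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
same = ascii_bytes == utf8_bytes

# Only non-ASCII characters take more than one byte in UTF-8.
# Escapes are used so the exact code points are unambiguous.
samples = {
    "A": "A",                 # U+0041
    "e-acute": "\u00e9",      # U+00E9
    "euro": "\u20ac",         # U+20AC
    "grinning": "\U0001F600", # U+1F600
}
lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}

print(same)     # True
print(lengths)  # {'A': 1, 'e-acute': 2, 'euro': 3, 'grinning': 4}
```

So looking at the hex of pure ASCII data tells you nothing about whether it is "really" UTF-8, because there is no difference to see.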
Yes, the default Windows code page -- many pieces of software don't realize that registry keys, file paths, etc. are all encoded in a different code page if you are running, for example, Japanese Windows. (Also, it isn't exactly Shift-JIS...)
Yeah, I think he might have meant ISO-8859-1 or Windows 1252 rather than ASCII; but still, all of those characters would take up two bytes in UTF-8, not 3, unless you used combining diacritics rather than precomposed.
Yes, I meant all of the characters outside of the ASCII range. As in, there are no characters in ISO-8859-1 which take up more than 2 bytes in UTF-8. I guess there are a few in Windows-1252 which take up more than two bytes (like the Euro sign), so it's possible he meant Windows-1252 rather than ASCII.
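A rough way to check both claims in Python, assuming Python's built-in `latin-1` and `cp1252` codecs are a fair stand-in for ISO-8859-1 and Windows-1252. It walks the high half of each code page and measures the UTF-8 length of every character, and also measures the combining-diacritic case mentioned above.

```python
# ISO-8859-1: bytes 0x80-0xFF map to U+0080-U+00FF, all 2 bytes in UTF-8.
latin1_max = max(
    len(bytes([b]).decode("latin-1").encode("utf-8"))
    for b in range(0x80, 0x100)
)

# Windows-1252: collect the characters that need 3 bytes in UTF-8.
three_byte = []
for b in range(0x80, 0x100):
    try:
        ch = bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        continue  # a few cp1252 byte values are unassigned in Python's codec
    if len(ch.encode("utf-8")) == 3:
        three_byte.append(ch)

# "e" followed by a combining acute accent: 1 byte + 2 bytes.
combining_len = len("e\u0301".encode("utf-8"))

print(latin1_max)             # 2
print("\u20ac" in three_byte) # True: the Euro sign is one of them
print(combining_len)          # 3
```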
Some background not covered in an otherwise pretty good article:
"In general, don’t save a Byte Order Mark (BOM) — it’s not needed for UTF-8, and historically could cause problems."
This attitude comes from the agony of processing UTF-16 files. I interface with a group that finds it hilarious to send me textual data in UTF-16, and the first hard-won lesson you learn with UTF-16 is that, superficially, the default byte order should be correct 50% of the time if guessed randomly, but somehow it's always wrong. So say you read one line of a UTF-16 text file and pass it through a UTF-16 decoder. OK, no problemo: it had a BOM as the first glyph/byte/character/whatever and was converted and interpreted correctly. Then you read another line, just like you'd read and process a line of ASCII or UTF-8. However, they only give me a BOM at the start of the file, not at the start of each line, so invariably I translate that line to garbage because the bytes are swapped.
Now, there are programmatic ways to read the BOM and remember it. Or you can read the whole blasted multi-gig file into memory, de-UTF-16 it all at once, and then go through it line by line. But fundamentally it's a simple one-liner sysadmin-type job to just shove the file through a UTF-16 to UTF-8 translator program before it hits my processing system. I already had to decrypt it, unzip it, and verify its hash so I know they sent the whole file to me (and correctly), so adding a conversion stage is no big deal.
And this kind of UTF-16 experience is what leads people to say things like "oh, it's Unicode? That means I should squirt out BOMs as often as possible", even though that technically only applies to UTF-16 and is not helpful for UTF-8.
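For what it's worth, the "remember the BOM" approach doesn't have to mean loading the whole file. Here is a minimal Python sketch using the standard library's incremental decoder, which reads the BOM once and keeps the byte order for every later chunk. The function name `utf16_to_utf8_chunks` and the 3-byte chunk size are made up for illustration; a real job would feed it chunks read from the file.

```python
import codecs

def utf16_to_utf8_chunks(chunks):
    """Yield UTF-8 bytes for an iterable of UTF-16 byte chunks.

    The incremental decoder sees the BOM in the first chunk, remembers
    the byte order, and buffers code units split across chunk boundaries.
    """
    decoder = codecs.getincrementaldecoder("utf-16")()
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text.encode("utf-8")
    # Flush; raises if the input ended in the middle of a code unit.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail.encode("utf-8")

# Demo: big-endian UTF-16 with a single BOM at the start of the data.
sample_text = "line one\nline two \u20ac\n"
sample_data = codecs.BOM_UTF16_BE + sample_text.encode("utf-16-be")
# Deliberately awkward 3-byte chunks that split code units.
sample_chunks = [sample_data[i:i + 3] for i in range(0, len(sample_data), 3)]
converted = b"".join(utf16_to_utf8_chunks(sample_chunks))
print(converted == sample_text.encode("utf-8"))  # True
```

Once the data is UTF-8, splitting on newline bytes and decoding line by line works again, which is why doing the conversion as a separate stage up front is so convenient.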
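If you do end up receiving UTF-8 with a BOM on the front, a small Python sketch of the difference between the plain `utf-8` codec and the BOM-aware `utf-8-sig` codec (the sample payload is made up):

```python
import codecs

payload = "hello".encode("utf-8")
with_bom = codecs.BOM_UTF8 + payload  # b'\xef\xbb\xbf' + b'hello'

# The plain codec keeps the BOM as a U+FEFF character at the front.
plain = with_bom.decode("utf-8")

# "utf-8-sig" strips one leading BOM if present and is a no-op otherwise.
stripped = with_bom.decode("utf-8-sig")
no_bom = payload.decode("utf-8-sig")

print(repr(plain))     # '\ufeffhello'
print(repr(stripped))  # 'hello'
print(repr(no_bom))    # 'hello'
```

That stray U+FEFF at the front is the kind of thing that breaks string comparisons and header parsing, which is presumably what "historically could cause problems" refers to.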
I hate to be "that SEO guy", but the OP needs to do some SEO. The submitted title here is nowhere to be seen, which is too bad because it's a great title and one that I would try to Google after forgetting to bookmark this page.
Luckily I do use Pinboard, which auto-grabs the title, if it existed. But this is a helpful reference to many devs who don't read HN, and it's all but obscured.
Oh, one more fun fact: some emoji characters occupy more than one _Unicode_ character, and can be encoded in different ways depending on the device that uses them. (Before they were introduced into Unicode, they used character codes designated for custom platform-specific stuff).
Debugging a text input field where user can enter emoji & RTL text is FUN.
Are there really multi-character emoji? Or is it that they are single characters on an astral plane which are encoded as two code units in UTF-16, and therefore behave rather like two characters if your language uses 16-bit chars?
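The astral-plane part at least is easy to check. A quick Python sketch, with U+1F600 as an arbitrary example character:

```python
# One code point outside the Basic Multilingual Plane.
ch = "\U0001F600"

code_points = len(ch)                           # Python 3 counts code points
utf16_units = len(ch.encode("utf-16-le")) // 2  # 16-bit code units
utf16_hex = ch.encode("utf-16-be").hex()        # the surrogate pair
utf8_len = len(ch.encode("utf-8"))

print(code_points)  # 1
print(utf16_units)  # 2
print(utf16_hex)    # d83dde00
print(utf8_len)     # 4
```

In a language whose strings are sequences of 16-bit units, that one character reports a length of 2, which is where a lot of the "it behaves like two characters" confusion comes from.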
This is not as strange as it might look at first glance.
A lot of ordinary characters can be represented as two (or more) Unicode code points - for instance an unaccented Latin letter and a combining accent.
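A small Python sketch of that, using the standard `unicodedata` module. The two strings below render the same but are different code point sequences until they are normalized.

```python
import unicodedata

precomposed = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # "e" + COMBINING ACUTE ACCENT

equal_raw = precomposed == decomposed
lengths = (len(precomposed), len(decomposed))

# NFC composes where possible, NFD decomposes.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

print(equal_raw)  # False
print(lengths)    # (1, 2)
print(nfc_equal)  # True
print(nfd_equal)  # True
```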
Flag emoji seem more like a hack on the side of the font or text renderer. If you look at the Unicode representation, it actually spells out the ISO country code in regional indicator symbols. Some fonts probably define a ligature for each such pair that looks like a flag instead of two separate letter symbols.
Representation of digits inside keycaps also makes sense to me: it's a normal digit eight (dating back to ASCII) plus a combining character that looks like a keycap.
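Here is a Python sketch that takes both sequences apart. The US flag and the digit eight are just examples; as far as I know the emoji-style keycap sequence usually also carries a U+FE0F variation selector between the digit and the keycap mark, so I included it here as an assumption.

```python
import unicodedata

# A flag is two regional indicator symbols, U+1F1E6 ("A") to U+1F1FF ("Z").
flag = "\U0001F1FA\U0001F1F8"
REGIONAL_INDICATOR_A = 0x1F1E6
country = "".join(
    chr(ord(c) - REGIONAL_INDICATOR_A + ord("A")) for c in flag
)

# Keycap: plain ASCII digit + variation selector + enclosing keycap mark.
keycap = "8\ufe0f\u20e3"

flag_names = [unicodedata.name(c) for c in flag]
keycap_names = [unicodedata.name(c) for c in keycap]

print(country)       # US
print(flag_names)
print(keycap_names)
```

Whether the pair or triple shows up as a single glyph is then entirely up to the font and renderer.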
In what UI framework? When I worked on that, I decided to render them from a different texture that doesn't depend on the current font, but scales to its size.
Note that some browsers do use the <meta charset="UTF-8"> even if the Content-Type header already sent the charset.
Another thing to add: always open a database connection in the charset of choice.
And if you are a PHP user (like I am): there are still functions that don't support multibyte so be careful.
This is the biggest current driver towards me trying to muster the effort to move off of PHP. Also, I had no end of trouble working with filenames that contained UTF-8 characters using PHP, and had to give up in the end.