Hacker News
by PeterisP 4602 days ago
The concluding statement is a bit weird: "ASCII appears identical unless you view the hex to see if each character is taking one byte (ASCII) or three (UTF-8)"

That isn't accurate. ASCII text would appear identical even if 'you view the hex', because it is byte-for-byte identical in UTF-8; that's the whole point of UTF-8. You'd have to look at non-ASCII characters to see how they're encoded.
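To make that concrete, here is a minimal Python sketch (Python 3.8+ for the `hex(" ")` separator). The sample strings and variable names are my own, picked only for illustration; they are not from the article.

```python
# Hypothetical sample strings, chosen only for illustration.
ascii_text = "Hello, world"
accented_text = "caf\u00e9"  # "café" with a precomposed e-acute (U+00E9)

# Pure ASCII text produces the same bytes under both encodings,
# so a hex dump cannot tell them apart.
ascii_as_ascii = ascii_text.encode("ascii")
ascii_as_utf8 = ascii_text.encode("utf-8")
same_bytes = ascii_as_ascii == ascii_as_utf8
print(same_bytes)               # True
print(ascii_as_utf8.hex(" "))   # one byte per character

# Only a non-ASCII character reveals the encoding in the hex.
latin1_bytes = accented_text.encode("latin-1")
utf8_bytes = accented_text.encode("utf-8")
print(latin1_bytes.hex(" "))    # 63 61 66 e9
print(utf8_bytes.hex(" "))      # 63 61 66 c3 a9
```

The first three bytes of "café" match in both dumps; only the é differs (one byte in ISO-8859-1, two in UTF-8).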

2 comments

Notepad also doesn't save as ASCII by default but as »ANSI«, i.e. the default legacy code page configured for your Windows installation.
Yes, the default Windows code page -- many pieces of software don't realize that registry keys, file paths, etc. are all encoded in a different code page if you are running, for example, Japanese Windows. (Also, it isn't exactly Shift-JIS...)
Yeah, I think he might have meant ISO-8859-1 or Windows-1252 rather than ASCII; but still, all of those characters would take up two bytes in UTF-8, not three, unless you used combining diacritics rather than precomposed characters.
All those characters? You mean except the straight ASCII-compatible ones, which will just take up one byte.
Yes, I meant all of the characters outside of the ASCII range. As in, there are no characters in ISO-8859-1 which take up more than 2 bytes in UTF-8. I guess there are a few in Windows-1252 which take up more than two bytes (like the Euro sign), so it's possible he meant Windows-1252 rather than ASCII.
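A quick way to check these byte counts, as a rough Python sketch. The codec names `latin-1` and `cp1252` are Python's names for ISO-8859-1 and Windows-1252; the variable names are just mine.

```python
import unicodedata

# Decode every byte value 0x00-0xFF as ISO-8859-1 and measure each
# resulting character's length in UTF-8.
latin1_chars = bytes(range(256)).decode("latin-1")
max_latin1_len = max(len(ch.encode("utf-8")) for ch in latin1_chars)
print(max_latin1_len)  # 2

# In Windows-1252, byte 0x80 is the Euro sign (U+20AC), which lies
# outside ISO-8859-1 and needs three bytes in UTF-8.
euro = b"\x80".decode("cp1252")
euro_len = len(euro.encode("utf-8"))
print(euro_len)  # 3

# Precomposed e-acute versus 'e' followed by a combining acute accent.
precomposed = "\u00e9"
decomposed = unicodedata.normalize("NFD", precomposed)  # "e" + U+0301
precomposed_len = len(precomposed.encode("utf-8"))
decomposed_len = len(decomposed.encode("utf-8"))
print(precomposed_len, decomposed_len)  # 2 3
```

So the "three bytes" figure only shows up for something like the Euro sign in Windows-1252, or for a base letter plus a combining diacritic (one byte for the letter, two for the accent).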