| HN Mirror



	by PeterisP 4648 days ago
	The concluding statement is a bit weird: "ASCII appears identical unless you view the hex to see if each character is taking one byte (ASCII) or three (UTF-8)". That isn't accurate. ASCII text would appear identical even if you "view the hex", because it is byte-for-byte identical in UTF-8; that's the whole point of UTF-8. You'd have to look at non-ASCII characters to see how they're encoded.
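A minimal sketch of the point above, in Python. The sample strings are illustrative assumptions and are not taken from the article under discussion. It encodes pure-ASCII text both ways and compares the bytes, then does the same for one non-ASCII character, where the encodings do differ.

```python
# Illustrative strings only; not from the article being discussed.
ascii_text = "Hello, world"
as_ascii = ascii_text.encode("ascii")
as_utf8 = ascii_text.encode("utf-8")

# "Viewing the hex" of pure-ASCII text shows the same bytes either way.
print(as_ascii.hex(" "))
print(as_utf8.hex(" "))
print(as_ascii == as_utf8)

# Only a non-ASCII character reveals which encoding was used.
non_ascii = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
latin1_bytes = non_ascii.encode("latin-1")
utf8_bytes = non_ascii.encode("utf-8")
print(latin1_bytes.hex(" "))  # e9
print(utf8_bytes.hex(" "))    # c3 a9
```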

2 comments

ygra 4648 days ago

Notepad also doesn't save as ASCII by default but as »ANSI«, i.e. the default legacy code page configured for your Windows installation.


apaprocki 4648 days ago

Yes, the default Windows code page -- many pieces of software don't realize that registry keys, file paths, etc. are all encoded in a different code page if you are running, for example, Japanese Windows. (Also, it isn't exactly Shift-JIS...)
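A rough sketch of why this bites, in Python. The file name is a made-up example, and `cp932` is Python's codec name for Microsoft's Shift-JIS variant (the code page used by Japanese Windows); software that assumes a Western code page such as `cp1252` reads the same bytes as something else entirely.

```python
# Hypothetical file name, chosen only for illustration.
name = "テスト.txt"

# Bytes as a Japanese Windows "ANSI" API would hand them over.
raw = name.encode("cp932")

# Decoding with the matching code page round-trips cleanly.
right = raw.decode("cp932")

# Decoding with a Western code page does not fail loudly; it yields mojibake.
# errors="replace" covers bytes that cp1252 leaves undefined.
wrong = raw.decode("cp1252", errors="replace")

print(right == name)
print(wrong == name)
```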


lambda 4648 days ago

Yeah, I think he might have meant ISO-8859-1 or Windows-1252 rather than ASCII; but still, all of those characters would take up two bytes in UTF-8, not three, unless you used combining diacritics rather than precomposed characters.
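To make the precomposed versus combining distinction concrete, here is a small Python sketch. The choice of "é" is an assumption for illustration; `unicodedata.normalize("NFD", ...)` splits it into a base letter plus a combining accent, which is where a three-byte count could come from.

```python
import unicodedata

precomposed = "\u00e9"                                   # single code point U+00E9
decomposed = unicodedata.normalize("NFD", precomposed)   # "e" followed by U+0301

precomposed_len = len(precomposed.encode("utf-8"))
decomposed_len = len(decomposed.encode("utf-8"))

print(precomposed_len)  # 2
print(decomposed_len)   # 3: one byte for "e", two for the combining acute
```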


jrochkind1 4648 days ago

All those characters? You mean except the straight ASCII-compatible ones, which will just take up one byte.


lambda 4648 days ago

Yes, I meant all of the characters outside of the ASCII range. As in, there are no characters in ISO-8859-1 which take up more than 2 bytes in UTF-8. I guess there are a few in Windows-1252 which take up more than two bytes (like the Euro sign), so it's possible he meant Windows-1252 rather than ASCII.
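Both claims can be checked with a short Python sketch. It relies on Python's built-in `cp1252` codec, which leaves a handful of byte values undefined, so those are skipped; the variable names are just for illustration.

```python
# ISO-8859-1 bytes 0x80-0xFF map directly to code points U+0080-U+00FF.
latin1_max = max(len(chr(cp).encode("utf-8")) for cp in range(0x80, 0x100))

# Collect the Windows-1252 characters that need three bytes in UTF-8.
three_byte = []
for b in range(0x80, 0x100):
    try:
        ch = bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        continue  # byte value not assigned in Python's cp1252 codec
    if len(ch.encode("utf-8")) == 3:
        three_byte.append(ch)

euro_bytes = "\u20ac".encode("utf-8")

print(latin1_max)            # 2
print(euro_bytes.hex(" "))   # e2 82 ac
print("\u20ac" in three_byte)
```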
