Hacker News
by jrochkind1 4603 days ago
> Never assume that the data you’re dealing with is UTF-8 — ASCII appears identical unless you view the hex to see if each character is taking one byte (ASCII) or three (UTF-8).

Um, what? This is just wrong. ASCII-equivalent characters take only one byte in UTF-8. Other characters take two, three, or four bytes.

If the author actually looked at ASCII text that took three bytes per character when encoded as UTF-8... I don't know what they were looking at, but it wasn't UTF-8.
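
If you want to check the byte counts yourself, here's a quick Python 3 sketch. The sample characters are just ones I picked for illustration, and they include one that needs four bytes:

```python
# Sample characters chosen for illustration: one per UTF-8 length.
samples = [
    "A",           # U+0041, plain ASCII
    "\u00e9",      # U+00E9, e with acute accent
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, an emoji outside the Basic Multilingual Plane
]

# Map each character to the number of bytes it takes in UTF-8.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in utf8_lengths.items():
    print(f"U+{ord(ch):04X} takes {n} byte(s) in UTF-8")
```

The ASCII character comes out as a single byte, and only the non-ASCII ones take more.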

Also, if the data is ASCII and contains only legal 7-bit ASCII characters, it is simultaneously ALSO valid UTF-8. UTF-8 is a superset of ASCII.
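
The superset point is easy to verify too. A minimal Python 3 sketch, where the sample string and the `is_ascii_only` helper are my own illustration and not anything from the article:

```python
text = "Hello, world!"  # sample string containing only 7-bit ASCII characters

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# The two encodings produce byte-for-byte identical output for ASCII text.
same_bytes = ascii_bytes == utf8_bytes

# Bytes produced by the ASCII encoder decode cleanly as UTF-8.
decoded = ascii_bytes.decode("utf-8")


def is_ascii_only(data: bytes) -> bool:
    """Return True if every byte is below 0x80, i.e. the data is 7-bit ASCII."""
    return all(b < 0x80 for b in data)


print(same_bytes, decoded == text, is_ascii_only(utf8_bytes))
```

The reverse does not hold: any UTF-8 data with a byte at 0x80 or above is not ASCII, which is the only case where the two differ.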

I'm not sure this guy understands what he's talking about.