Hacker News
by Kwpolska 2514 days ago
> How can you tell if a doc is US ASCII or UTF-8 if there are no other indications in the content?

If the document contains no bytes above 0x7F, it is both US-ASCII and UTF-8 at the same time. If it does contain such bytes and still decodes as valid UTF-8, it’s far more likely to be UTF-8 than whatever national encoding it might otherwise be (latin1? windows-1252? the equivalents for other countries?), because text in those encodings rarely happens to form valid UTF-8 multi-byte sequences.
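As a rough sketch of that heuristic in Python (the function name `guess_encoding` and the `windows-1252` fallback are my own assumptions for illustration, not any standard API):

```python
def guess_encoding(data: bytes, fallback: str = "windows-1252") -> str:
    """Guess between US-ASCII, UTF-8 and a legacy fallback encoding.

    The fallback is only a hypothetical default; pick whatever national
    encoding is plausible for your documents.
    """
    # No byte above 0x7F: the data is US-ASCII and valid UTF-8 at once.
    if all(b < 0x80 for b in data):
        return "us-ascii"
    # High bytes present: accept UTF-8 only if strict decoding succeeds.
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return fallback
    return "utf-8"


print(guess_encoding(b"plain text"))
print(guess_encoding("zażółć".encode("utf-8")))
print(guess_encoding("café".encode("latin-1")))
```

This is a guess, not proof: a short legacy-encoded document can decode as UTF-8 by coincidence, it just doesn't happen often with real text.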

With the BOM, you get more serious problems. Every tool along the way needs to know about it, and that it needs to be ignored when reading, but probably not when writing. I remember the olden days of modifying .php files from CMSes with notepad.exe, which happily and silently added a BOM to the file, and suddenly your website displayed three weird characters (typically ï»¿) at the very top of the page.
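To make the BOM problem concrete, here is a small Python sketch. The helper `strip_utf8_bom` and the sample PHP bytes are made up for illustration; `codecs.BOM_UTF8` and the `utf-8-sig` codec are from the standard library.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical file contents as saved by an editor that prepends a BOM.
saved = codecs.BOM_UTF8 + b"<?php echo 'hi';"

# Read as windows-1252, the three BOM bytes show up as visible junk.
print(saved[:3].decode("windows-1252"))

# A plain "utf-8" decode keeps the BOM as U+FEFF; "utf-8-sig" drops it.
print(repr(saved.decode("utf-8")[0]))
print(saved.decode("utf-8-sig"))

# Stripping at the byte level before anything else touches the file.
print(strip_utf8_bom(saved))
```

Anything that treats the file as raw bytes sees those three bytes as content sitting before `<?php`, which is why they ended up in the page output.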