| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by jcranmer 3476 days ago

UTF-8 has many nice properties. One of the nicest is that most random binary strings are not valid UTF-8. In fact, the structure of UTF-8 strings is such that, if a file parses as UTF-8 without error, then it is almost certainly UTF-8.

If it's merely ASCII, it doesn't matter. Nearly every charset contains all valid ASCII texts as a strict subset. UTF-7, UTF-16, UTF-32, and EBCDIC are the major counterexamples, and UTF-7 and EBCDIC aren't going to come up unless you're actually expecting them to. (Technically, ISO-2022 charsets can introduce non-ASCII characters without use of high bit characters, since they switch modes using the ASCII ESC character as part of the sequence. In practice, ESC isn't going to come up in most ASCII text and ISO-2022-JP (the only one still in major use) will frequently use 8-bit characters anyways).

The only useful purpose of a BOM is to distinguish between UTF-16LE and UTF-16BE, and even then it's discouraged in favor of actually maintaining labels (or not storing in UTF-16 in the first place). You can detect UTF-8 in practice without a BOM quite easily, and it's only Microsoft who feels obliged to need them.