by YSFEJ4SWJUVU6 2397 days ago
>since UTF-8 doesn’t have a “magic number” to identify itself, the convention is to use the BOM codepoint

Neither does any other of the hundreds of existing text encodings.

It's debatable how much of a magic number it's supposed to be anyway. Few people have ever insisted on having magic numbers in text files, and you get the BOM at the beginning simply by naively converting a UCS-2/UTF-16 file codepoint by codepoint. (And vice versa: you have to force it to be there if you ever do the conversion the other way around, because of course your converter couldn't include that extra logic.)
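To make the "naive conversion" point concrete, here is a rough Python sketch. It assumes a little-endian UTF-16 input, and the variable names are made up for illustration. It decodes without consuming the BOM and re-encodes each codepoint with no special handling of U+FEFF:

```python
import codecs

# A UTF-16 (little-endian) "file" that starts with a BOM, followed by "hi".
utf16_bytes = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# The explicit-endianness codec does not strip the BOM, so it survives
# as the codepoint U+FEFF, which is what a naive converter would see.
code_points = utf16_bytes.decode("utf-16-le")

# Re-encode codepoint by codepoint, with no special case for U+FEFF.
utf8_bytes = b"".join(ch.encode("utf-8") for ch in code_points)

print(utf8_bytes)  # the output starts with the UTF-8 BOM bytes EF BB BF
```

Going the other way, a converter that wants the BOM in its UTF-16 output has to add it on purpose if the UTF-8 input didn't carry one.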

1 comments

The nice thing about the BOM is that you can't get it accidentally in an ASCII file: all three of its bytes (EF BB BF in UTF-8) have the upper bit set, while every ASCII character has that bit as zero. It makes an excellent magic number for that reason. It's probably just as unlikely to come up in other encodings that use the upper bit.
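A quick way to check the high-bit claim, sketched in Python. The helper names `has_utf8_bom` and `is_ascii` are hypothetical and not from any library:

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """True if the data starts with the UTF-8 BOM bytes EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)

def is_ascii(data: bytes) -> bool:
    """True if every byte has the upper bit clear (value below 0x80)."""
    return all(b < 0x80 for b in data)

# Every byte of the UTF-8 BOM has the upper bit set.
bom_high_bits = [(b & 0x80) != 0 for b in codecs.BOM_UTF8]

# For comparison: the same three bytes read as Latin-1 are valid text,
# just an unlikely way for a real document to begin.
bom_as_latin1 = codecs.BOM_UTF8.decode("latin-1")
```

Since pure ASCII data never has a byte at 0x80 or above, `has_utf8_bom` can't fire on it by accident. The Latin-1 line only illustrates the "other encodings" case; it doesn't show how rare that prefix is in practice.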