by OskarS 2397 days ago
Yes, it’s widely used. Many text editors insert a UTF-8 BOM as the first character in a text file to signal that the encoding is UTF-8. It’s technically pointless since UTF-8 doesn’t depend on endianness, but since UTF-8 doesn’t have a “magic number” to identify itself, the convention is to use the BOM codepoint.

You can occasionally see it in git diffs as U+FEFF, or as the bytes EF BB BF if you open a text file in a hex editor.
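For anyone who wants to poke at this, here is a minimal Python sketch of checking for and stripping the three-byte prefix. The helper name `strip_utf8_bom` is made up for illustration; `codecs.BOM_UTF8` is the standard library constant.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (hypothetical helper)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# U+FEFF encoded as UTF-8 is the byte sequence EF BB BF.
raw = "\ufeffhello".encode("utf-8")
print(raw[:3].hex(" "))      # ef bb bf
print(strip_utf8_bom(raw))   # b'hello'
```

When reading text rather than raw bytes, Python's built-in "utf-8-sig" codec does the same thing on decode: it drops a leading BOM if there is one and otherwise behaves like plain UTF-8.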

1 comments

>since UTF-8 doesn’t have a “magic number” to identify itself, the convention is to use the BOM codepoint

Neither does any other of the hundreds of existing text encodings.

It's debatable how much of a magic number it's supposed to be anyway, considering that few people have insisted on having magic numbers in text files, and that you get the BOM at the beginning simply by naively converting a UCS-2/UTF-16 file codepoint by codepoint. (And vice versa: you have to make sure it is there if you ever do the conversion the other way around, because of course your converter couldn't include that extra logic itself.)
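A rough Python sketch of that naive conversion, assuming a little-endian UTF-16 input that starts with a BOM. The function name `naive_utf16_to_utf8` is hypothetical; the point is only that a converter which does not special-case U+FEFF carries it straight through into the UTF-8 output.

```python
import codecs


def naive_utf16_to_utf8(data: bytes) -> bytes:
    """Convert UTF-16-LE bytes to UTF-8 without any BOM handling."""
    # "utf-16-le" treats a leading FF FE as an ordinary U+FEFF character,
    # unlike "utf-16", which consumes it as a byte-order mark.
    text = data.decode("utf-16-le")
    return text.encode("utf-8")


utf16_file = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
converted = naive_utf16_to_utf8(utf16_file)
print(converted.hex(" "))  # ef bb bf 68 69
```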

The nice thing about the BOM is you can't get it accidentally in an ASCII file - all the bytes have the upper bit set but all ASCII characters have that bit as zero. It makes an excellent magic number for that reason. It's probably just as unlikely to come up in other encodings that use the upper bit.
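The upper-bit claim is easy to check. A small Python sketch (the helper `is_ascii` is just an illustrative name): every BOM byte is 0x80 or above, so no pure-ASCII file can start with it. As one data point for other encodings, the same three bytes decoded as Latin-1 come out as "ï»¿", which is not a likely way for real text to begin.

```python
BOM = b"\xef\xbb\xbf"


def is_ascii(data: bytes) -> bool:
    """True if every byte has the upper bit clear (hypothetical helper)."""
    return all(b < 0x80 for b in data)


# Every byte of the BOM has the upper bit (0x80) set.
bom_high_bits = all(b & 0x80 for b in BOM)
print(bom_high_bits)          # True
print(is_ascii(b"hello"))     # True
print(is_ascii(BOM))          # False

# The same bytes interpreted as Latin-1 (ISO-8859-1).
print(BOM.decode("latin-1"))  # ï»¿
```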