by SAI_Peregrinus 212 days ago
UTF-8 is a text format with no BOM, just as ASCII doesn't support a BOM. The BOM is a UTF-16 or UTF-32 thing, so "UTF-8 with BOM" is a binary file that happens to contain some UTF-8 strings as well. Since it's not a text file, it makes sense that utilities expecting text files don't handle it.

Eh? A utf8 file starting with ZERO WIDTH NO-BREAK SPACE is not a text file? How do you figure that?
If it starts with 0xFE 0xFF, but is otherwise UTF-8 instead of UTF-16, it's a binary file. If it starts with 0xEF 0xBB 0xBF, it's a text file with a ZERO WIDTH NO-BREAK SPACE at the start.
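To make the two cases concrete, here is a rough Python sketch. The helper names `classify_prefix` and `is_valid_utf8` are made up for illustration, and the check ignores UTF-32 (whose little-endian BOM also begins with 0xFF 0xFE). It relies only on the BOM constants in the standard `codecs` module and on the fact that the bytes 0xFE and 0xFF never occur in valid UTF-8.

```python
import codecs


def classify_prefix(data: bytes) -> str:
    """Describe the leading bytes of `data` (hypothetical helper, UTF-32 not handled)."""
    if data.startswith(codecs.BOM_UTF8):      # b"\xef\xbb\xbf"
        return "UTF-8 encoding of U+FEFF"
    if data.startswith(codecs.BOM_UTF16_BE):  # b"\xfe\xff"
        return "UTF-16 big-endian BOM"
    if data.startswith(codecs.BOM_UTF16_LE):  # b"\xff\xfe"
        return "UTF-16 little-endian BOM"
    return "no BOM"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 0xFE 0xFF followed by UTF-8 text: the prefix is not valid UTF-8 at all.
mixed = b"\xfe\xff" + "hello".encode("utf-8")

# 0xEF 0xBB 0xBF followed by UTF-8 text: valid UTF-8 whose first code point is U+FEFF.
prefixed = b"\xef\xbb\xbf" + "hello".encode("utf-8")

print(classify_prefix(mixed), is_valid_utf8(mixed))
print(classify_prefix(prefixed), is_valid_utf8(prefixed))
print(repr(prefixed.decode("utf-8")))
```

With the plain `utf-8` codec, Python keeps the leading U+FEFF in the decoded string rather than stripping it, which matches the "text file with a ZERO WIDTH NO-BREAK SPACE at the start" reading.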
> If it starts with 0xFE 0xFF, but is otherwise UTF-8 instead of UTF-16, it's a binary file

Sure, but who does this? All the Microsoft tooling writes 0xEF 0xBB 0xBF if you output utf8 with a BOM.
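For reference, the same three bytes fall out of simply encoding U+FEFF as UTF-8. The sketch below uses Python's `utf-8-sig` codec as a stand-in for "output utf8 with a BOM"; it is not a claim about how any Microsoft tool is implemented, and the variable names are only illustrative.

```python
import unicodedata

# U+FEFF is the code point used as the byte order mark.
bom_char = "\ufeff"
bom_name = unicodedata.name(bom_char)

# Encoding that single code point as UTF-8 gives the three-byte prefix.
bom_bytes = bom_char.encode("utf-8")

# Python's "utf-8-sig" codec prepends the same bytes when encoding...
with_bom = "hi".encode("utf-8-sig")

# ...and strips them again when decoding, unlike the plain "utf-8" codec.
stripped = with_bom.decode("utf-8-sig")
kept = with_bom.decode("utf-8")

print(bom_name, bom_bytes.hex(" "))
print(with_bom, repr(stripped), repr(kept))
```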