| HN Mirror



	by cryptonector 279 days ago
	You do not need a BOM for UTF-8. Ever. Byte order issues are not a problem for UTF-8 because UTF-8 is manipulated as a string of _bytes_, not as a string of 16-bit or 32-bit code units. You _do_ need a BOM for UTF-16 and UTF-32.
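To illustrate the byte-order point, here is a minimal Python sketch (the sample string is arbitrary, chosen only for illustration). The same text has exactly one UTF-8 byte sequence, while UTF-16 has two possible byte orders that a reader cannot tell apart without a BOM or outside knowledge.

```python
import codecs

text = "é€"  # arbitrary sample: U+00E9 and U+20AC

utf8 = text.encode("utf-8")
utf16_le = text.encode("utf-16-le")  # little-endian code units, no BOM
utf16_be = text.encode("utf-16-be")  # big-endian code units, no BOM

print(utf8.hex(" "))      # c3 a9 e2 82 ac  (one possible byte sequence)
print(utf16_le.hex(" "))  # e9 00 ac 20
print(utf16_be.hex(" "))  # 00 e9 20 ac

# The UTF-16 BOM is U+FEFF written in the chosen byte order.
print(codecs.BOM_UTF16_LE.hex(" "))  # ff fe
print(codecs.BOM_UTF16_BE.hex(" "))  # fe ff
```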


mikelabatt 278 days ago

In a pure UTF-8 world we would not need it, sure. I get that point. But what do you want to do with the 40+ years' worth of text files written in the 8-bit encodings that followed 7-bit ASCII, which may sit side by side with UTF-8 files? If we want to preserve our past, the practical solution is that the OS or app has a default character set for 8-bit text encoding, in addition to supporting (and using as a default) UTF-8.

I also agree that "BOM" is the wrong name for a UTF-8... BOM. Byte order is not the issue. But still, it's a header that says that the file, even if empty, is UTF-8. Detecting an 8-bit legacy character set is much more difficult than recognizing (skipping) a BOM.
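Recognizing and skipping the UTF-8 BOM is indeed cheap. A minimal Python sketch, where `strip_utf8_bom` is a hypothetical helper name used only for illustration:

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

# Python's built-in "utf-8-sig" codec does the same thing while decoding:
# it drops a leading BOM if present and otherwise behaves like "utf-8".
with_bom = codecs.BOM_UTF8 + "hello".encode("utf-8")
print(strip_utf8_bom(with_bom))      # b'hello'
print(with_bom.decode("utf-8-sig"))  # hello
```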


cryptonector 277 days ago

UTF-8 does not need a BOM at all and never needed it, for two reasons:

- first, byte order doesn't affect the UTF-8 encoding,

- second, the codeset metadata problem you're trying to solve is a problem that already existed before and still does after UTF-8 enters the scene -- you just have to know if some text file (or whatever) uses UTF-8, ISO 8859-x, SHIFT-JIS, UTF-16, etc.

The second point addresses your concern, but that metadata has to be out of band. Putting it in-band creates the sorts of problems that others have pointed out, and it creates an annoyance once all non-Unicode locales are gone. And since the goal is to have Unicode replace all other codesets, and since we've made a great deal of progress in that direction, there is no need now to add this wart.


mikelabatt 277 days ago

Thanks for your insights. I did change my mind about the need for a BOM (though not about the need to be able to parse/skip it if present).

In a future where everything defaults to UTF-8 it makes sense. This is probably easier to envision in an English-only context where the jump from 7-bit ASCII to UTF-8 is cleaner.

Where I come from, UTF-8 is not always supported. Without a header (or "BOM", though we don't like the name) you don't know which encoding a text file was meant to be (re-)saved in when it was created. My example of an empty file was meant to illustrate that. But leaning on the Utopian side, I too shall put more energy towards all apps supporting UTF-8 :)


cryptonector 276 days ago

Excellent!

Yeah, UTF-8 by default (or better, as the only option) is the dream.

Keep in mind that if you do use a BOM for UTF-16 then it's possible to reliably tell that some file is in UTF-8.
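One way to read that as a heuristic, sketched in Python under stated assumptions: `guess_encoding` and its `legacy_default` parameter are hypothetical names, and `cp1252` is only an assumed stand-in for whatever 8-bit default the OS or app uses. The idea is that once UTF-16 is ruled out by its BOM, a strict UTF-8 decode is a strong test, because non-ASCII text in legacy 8-bit codesets rarely forms valid UTF-8 sequences.

```python
import codecs

def guess_encoding(data: bytes, legacy_default: str = "cp1252") -> str:
    """Guess a codec name for data; a sketch, not a complete detector."""
    # A UTF-8 BOM, if someone wrote one, is recognized and skipped on decode.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # UTF-16 is identified by its BOM (UTF-32 is not handled in this sketch).
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No BOM: accept UTF-8 only if the bytes decode strictly.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Which legacy codeset this is still has to come from out of band.
        return legacy_default

print(guess_encoding("héllo".encode("utf-8")))    # utf-8
print(guess_encoding("héllo".encode("latin-1")))  # cp1252
print(guess_encoding("héllo".encode("utf-16")))   # utf-16
```

Pure ASCII and empty input also come back as `utf-8` here, which is harmless for decoding but does not recover the encoding the author intended for an empty file.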
