by erik_seaberg 2459 days ago
It would have been nice if every well-encoded Unicode document started with a BOM and every legacy doc did not, instead of having to guess whether a doc is more likely UTF-8 or Latin-1.

Then concatenating two valid Unicode documents would no longer produce valid Unicode. That is bad. And ASCII text would no longer be a valid UTF-8 encoded Unicode document. That is bad. And even when everything has finally switched to UTF-8, every tool ever will still need to handle the BOM. That is bad.
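
To make the concatenation and ASCII points concrete, here is a rough Python sketch. It assumes the hypothetical "every UTF-8 document must start with a BOM" rule from the parent comment; `with_bom` is a made-up helper for illustration, not anything from a real tool.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def with_bom(text: str) -> bytes:
    """Hypothetical helper: encode text as UTF-8 with a mandatory leading BOM."""
    return BOM + text.encode("utf-8")


doc_a = with_bom("foo\n")
doc_b = with_bom("bar\n")

# Byte-level concatenation, i.e. what `cat a b` would produce.
joined = doc_a + doc_b

# "utf-8-sig" strips a BOM only at the very start of the data,
# so the second document's BOM survives as U+FEFF in the middle.
decoded = joined.decode("utf-8-sig")
stray_bom_count = decoded.count("\ufeff")

# Plain ASCII decodes identically as ASCII and as UTF-8 today,
# but it would fail a "must start with a BOM" check.
ascii_bytes = b"plain ASCII"
ascii_is_utf8 = ascii_bytes.decode("utf-8") == ascii_bytes.decode("ascii")
ascii_has_bom = ascii_bytes.startswith(BOM)
```

Under that rule, every tool that joins or splits files would have to know about the BOM, which is the third complaint above.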

Guessing between valid UTF-8 and Latin-1 is only ever ambiguous when there are multiple non-ASCII characters in a row and every such run happens to consist of a lead byte followed by the correct number of continuation bytes. How often is that a problem for you in practice?
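
A minimal sketch of that guess in Python, assuming the only two candidates are UTF-8 and Latin-1. `guess_encoding` is a hypothetical name, and it leans on Python's strict UTF-8 decoder to check the lead-byte and continuation-byte structure.

```python
def guess_encoding(data: bytes) -> str:
    """Return "utf-8" if data is well-formed UTF-8, else fall back to "latin-1".

    Assumption: the input is one of those two encodings. Any byte string
    decodes as Latin-1, so the fallback itself can never fail.
    """
    try:
        data.decode("utf-8")  # strict by default: bad sequences raise
    except UnicodeDecodeError:
        return "latin-1"
    return "utf-8"


latin1_bytes = "café".encode("latin-1")  # 0xE9 is a lead byte with no continuation bytes
utf8_bytes = "café".encode("utf-8")      # 0xC3 0xA9 is a well-formed two-byte sequence

# The rare ambiguous case: Latin-1 text whose non-ASCII bytes happen to
# line up as a lead byte plus a continuation byte.
ambiguous = "Ã©".encode("latin-1")       # the same bytes as "é" in UTF-8
```

Here `guess_encoding(latin1_bytes)` comes back as `"latin-1"`, while both `utf8_bytes` and `ambiguous` come back as `"utf-8"`. Only the last one is a real misfire, and it takes text like "Ã©" to trigger it.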