Hacker News
by layer8 60 days ago
XML arguably isn’t plain text, but a binary format: If you add/change the encoding declaration on the first line, the remaining bytes will be interpreted differently. Unless you process it as a function of its declared (or auto-detected, see below) encoding, you have to treat it as a binary file.
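To make that concrete, here is a minimal Python sketch (my own illustration, assuming the standard library's `xml.etree.ElementTree` parser, which honors the encoding declaration when given bytes). The element bytes are identical in both documents; only the declaration differs, and the parsed text changes with it.

```python
import xml.etree.ElementTree as ET

# The same two payload bytes (C3 A9) inside the same element.
body = b"<a>\xc3\xa9</a>"

# Only the declared encoding differs between the two documents.
utf8_doc = b'<?xml version="1.0" encoding="UTF-8"?>' + body
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body

as_utf8 = ET.fromstring(utf8_doc).text      # one character: U+00E9
as_latin1 = ET.fromstring(latin1_doc).text  # two characters: U+00C3 U+00A9

print(len(as_utf8), len(as_latin1))  # 1 2
```

A tool that edits the first line without re-encoding the rest silently changes the document's content, which is the sense in which the file behaves like a binary format.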

In the absence of an encoding declaration, the encoding is in some cases detected automatically from the first four bytes (https://www.w3.org/TR/xml/#sec-guessing-no-ext-info). Again, that means that XML is a binary format.
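A rough sketch of that first-four-bytes check, in Python. This is a simplified reading of the table in the linked appendix, not a complete implementation: the function name `guess_xml_encoding_family` and the returned labels are my own, and the unusual UCS-4 byte orders listed in the spec are left out.

```python
def guess_xml_encoding_family(data: bytes) -> str:
    """Guess the encoding family of an XML document from its first four bytes."""
    head = data[:4]

    # Four-byte patterns are checked first, so the UCS-4 little-endian
    # BOM (FF FE 00 00) is not mistaken for the UTF-16 one (FF FE).
    four_byte_patterns = {
        b"\x00\x00\xfe\xff": "UCS-4 big-endian (BOM)",
        b"\xff\xfe\x00\x00": "UCS-4 little-endian (BOM)",
        b"\x00\x00\x00\x3c": "UCS-4 big-endian",
        b"\x3c\x00\x00\x00": "UCS-4 little-endian",
        b"\x00\x3c\x00\x3f": "UTF-16 big-endian",
        b"\x3c\x00\x3f\x00": "UTF-16 little-endian",
        # "<?xm" in an ASCII-compatible encoding: the declaration is
        # readable and has to be consulted for the exact encoding.
        b"\x3c\x3f\x78\x6d": "ASCII-compatible (read the declaration)",
        # "<?xm" in EBCDIC.
        b"\x4c\x6f\xa7\x94": "EBCDIC (read the declaration)",
    }
    if head in four_byte_patterns:
        return four_byte_patterns[head]

    # Shorter byte order marks.
    if head.startswith(b"\xfe\xff"):
        return "UTF-16 big-endian (BOM)"
    if head.startswith(b"\xff\xfe"):
        return "UTF-16 little-endian (BOM)"
    if head.startswith(b"\xef\xbb\xbf"):
        return "UTF-8 (BOM)"

    # Nothing matched: treated as UTF-8 without a declaration.
    return "UTF-8 (default)"
```

For example, `guess_xml_encoding_family('<?xml version="1.0"?>'.encode("utf-16-le"))` returns `"UTF-16 little-endian"`, because the encoded document starts with the bytes 3C 00 3F 00.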

1 comment

Another way to declare the character encoding is ISO 2022. With ISO 2022, UTF-8 is announced by the escape sequence <1B 25 47> (ESC % G), rather than by the byte order mark <EF BB BF> that XML and some other formats recognize.

However, whichever way you do it, I think the encoding declaration should not be omitted unless the text is purely ASCII, in which case it should be omitted.
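A small Python sketch of that rule, under my own assumptions: the helper names `encode_with_marker` and `detect_marker` are hypothetical, and the marker is simply prepended to the UTF-8 bytes.

```python
from typing import Optional

ISO2022_UTF8 = b"\x1b%G"    # ESC % G, i.e. 1B 25 47
UTF8_BOM = b"\xef\xbb\xbf"  # EF BB BF


def encode_with_marker(text: str, marker: bytes = ISO2022_UTF8) -> bytes:
    """Encode text as UTF-8, prefixing a marker only when it is needed."""
    payload = text.encode("utf-8")
    # Purely ASCII text gets no marker at all, per the rule above.
    if text.isascii():
        return payload
    return marker + payload


def detect_marker(data: bytes) -> Optional[str]:
    """Report which UTF-8 marker, if any, the data starts with."""
    if data.startswith(ISO2022_UTF8):
        return "ISO 2022 (ESC % G)"
    if data.startswith(UTF8_BOM):
        return "BOM"
    return None
```

With this, `encode_with_marker("hello")` is just `b"hello"`, while any non-ASCII text starts with 1B 25 47 by default, or with EF BB BF when `marker=UTF8_BOM` is passed.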