by jfk13 1964 days ago
This reflects the fact that PDF uses the UTF-16BE encoding scheme for Unicode text, not UTF-8.

One oddity is that the PDF spec's description of the "Text String Type", e.g. at p.158 in https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdf_ref..., appears to say there should be a leading BOM (U+FEFF) in the string here, but this doesn't seem to be the case in reality. Indeed, adding one may cause issues for at least some PDF readers, according to discussion at https://github.com/tesseract-ocr/tesseract/issues/1150.
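
To make the BOM question concrete, here is a rough Python sketch of a decoder that tolerates both cases. The helper name `decode_utf16be` and the sample string "Hi" are made up for illustration; nothing here comes from the PDF spec or from tesseract.

```python
import codecs

# U+FEFF serialized in big-endian byte order: b"\xfe\xff"
BOM = codecs.BOM_UTF16_BE


def decode_utf16be(raw: bytes) -> str:
    """Decode UTF-16BE bytes, dropping one leading BOM if present.

    Hypothetical helper for illustration only.
    """
    if raw.startswith(BOM):
        raw = raw[len(BOM):]
    return raw.decode("utf-16-be")


with_bom = BOM + "Hi".encode("utf-16-be")
without_bom = "Hi".encode("utf-16-be")

# Both variants decode to the same text.
print(decode_utf16be(with_bom) == decode_utf16be(without_bom))
```

Note that Python's plain "utf-16-be" codec does not strip a BOM by itself; it leaves U+FEFF in the decoded string as an ordinary character, which is one way a stray BOM can end up visible downstream.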

But anyhow, in short: that's simply UTF-16BE text being represented as a series of bytes. It has nothing to do with any kind of "security hack", and the null bytes are not "bad characters"; they're the high byte of each UTF-16 code unit, which is zero for every character below U+0100 (including all of ASCII).
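
A quick way to see this for yourself, as a small Python sketch (the sample string "Hello" is just an assumption for illustration, not the text from the PDF being discussed):

```python
text = "Hello"  # sample ASCII string, chosen for illustration
raw = text.encode("utf-16-be")

# Each character becomes one 16-bit code unit, high byte first.
print(raw.hex(" "))  # 00 48 00 65 00 6c 00 6c 00 6f

high_bytes = raw[0::2]  # first byte of each code unit
low_bytes = raw[1::2]   # second byte of each code unit

# For ASCII text every high byte is zero and the low bytes are the ASCII codes.
print(high_bytes == b"\x00" * len(text))
print(low_bytes == text.encode("ascii"))

# A character at or above U+0100 has a non-zero high byte.
euro = "\u20ac".encode("utf-16-be")  # EURO SIGN, U+20AC
print(euro.hex(" "))  # 20 ac
```

So a dump that looks like ASCII interleaved with nulls is exactly what you'd expect from English text in UTF-16BE.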

1 comment

And this is a perfect example of why text should just be treated as blobs that can be displayed via OS functions. General application developers shouldn't have to be experts in every flavour of Unicode encoding just to display some writing.