Hacker News
by gbear0 1964 days ago
It's only the text-showing instructions that have this, not the rest of the text. I.e., one line of the content looks like this, where it's trying to write the text 'Service':

  BT 0 Tr 0.000000 w ET BT 44.814370 775.487087 Td [(\0S\0e\0r\0v\0i\0c\0e)] TJ ET

This reflects the fact that PDF uses the UTF-16BE encoding form for Unicode text, not UTF-8.

One oddity is that the PDF spec's description of the "Text String Type", e.g. at p.158 in https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdf_ref..., appears to say there should be a leading BOM (U+FEFF) in the string here, but this doesn't seem to be the case in reality. Indeed, adding one may cause issues for at least some PDF readers, according to discussion at https://github.com/tesseract-ocr/tesseract/issues/1150.
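To make the BOM point concrete, here is a minimal Python sketch of what the same string looks like with and without a leading U+FEFF. It only illustrates the byte-level difference using Python's standard codecs; it says nothing about how any particular PDF reader behaves, and the variable names are purely illustrative.

```python
import codecs

# Big-endian UTF-16 byte order mark: the encoded form of U+FEFF.
bom = codecs.BOM_UTF16_BE

# The explicit "utf-16-be" codec never emits a BOM when encoding.
without_bom = "Service".encode("utf-16-be")
with_bom = bom + without_bom

# The generic "utf-16" codec uses a leading BOM to pick the byte order
# and then drops it from the decoded text.
decoded_generic = with_bom.decode("utf-16")

# The explicit "utf-16-be" codec keeps the BOM as a U+FEFF character,
# which is one way a stray BOM can end up in the visible text.
decoded_explicit = with_bom.decode("utf-16-be")
```

A consumer that assumes the bytes are already UTF-16BE and does not strip the BOM would see an extra U+FEFF at the start of the string, as `decoded_explicit` shows.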

But anyhow, in short: that's simply UTF-16BE text being represented as a series of bytes. It has nothing to do with any kind of "security hack", and the null bytes are not "bad characters"; they're the high byte of each UTF-16 code unit, which is zero for characters in the ASCII range.
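As a quick check of that reading, a small Python sketch (assuming the string in the content stream really is UTF-16BE, as described above) reproduces the same byte pattern for 'Service'. The variable names are illustrative only.

```python
text = "Service"

# Encode as big-endian UTF-16 with no BOM: two bytes per code unit,
# high byte first.
raw = text.encode("utf-16-be")

# Split the byte string into the high and low byte of each code unit.
high_bytes = raw[0::2]  # every even-indexed byte
low_bytes = raw[1::2]   # every odd-indexed byte

# Decoding with the same codec round-trips to the original text.
round_trip = raw.decode("utf-16-be")
```

Here `raw` is `b'\x00S\x00e\x00r\x00v\x00i\x00c\x00e'`, which matches the `\0S\0e\0r\0v\0i\0c\0e` in the PDF line; `high_bytes` is seven zero bytes and `low_bytes` is the plain ASCII `b'Service'`.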

And this is a perfect example of why text should just be treated as blobs that can be displayed via OS functions. General application developers shouldn't have to be experts in every flavour of Unicode encoding just to display some writing.