by gbear0 1968 days ago
Actually I can think of one place I know I've seen explicit null characters in text, and that's a Facebook Business Record PDF. If you download your FB data, the text instructions in all of the PDF's raw content streams actually have a null character between EVERY SINGLE CHARACTER!

I found it extremely annoying at first because I was trying to copy/paste the stream chunks around and it wouldn't copy anything after the first null. Then I realized this was probably a security hack in the hopes that people couldn't copy the data around (I can't think of any other reason to add these nulls like this). Funnily enough, I opened the PDF in Chrome and copy/paste of the selected text works fine. So clearly some readers strip these bad characters, but I can imagine others might not.

> the PDF raw contents streams actually have a null character between EVERY SINGLE CHARACTER

That sounds more like you're looking at UTF-16 data and trying to interpret it as ASCII.
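
To illustrate the point, here is a minimal Python sketch (not taken from the PDF in question; the variable names are just for illustration). Encoding ASCII-range text as UTF-16BE puts a zero byte in front of every character, which is exactly what shows up when the bytes are viewed as if they were ASCII.

```python
# Encode a plain ASCII-range word as UTF-16BE and look at the raw bytes.
text = "Service"
raw = text.encode("utf-16-be")

# Every character becomes two bytes: a 0x00 high byte, then the ASCII value.
print(raw)       # b'\x00S\x00e\x00r\x00v\x00i\x00c\x00e'
print(len(raw))  # 14

# Reading those bytes as a single-byte encoding keeps the NULs as characters.
misread = raw.decode("latin-1")
print(misread.count("\x00"))  # 7
```

Decoding `raw` with `"utf-16-be"` instead gives back `"Service"` with no nulls at all.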

It's only the text instructions that have this, not the rest of the text. I.e., one line of the content looks like this, where it's trying to write the text 'Service':

  BT 0 Tr 0.000000 w ET BT 44.814370 775.487087 Td [(\0S\0e\0r\0v\0i\0c\0e)] TJ ET

This reflects the fact that PDF uses the UTF-16BE encoding form for Unicode text, not UTF-8.

One oddity is that the PDF spec's description of the "Text String Type", e.g. at p.158 in https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdf_ref..., appears to say there should be a leading BOM (U+FEFF) in the string here, but this doesn't seem to be the case in reality. Indeed, adding one may cause issues for at least some PDF readers, according to discussion at https://github.com/tesseract-ocr/tesseract/issues/1150.

But anyhow, in short: that's simply UTF-16BE text being represented as a series of bytes. It's nothing to do with any kind of "security hack", and the null bytes are not "bad characters", they're the high byte of each UTF-16 code unit.
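
As a rough sketch of that reading, assuming the string operand really is UTF-16BE as argued above: the Python below expands the `\0` octal escapes from the quoted `TJ` operand into bytes and decodes them. `unescape_octal` and `decode_utf16be` are hypothetical helpers written for this example; the first handles only `\ddd` octal escapes, not the full PDF literal-string syntax, and the second drops a leading BOM if one happens to be present.

```python
import re


def unescape_octal(literal: str) -> bytes:
    """Hypothetical helper: expand only \\ddd octal escapes in a PDF literal string."""
    out = bytearray()
    pos = 0
    for m in re.finditer(r"\\([0-7]{1,3})", literal):
        # Copy the ordinary characters before the escape, then the escaped byte.
        out += literal[pos:m.start()].encode("latin-1")
        out.append(int(m.group(1), 8) & 0xFF)
        pos = m.end()
    out += literal[pos:].encode("latin-1")
    return bytes(out)


def decode_utf16be(raw: bytes) -> str:
    """Decode UTF-16BE bytes, dropping a leading BOM (FE FF) if there is one."""
    if raw.startswith(b"\xfe\xff"):
        raw = raw[2:]
    return raw.decode("utf-16-be")


literal = r"\0S\0e\0r\0v\0i\0c\0e"  # operand from the quoted content stream line
raw = unescape_octal(literal)
print(raw)                  # b'\x00S\x00e\x00r\x00v\x00i\x00c\x00e'
print(decode_utf16be(raw))  # Service
```

A tool that stops at the first zero byte would see an empty string here, which matches the copy/paste behaviour described at the top of the thread.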

And this is a perfect example of why text should just be treated as blobs that can be displayed via OS functions. General application developers shouldn't have to be experts in every flavour of Unicode encoding just to display some writing.