Hacker News
by ztravis 2010 days ago
My guess is that the output PDF is still valid, but that an embedded (subset) font has had its `ToUnicode` map stripped, so that there's no link between the character codes used in the text elements and the "actual" characters they represent (there are also other ways this corruption could happen, but dropping or mangling the `ToUnicode` map seems like a likely cause).
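To make that concrete, here's a toy sketch in Python. It is not a real PDF parser. `build_subset`, `extract`, and the "offset from 0x40" fallback are all made up for illustration; an actual viewer's fallback depends on the font's encoding. The point is only that the codes in a subset font are arbitrary, so without the map the extracted text is consistent gibberish.

```python
# Toy model only: a subset font assigns arbitrary codes to the glyphs it uses.
# Function names and the fallback rule are hypothetical, not from any PDF library.

def build_subset(text):
    """Assign codes 1, 2, 3, ... to characters in order of first use."""
    code_for = {}
    for ch in text:
        if ch not in code_for:
            code_for[ch] = len(code_for) + 1
    # Stand-in for the ToUnicode map: code -> actual character.
    to_unicode = {code: ch for ch, code in code_for.items()}
    return [code_for[ch] for ch in text], to_unicode

def extract(codes, to_unicode=None):
    """Turn codes back into text, as copy/paste would."""
    if to_unicode is not None:
        return "".join(to_unicode[c] for c in codes)
    # Map is gone: guess by treating each code as an offset from 0x40.
    # (An assumed fallback, chosen only so the output is printable.)
    return "".join(chr(0x40 + c) for c in codes)

codes, to_unicode = build_subset("hello world")
print(extract(codes, to_unicode))  # map intact: original text
print(extract(codes))              # map stripped: repeatable nonsense
```

The glyphs still render correctly either way, since drawing only needs the code-to-glyph link. Only text extraction breaks.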

This is almost certainly it. I've seen similar issues with copy/paste from poorly constructed PDFs, often ones generated by "print to PDF" features.
Very old LaTeX PDFs tend to have this issue too. Chances are pretty slim for profs to edit PDFs with Preview, I think…
Yep, and in that case it's because those PDFs were often generated through really horrifying pipelines (e.g. TeX to DVI to PS to PDF). Under some workflows, the resulting document wouldn't even contain any characters, as far as PDF was concerned -- it'd just be a bunch of vectors.
Or not even vectors, but lots of little bitmaps. It's really awful.
I agree. If you look closely, you can see certain patterns repeating; they’re just not English letters. But it definitely looks like natural language, and not a random binary dump.
Also look at the spaces. The word lengths are the same in both texts, so the content is still present; just the characters got shifted.
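If anyone wants to check that intuition rather than eyeball it, here's a rough Python sketch. The helper names are mine, and the sample strings are invented (a plain shift-by-one with spaces left alone), not taken from the actual PDF. It tests two things: that the word lengths match, and that the garbled text is a consistent one-to-one substitution of the original.

```python
def word_lengths(text):
    """Lengths of the space-separated words, in order."""
    return [len(word) for word in text.split(" ")]

def is_substitution(original, garbled):
    """True if every character maps to exactly one character, and back."""
    if len(original) != len(garbled):
        return False
    forward, backward = {}, {}
    for a, b in zip(original, garbled):
        # setdefault records the first pairing and returns it afterwards,
        # so any later disagreement shows up as a mismatch.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True

original = "hello world"
garbled = "ifmmp xpsme"  # made-up sample: each letter shifted by one

print(word_lengths(original) == word_lengths(garbled))
print(is_substitution(original, garbled))
```

A random binary dump would almost never pass the second check, which is why the repeating patterns are a good sign the text is recoverable.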