HN Mirror


	by ztravis 2010 days ago
	My guess is that the output PDF is still valid, but that an embedded (subset) font has had its `ToUnicode` map stripped, so there's no longer any link between the character codes used in the text elements and the "actual" characters they represent. (There are other ways this corruption could happen, but a dropped or mangled `ToUnicode` map seems like a likely cause.)
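
To make the guess above concrete, here is a toy model in Python, not real PDF parsing. It assumes a subsetting tool numbers glyphs in first-seen order and that a viewer without a `ToUnicode` map falls back to treating each code as an offset from `0x40`; both are simplifications chosen for illustration, and the names `build_subset_font` and `extract_text` are hypothetical.

```python
def build_subset_font(text):
    """Assign a small integer code to each distinct character, in first-seen order.

    Returns the encoded text (a list of codes) and a ToUnicode-like map
    from code back to character. The numbering scheme is an assumption.
    """
    code_for = {}
    to_unicode = {}
    for ch in text:
        if ch not in code_for:
            code = len(code_for) + 1
            code_for[ch] = code
            to_unicode[code] = ch
    codes = [code_for[ch] for ch in text]
    return codes, to_unicode


def extract_text(codes, to_unicode=None):
    """Recover text from codes; without a map, guess by offsetting from 0x40."""
    if to_unicode is not None:
        return "".join(to_unicode[c] for c in codes)
    # Assumed fallback: the viewer has no idea what the codes mean.
    return "".join(chr(0x40 + c) for c in codes)


codes, to_unicode = build_subset_font("hello world")
print(extract_text(codes, to_unicode))  # hello world
print(extract_text(codes))              # ABCCDEFDGCH
```

The glyphs on the page would render identically in both cases; only copy/paste and text extraction see the difference.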

2 comments

duskwuff 2010 days ago

This is almost certainly it. I've seen similar issues with copy/paste from poorly constructed PDFs, often ones generated by "print to PDF" features.

arthur2e5 2010 days ago

Very old LaTeX PDFs tend to have this issue too. Chances are pretty slim for profs to edit PDFs with Preview, I think…

duskwuff 2010 days ago

Yep, and in that case it's because those PDFs were often generated through really horrifying pipelines (e.g. TeX to DVI to PS to PDF). Under some workflows, the resulting document wouldn't even contain any characters, as far as PDF was concerned -- it'd just be a bunch of vectors.

mkl 2009 days ago

Or not even vectors, but lots of little bitmaps. It's really awful.

lrossi 2010 days ago

I agree. If you look closely, you can see certain patterns repeating; they're just not English letters. But it definitely looks like natural language, and not a random binary dump.
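
One way to check for repeating patterns like these is to reduce each word to the order in which its distinct characters first appear. This is only a sketch; `word_pattern` is a hypothetical helper, and the garbled sample is made up rather than taken from the PDF under discussion.

```python
def word_pattern(word):
    """Replace each character with the index of its first appearance in the word."""
    first_seen = {}
    return tuple(first_seen.setdefault(ch, len(first_seen)) for ch in word)


# A word and a consistently re-coded copy of it share the same pattern.
print(word_pattern("hello"))  # (0, 1, 2, 2, 3)
print(word_pattern("ABCCD"))  # (0, 1, 2, 2, 3)
```

Random binary data would rarely reproduce the patterns of common words, whereas a one-to-one re-coding of the characters always does.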

Marioheld 2009 days ago

Also look at the spaces. The words are the same length in both texts, so the content is still present; the characters just got shifted.
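
The word-length observation can be checked mechanically, and taken one step further by testing whether the two texts are related by a consistent one-to-one character substitution. A minimal sketch, assuming you have both the original and the garbled text as strings; the function names and sample strings are hypothetical.

```python
def word_lengths(text):
    """Lengths of the whitespace-separated words in text."""
    return [len(word) for word in text.split()]


def substitution_map(original, garbled):
    """Return a dict mapping original chars to garbled chars if the two
    strings differ only by a one-to-one substitution, else None."""
    if len(original) != len(garbled):
        return None
    forward, backward = {}, {}
    for o, g in zip(original, garbled):
        # Each original char must always map to the same garbled char, and back.
        if forward.setdefault(o, g) != g or backward.setdefault(g, o) != o:
            return None
    return forward


original = "the cat sat"
garbled = "xqz mfx ufx"  # made-up sample, spaces left intact
print(word_lengths(original) == word_lengths(garbled))  # True
print(substitution_map(original, garbled) is not None)  # True
```

If `substitution_map` returns a mapping, the text content survived and only the character codes were remapped, which is consistent with a missing or mangled `ToUnicode` map.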
