Hacker News
by noamyoungerm 3463 days ago
It's entirely possible (and fairly common) to have PDFs where the order of characters in the underlying content stream has no relationship at all to how those same characters are laid out visually on the page. Anything even marginally more complex than a series of unformatted paragraphs basically requires you to render the whole PDF and work out the order in which the characters are actually meant to be read.
1 comment

Yup. For instance, Word's PDF output has an absolutely positioned text box for every word (and sometimes for sub-word fragments). This is for kerning purposes. If you want your original text back, you're going to need some OCR-like preprocessing and heuristics to guess which text boxes belong to the same line. If you have multiple columns, good luck distinguishing them from accidental rivers.

It's not impossible, but I don't know offhand which tools get this most right. And it's always a lossy operation going back and forth.
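
To make the "guess which text boxes belong to the same line" heuristic concrete, here is a minimal sketch in plain Python. It assumes the boxes have already been extracted from the PDF by some other tool; the `Box` class, the `group_into_lines` function, the coordinate convention (y measured from the top of the page) and the `y_tolerance` value are all hypothetical choices for illustration, not any particular library's API.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """One absolutely positioned text fragment (hypothetical structure)."""
    text: str
    x: float  # left edge of the fragment
    y: float  # baseline position, measured from the top of the page


def group_into_lines(boxes, y_tolerance=2.0):
    """Cluster boxes into lines by baseline, then order each line left to right.

    y_tolerance is an assumed threshold in page units: a box whose baseline
    is within this distance of the previous box's baseline joins its line.
    """
    lines = []
    for box in sorted(boxes, key=lambda b: b.y):
        if lines and abs(box.y - lines[-1][-1].y) <= y_tolerance:
            lines[-1].append(box)
        else:
            lines.append([box])
    # Within a line, reading order is approximated by the left edge.
    return [
        " ".join(b.text for b in sorted(line, key=lambda b: b.x))
        for line in lines
    ]


# Boxes deliberately listed out of reading order, as they might be stored.
boxes = [
    Box("world", 60.0, 100.3),
    Box("Second", 10.0, 115.0),
    Box("Hello", 10.0, 100.0),
    Box("line", 62.0, 114.6),
]
print(group_into_lines(boxes))
```

This only handles the easy case. It joins every fragment with a space, so sub-word boxes would need a gap-width check to be merged correctly, and it knows nothing about columns, which is exactly where the rivers problem above comes in.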