by aardvark179 921 days ago
It’s usually easy to extract individual strings from a PDF, normally single lines, but it can be quite hard to understand how those form into longer paragraphs, especially if the page has multiple columns and inline figures.
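To make that concrete, here is a toy sketch, not taken from any real extractor. It assumes the extraction step has already produced positioned lines (the `Line` type, the coordinates and the `max_gap` threshold are all made up for illustration) and tries the obvious heuristic of reading top to bottom and splitting on large vertical gaps.

```python
from dataclasses import dataclass


@dataclass
class Line:
    x: float   # left edge of the line (hypothetical units: points)
    y: float   # distance from the top of the page, increasing downward
    text: str


def group_paragraphs(lines, max_gap=14.0):
    """Naive heuristic: read top to bottom, left to right, and start a
    new paragraph whenever the vertical gap exceeds max_gap."""
    paragraphs = []
    current = []
    prev_y = None
    for line in sorted(lines, key=lambda l: (l.y, l.x)):
        if prev_y is not None and line.y - prev_y > max_gap:
            paragraphs.append(" ".join(current))
            current = []
        current.append(line.text)
        prev_y = line.y
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


# Works for a single column...
single_column = [
    Line(72, 100, "The quick brown"),
    Line(72, 112, "fox jumps."),
    Line(72, 140, "A new paragraph."),
]

# ...but interleaves the text as soon as there are two columns.
two_columns = [
    Line(72, 100, "Left column"),
    Line(72, 112, "continues here."),
    Line(320, 100, "Right column"),
    Line(320, 112, "continues too."),
]

print(group_paragraphs(single_column))
print(group_paragraphs(two_columns))
```

The single-column page comes out as two sensible paragraphs, while the two-column page comes out as one paragraph with the columns shuffled together. Fixing that means guessing the column layout first, and inline figures make that guess harder still.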

It’s also easy to create a PDF that is hard to extract text from, not through a deliberate attempt to enforce copy protection but often simply as a side effect of trying to reduce the file size, since you may not want to store an entire font in the document.
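A toy model of how that can happen, under loud assumptions: this is not real PDF structure, and `subset_font`, `encode` and `extract` are hypothetical helpers. The idea is that a subsetter keeps only the glyphs actually used and may renumber them, so the page content stores arbitrary codes. Real PDFs can carry a ToUnicode map to translate those codes back to characters; if the generator leaves it out, the extractor has nothing to go on.

```python
def subset_font(text):
    """Assign each distinct character a code in order of first use,
    roughly as a subsetting generator might renumber glyphs."""
    code_for = {}
    for ch in text:
        if ch not in code_for:
            code_for[ch] = len(code_for) + 1
    return code_for


def encode(text, code_for):
    """What ends up stored in the page content: codes, not characters."""
    return bytes(code_for[ch] for ch in text)


def extract(stored, to_unicode=None):
    """With a code-to-character map the text is recoverable; without one
    the extractor can only guess (here: treat the codes as Latin-1)."""
    if to_unicode is None:
        return stored.decode("latin-1")
    return "".join(to_unicode[b] for b in stored)


text = "hello pdf"
code_for = subset_font(text)
stored = encode(text, code_for)

# The map a well-behaved generator would also write into the file.
to_unicode = {code: ch for ch, code in code_for.items()}

print(repr(extract(stored, to_unicode)))  # the original text
print(repr(extract(stored)))              # control characters, not text
```

The page still renders perfectly in both cases, because the viewer only needs the codes and the glyph outlines. Only the extractor notices the missing map.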

I’ve been on both ends of this, generating documents and consuming them, and I think we probably could have created something that allowed for much easier text extraction, but it’s far too late now.