by notacop31337 1303 days ago
My understanding is that this is largely because you're fighting an adversarial format provider in Adobe. I've read a few papers and journal entries on file format polyglotting, some focused on PDF, and the approaches keep shifting because Adobe keeps mooting pathways to success. I think that's partly for security and, IMO, partly for obscurity, as PDF is a horrific format in all reality except for human visual interpretation. Many organisations and industries ONLY produce public data via PDF, which makes parsing that information a far more difficult task (again, I suspect by design). After trying a few parsing options for PDF, I've come to hate it as a format. Luckily, some options from a few cloud providers seem to be really hammering the problem's complexity down, but the cost of those solutions can be very steep.

PDF is a subset of Postscript, which is a full-blown programming language disguised as a page description language.

People who think of the format as "adversarial" are wrong. Adobe never gave a shit about being adversarial in that sense.

The problem is that PDF is not a file format, it's a defined subset of a programming language (PostScript) used for portable rendering with fidelity. It's portable, in the sense that it should render the same way on whatever device it's rendered on (printed on a page or mastered to a display). And it's portable because it doesn't allow any postscript job-level commands, and it tries to ensure that each PDF File is standalone and can be concatenated together into a multi-page document or embedded in another document.

Postscript (and PDF) are also postfix, which can be confusing.

> PDF is a subset of Postscript

That’s a bit of an oversimplification. There’s a whole layer of structure atop the postscript subset. Much software deals only with that layer, never looking into the chunks of rendering code. That’s plenty complicated already!

> Postscript (and PDF) are also postfix, which can be confusing.

I handwrote quite a bit of postscript way back when. It wasn’t that bad, really, you just had to keep the state of the stack firmly in your head. Being used to HP scientific calculators helped. I would never dream of handwriting a pdf file, though. Even the low level parts are harder to deal with, since most command names have been shortened to a single letter for efficiency.
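
For anyone who hasn't used a stack language, here is a minimal sketch of what "keeping the state of the stack in your head" means. It's a toy evaluator in Python, not a PostScript interpreter: the function name `eval_postfix` is made up for illustration, and it only borrows a handful of real PostScript operator names (`add`, `sub`, `mul`, `div`, `dup`, `exch`).

```python
def eval_postfix(source: str) -> list[float]:
    """Evaluate a tiny PostScript-like postfix expression; return the final stack."""
    # Hypothetical toy subset: only these binary operators are handled.
    binary_ops = {
        "add": lambda a, b: a + b,
        "sub": lambda a, b: a - b,
        "mul": lambda a, b: a * b,
        "div": lambda a, b: a / b,
    }
    stack: list[float] = []
    for token in source.split():
        if token in binary_ops:
            b = stack.pop()  # top of the stack is the right-hand operand
            a = stack.pop()
            stack.append(binary_ops[token](a, b))
        elif token == "dup":
            stack.append(stack[-1])  # duplicate the top item
        elif token == "exch":
            stack[-1], stack[-2] = stack[-2], stack[-1]  # swap the top two items
        else:
            stack.append(float(token))  # anything else is treated as a number
    return stack
```

With this, `eval_postfix("3 4 add 2 mul")` leaves `[14.0]` on the stack, i.e. (3 + 4) * 2 with the operators written after their operands.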

Postfix is fine, you get used to it fairly quickly. But when you've finished and want to go to the toilet, you walk there backwards.

What libraries do you see as being SOTA? Fitz? Tika?

My hope is that computer vision + OCR will solve this once and for all in the near future.

To be 100% honest it's been a while since I looked into libraries for it, so I couldn't say.

Your second comment rings true, and in my opinion, we are there. I highly recommend throwing some PDFs at AWS Textract and checking out the quality. It wasn't there a few years ago, but I can safely state it's there now. I threw stuff at it that previously would just come out as trash, and it handled it fairly well, specifically for table data extraction (I was looking at public stock market quarterly reports).
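
For the table extraction part, here is a rough sketch of turning a Textract response into rows of cell text. It assumes you already have the response dict, e.g. from boto3's `analyze_document` called with `FeatureTypes=["TABLES"]`, and that it has the usual `Blocks` layout (TABLE blocks pointing at CELL blocks, which point at WORD blocks). The helper name `tables_from_blocks` is mine, and it deliberately ignores merged cells and selection elements.

```python
def tables_from_blocks(blocks: list[dict]) -> list[list[list[str]]]:
    """Rebuild each TABLE block as a list of rows, each row a list of cell strings."""
    by_id = {block["Id"]: block for block in blocks}

    def child_ids(block: dict) -> list[str]:
        # Relationships may be absent on empty cells.
        return [
            child_id
            for rel in (block.get("Relationships") or [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
        ]

    tables = []
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        cells: dict[tuple[int, int], str] = {}
        for cell_id in child_ids(block):
            cell = by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                by_id[word_id]["Text"]
                for word_id in child_ids(cell)
                if by_id[word_id]["BlockType"] == "WORD"
            ]
            # RowIndex and ColumnIndex are 1-based.
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        n_rows = max((row for row, _ in cells), default=0)
        n_cols = max((col for _, col in cells), default=0)
        tables.append(
            [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
             for r in range(1, n_rows + 1)]
        )
    return tables
```

Usage would be something like `tables_from_blocks(response["Blocks"])`, after which each table can be dumped to CSV.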

Cost is the kicker for me: 1000 pages for $15 adds up fairly quickly at any sort of scale!
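
To put a number on "adds up", here is the back-of-the-envelope arithmetic as a sketch. The $15 per 1000 pages default is just the figure quoted above, not something checked against current pricing, and the function name is hypothetical.

```python
def extraction_cost_usd(pages: int, usd_per_1000_pages: float = 15.0) -> float:
    """Estimate cost at a flat per-page rate (assumed: no volume discounts or free tier)."""
    return pages / 1000 * usd_per_1000_pages
```

At that rate, a backlog of a million pages comes to `extraction_cost_usd(1_000_000)`, which is $15,000.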

OCR is built into Adobe's PDF reader, the issue is it's $15 a month.

I really want to see OCR become easier to use, but I don't know why it's such a hard problem in the first place.

There is the Python library ocrmypdf (https://ocrmypdf.readthedocs.io/en/latest/), which works really well. I have found the results comparable to Adobe's in accuracy.

I believe it uses Tesseract, Ghostscript and some other libraries.
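
In case it helps, a minimal sketch of driving it from a script. This assumes the `ocrmypdf` command-line tool is installed and on PATH; `-l` (language) and `--skip-text` (leave pages that already have a text layer alone) are real options, while the helper names and file names are placeholders.

```python
import subprocess


def build_ocrmypdf_command(input_pdf: str, output_pdf: str,
                           language: str = "eng",
                           skip_text: bool = True) -> list[str]:
    """Build the argument list for an ocrmypdf run (nothing is executed here)."""
    cmd = ["ocrmypdf", "-l", language]
    if skip_text:
        # Don't re-OCR pages that already contain text.
        cmd.append("--skip-text")
    cmd += [input_pdf, output_pdf]
    return cmd


def run_ocr(input_pdf: str, output_pdf: str) -> None:
    """Run ocrmypdf; raises CalledProcessError if it exits non-zero."""
    subprocess.run(build_ocrmypdf_command(input_pdf, output_pdf), check=True)
```

Calling `run_ocr("scan.pdf", "scan_ocr.pdf")` should give you a searchable copy of the scan, with the original page images kept.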

Speaking of Ghostscript, one way to deal with problematic PDFs is to print them to file and deal with the result instead.
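
A sketch of that "print to file" trick, assuming the `gs` binary is on PATH (on Windows the executable has a different name). It re-writes the PDF through Ghostscript's `pdfwrite` device; the helper names are made up, and whether the rewritten file is easier to parse depends on what was wrong with the original.

```python
import subprocess


def build_gs_rewrite_command(input_pdf: str, output_pdf: str) -> list[str]:
    """Build a Ghostscript command that re-emits input_pdf via the pdfwrite device."""
    # -o sets the output file and also implies batch mode with no pause between pages.
    return ["gs", "-o", output_pdf, "-sDEVICE=pdfwrite", input_pdf]


def rewrite_pdf(input_pdf: str, output_pdf: str) -> None:
    """Run Ghostscript; raises CalledProcessError if it exits non-zero."""
    subprocess.run(build_gs_rewrite_command(input_pdf, output_pdf), check=True)
```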

Do any open source apps integrate this?

I'd love to just be able to search a PDF document for a string and get a list of results.
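
If the PDF already has a text layer (or has been through OCR), something like this rough sketch gets you that list. It assumes poppler's `pdftotext` is installed, which writes to stdout when the output file is `-` and separates pages with a form feed; the function names are mine, and the search is a plain case-insensitive substring match.

```python
import subprocess


def extract_pages(pdf_path: str) -> list[str]:
    """Return the text of each page, using pdftotext (assumed to be on PATH)."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True, capture_output=True, text=True,
    )
    pages = result.stdout.split("\f")  # pdftotext ends each page with a form feed
    if pages and not pages[-1].strip():
        pages.pop()  # drop the empty chunk after the last form feed
    return pages


def search_pages(pages: list[str], needle: str) -> list[tuple[int, int, str]]:
    """Return (page number, line number, line) for every line containing needle."""
    needle = needle.lower()
    hits = []
    for page_no, page in enumerate(pages, start=1):
        for line_no, line in enumerate(page.splitlines(), start=1):
            if needle in line.lower():
                hits.append((page_no, line_no, line.strip()))
    return hits
```

Usage would be `search_pages(extract_pages("report.pdf"), "revenue")`, where `report.pdf` is a placeholder file name.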