Hacker News
by bob1029 499 days ago
Parsing data out of arbitrary PDFs is a cursed mission. A PDF can contain images, so you might as well target JPEG directly.

OCR can take you pretty far depending on expectations, but it's never quite far enough in my experience.

1 comment

That's been our experience as well. We just scrap any of the metadata associated with the PDF and treat it like an image, since you never know when a document has a screenshot of an Excel table inside.

The .NORM files (https://xkcd.com/2116)