| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by yfontana 406 days ago

I've been working on extracting text from some 20 million PDFs, with just about every type of layout you can imagine. We're using a similar approach (segmentation / OCR), but with PyMuPDF.

The full extract is projected to run for several days on a GPU cluster, at a cost of like 20-30k (can't remember the exact number but it's in that ballpark). When you can afford this kind of compute, text extraction from PDFs isn't quite a fully solved problem, but we're most of the way there.

What the article in the OP tries to do is, as far as I understand, somewhat different. It's trying to use much simpler heuristics to get acceptable results cheaper and faster, and this is definitely an open issue.