by yfontana
406 days ago
I've been working on extracting text from some 20 million PDFs, covering just about every type of layout you can imagine. We're using a similar approach (segmentation / OCR), but with PyMuPDF. The full extraction is projected to run for several days on a GPU cluster, at a cost of roughly 20-30k (I can't remember the exact number, but it's in that ballpark). When you can afford this kind of compute, text extraction from PDFs isn't quite a fully solved problem, but we're most of the way there. What the article in the OP tries to do is, as far as I understand, somewhat different: it uses much simpler heuristics to get acceptable results cheaper and faster, and that is definitely still an open problem.