by ramoz 878 days ago
For me, PyMuPDF/fitz has been the best way to retain natural reading order and to define rules flexible enough to extract text from complex layouts.

None of the mentioned tools did this out of the box, and none seemed easy to configure, yet all of them are definitely hyped and marketed way beyond fitz.
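
To make "rules for reading order" concrete, here is a minimal sketch of one such rule for a two-column page. It works on tuples shaped like the output of PyMuPDF's `page.get_text("blocks")`, i.e. `(x0, y0, x1, y1, text, block_no, block_type)`. The `reading_order` function and its midline rule are assumptions made up for illustration, not something fitz ships or something the comment above describes in detail.

```python
from typing import List, Tuple

# Same shape as one entry of PyMuPDF's page.get_text("blocks"):
# (x0, y0, x1, y1, text, block_no, block_type)
Block = Tuple[float, float, float, float, str, int, int]


def reading_order(blocks: List[Block], page_width: float) -> List[Block]:
    """Order text blocks column by column, top to bottom.

    Hypothetical rule: a block belongs to the left column if it starts
    left of the page midline (this includes full-width blocks such as
    titles), otherwise to the right column.
    """
    mid = page_width / 2
    # block_type 0 is text, 1 is an image block; keep text only.
    text_blocks = [b for b in blocks if b[6] == 0]

    def key(b: Block):
        column = 0 if b[0] < mid else 1
        return (column, b[1], b[0])  # column first, then top edge, then left edge

    return sorted(text_blocks, key=key)


# Hand-written sample blocks standing in for real extractor output.
sample = [
    (320.0, 100.0, 580.0, 200.0, "right top", 0, 0),
    (30.0, 300.0, 290.0, 400.0, "left bottom", 1, 0),
    (30.0, 100.0, 290.0, 200.0, "left top", 2, 0),
    (30.0, 20.0, 580.0, 60.0, "Title", 3, 0),
    (30.0, 420.0, 290.0, 500.0, "<image>", 4, 1),
]
ordered_texts = [b[4] for b in reading_order(sample, page_width=612.0)]
```

With a real document you would pass in the list from `page.get_text("blocks")` and `page.rect.width`. Real layouts usually need more than one midline (three columns, sidebars, footnotes), so treat this as a starting point for your own rules.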

1 comment

Same here, fitz is great. It does well enough out of the box that I only need a few simple heuristics on top: joining or splitting paragraphs where it makes a mistake, extracting drawings, and so on. With those I get pretty close to 100% accuracy on the output.
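
A paragraph-joining heuristic of the kind mentioned above might look like the sketch below. It is pure Python and runs on a list of text chunks in reading order (for example the text field of each block from `page.get_text("blocks")`). The `join_paragraphs` name and the specific rules (merge when the previous chunk has no closing punctuation and the next one starts lowercase, drop a trailing hyphen) are assumptions for illustration, not the commenter's actual code.

```python
from typing import List

# Characters that, at the end of a chunk, suggest the paragraph is complete.
_SENTENCE_END = (".", "!", "?", ":", '"', "\u201d")


def _continues(prev: str, nxt: str) -> bool:
    """Guess whether nxt is a continuation of prev (assumed heuristic)."""
    return not prev.endswith(_SENTENCE_END) and nxt[0].islower()


def join_paragraphs(chunks: List[str]) -> List[str]:
    """Merge chunks that were split mid-paragraph by the extractor."""
    merged: List[str] = []
    for chunk in chunks:
        text = " ".join(chunk.split())  # collapse line breaks and extra spaces
        if not text:
            continue  # skip empty or whitespace-only chunks
        if merged and _continues(merged[-1], text):
            prev = merged[-1]
            if prev.endswith("-"):
                # Treat a trailing hyphen as a line-break hyphen and drop it.
                # This will also (wrongly) fuse genuine compounds split there.
                merged[-1] = prev[:-1] + text
            else:
                merged[-1] = prev + " " + text
        else:
            merged.append(text)
    return merged


paragraphs = join_paragraphs([
    "The parser splits this para-\n",
    "graph across two blocks and",
    "continues here.",
    "A new paragraph starts.",
    "   ",
])
```

Splitting works the same way in reverse: pick a signal (a large vertical gap, an indent, a font change) and cut where it appears. Which signals are reliable depends on the documents you are processing.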

The only thing it doesn't do is table detection (neither does pdfminer.six), but there are plenty of other ways to handle tables.