

by _bohm 2736 days ago

Sure! In terms of raw text extraction (for documents that don't require OCR), the most useful tools I've worked with have been pdftotext [0] and PyMuPDF [1]. For extracting useful details, really, my best advice is to make sure that your regex skills are sharp. I've been meaning to explore the possibility of using NLP tools for named entity recognition, but unfortunately I don't have much of a background there.
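To make the regex point concrete, here is a minimal sketch, assuming an invoice-like document. The field names, the two patterns, and the `extract_fields` helper are made up for illustration; the only real tool involved is the `pdftotext` command line program, called with `-layout` and `-` so the text goes to stdout.

```python
import re
import subprocess

def pdf_to_text(path: str) -> str:
    """Run the pdftotext CLI on `path` and return the extracted text.

    Assumes pdftotext is installed and on PATH. "-" as the output file
    sends the text to stdout; -layout tries to keep the page layout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Hypothetical patterns for an invoice-like document. Real documents will
# need their own patterns, usually several variants per field.
INVOICE_NO = re.compile(
    r"\bInvoice\s*(?:No\.?|#)\s*:?\s*(?P<number>[A-Z0-9-]+)", re.IGNORECASE
)
TOTAL = re.compile(
    r"\bTotal\s*(?:Due)?\s*:?\s*\$?(?P<amount>\d{1,3}(?:,\d{3})*(?:\.\d{2})?)",
    re.IGNORECASE,
)

def extract_fields(text: str) -> dict:
    """Pull out whichever fields match; missing fields are simply left out."""
    fields = {}
    match = INVOICE_NO.search(text)
    if match:
        fields["invoice_number"] = match.group("number")
    match = TOTAL.search(text)
    if match:
        # Drop thousands separators before converting to a number.
        fields["total"] = float(match.group("amount").replace(",", ""))
    return fields
```

Typical use would be something like `extract_fields(pdf_to_text("some.pdf"))`, with the patterns swapped for whatever the documents at hand actually contain.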

The rest of it kind of just comes down to using good software engineering practices to help keep yourself sane. Find useful abstractions for common tasks you need to perform and build a library around them, make sure that your data processing pipeline is designed with enough flexibility to handle inputs in different formats so that adding or modifying parsing logic becomes trivial, etc.
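One way that kind of flexibility could look, as a rough sketch rather than a description of any particular codebase: keep one small parser function per input format and have the pipeline try them in turn. The `register` decorator, the two format parsers, and their patterns are all hypothetical names invented for this example.

```python
import re
from typing import Callable, List, Optional

# A parser takes extracted text and returns a record, or None if the
# text is not in the format it understands.
Parser = Callable[[str], Optional[dict]]

PARSERS: List[Parser] = []

def register(func: Parser) -> Parser:
    """Add a parser to the registry; supporting a new format is one new function."""
    PARSERS.append(func)
    return func

@register
def parse_format_a(text: str) -> Optional[dict]:
    # Hypothetical format A: "Invoice No: <id>"
    match = re.search(r"Invoice No:\s*(\S+)", text)
    return {"format": "a", "invoice_number": match.group(1)} if match else None

@register
def parse_format_b(text: str) -> Optional[dict]:
    # Hypothetical format B: "Ref# <id>"
    match = re.search(r"Ref#\s*(\S+)", text)
    return {"format": "b", "invoice_number": match.group(1)} if match else None

def parse_document(text: str) -> Optional[dict]:
    """Return the record from the first parser that recognises the text."""
    for parser in PARSERS:
        record = parser(text)
        if record is not None:
            return record
    return None
```

With this shape, changing the logic for one format only touches that format's function, and the rest of the pipeline just calls `parse_document`.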

[0] https://www.xpdfreader.com/pdftotext-man.html
[1] https://pymupdf.readthedocs.io/en/latest/

1 comment

ocrcustomserver 2736 days ago

pdfminer is another good library (Python).
