by TheAceOfHearts
2736 days ago
Do you have any tool suggestions or general advice for someone trying to do this? A while back I was trying to extract text from some government PDFs in order to make the information more accessible for others, but I became a bit overwhelmed when I started reading up on PDFs.
The rest of it mostly comes down to using good software engineering practices to help keep yourself sane. Find useful abstractions for the common tasks you need to perform and build a library around them, and design your data processing pipeline with enough flexibility to handle inputs in different formats, so that adding or modifying parsing logic becomes trivial.
[0] https://www.xpdfreader.com/pdftotext-man.html
[1] https://pymupdf.readthedocs.io/en/latest/
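
To make the "flexible pipeline" point a bit more concrete, here is a minimal sketch of one way it could look, assuming the text has already been pulled out of the PDF by some extraction tool. Everything in it (the Record type, the register decorator, the layout names "key_value" and "pipe_table") is hypothetical and made up for illustration; it is not taken from any particular library.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Record:
    """One parsed row, tagged with the layout it came from (hypothetical type)."""
    layout: str
    fields: Dict[str, str]


Parser = Callable[[str], List[Record]]

# Registry mapping a layout name to the function that knows how to parse it.
PARSERS: Dict[str, Parser] = {}


def register(layout: str) -> Callable[[Parser], Parser]:
    """Decorator that adds a parser to the registry under the given layout name."""
    def wrap(fn: Parser) -> Parser:
        PARSERS[layout] = fn
        return fn
    return wrap


@register("key_value")
def parse_key_value(text: str) -> List[Record]:
    """Parse lines shaped like 'Key: value' into a single record."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    return [Record("key_value", fields)]


@register("pipe_table")
def parse_pipe_table(text: str) -> List[Record]:
    """Parse a pipe-separated table whose first non-blank line is the header."""
    rows = [
        [cell.strip() for cell in line.split("|")]
        for line in text.splitlines()
        if line.strip()
    ]
    if not rows:
        return []
    header, *body = rows
    return [Record("pipe_table", dict(zip(header, row))) for row in body]


def parse(layout: str, text: str) -> List[Record]:
    """Dispatch extracted text to whichever parser is registered for its layout."""
    try:
        parser = PARSERS[layout]
    except KeyError:
        raise ValueError(f"no parser registered for layout {layout!r}") from None
    return parser(text)
```

With something like this, supporting a new document layout means writing one more decorated function, and the rest of the pipeline only ever calls parse(layout, text).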