by voxadam
3045 days ago
What tools did you end up settling on for PDF data/text extraction? I ask because I have a side project that I've been neglecting for far too long, and it depends in part on cleanly extracting text from PDFs (other formats too, but PDFs are by far the most headache-inducing).
https://github.com/jcushman/pdfquery
> PDFQuery works by loading a PDF as a pdfminer layout, converting the layout to an etree with lxml.etree, and then applying a pyquery wrapper. All three underlying libraries are exposed, so you can use any of their interfaces to get at the data you want.
Here is a minimal Python script to dump the XML tree so you can load it in whatever other language you use and work with it from there.
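A minimal sketch of such a dump, assuming pdfquery is installed (`pip install pdfquery`). The function names `dump_pdf_tree` and `default_xml_path` are hypothetical and only for illustration; the pdfquery calls follow the usage shown in the project's README.

```python
"""Dump a PDF's layout tree to an XML file using PDFQuery (a sketch)."""
from pathlib import Path


def default_xml_path(pdf_path):
    """Hypothetical helper: derive the output name, e.g. report.pdf -> report.xml."""
    return str(Path(pdf_path).with_suffix(".xml"))


def dump_pdf_tree(pdf_path, xml_path=None):
    """Parse pdf_path and write its layout tree as XML; return the output path."""
    # Imported here so this module still imports where pdfquery is not installed.
    import pdfquery

    xml_path = xml_path or default_xml_path(pdf_path)
    pdf = pdfquery.PDFQuery(pdf_path)
    pdf.load()  # with no arguments, every page is parsed
    # pdf.tree is an lxml ElementTree, so lxml's write() options apply.
    pdf.tree.write(xml_path, pretty_print=True, encoding="utf-8")
    return xml_path
```

Calling `dump_pdf_tree("report.pdf")` should leave a `report.xml` next to the input, which any XML parser in another language can then read.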
It doesn’t work perfectly with all documents but it works well with many. Give it a try.