by voxadam 3045 days ago
What tools did you end up settling on for PDF data/text extraction? I ask because I have a side project I've been neglecting for far too long that depends in part on cleanly extracting text from PDFs (other formats too, but PDFs are by far the most headache-inducing).

I use PDFQuery, specifically its pdf.load() and pdf.tree.write() calls.

https://github.com/jcushman/pdfquery

> PDFQuery works by loading a PDF as a pdfminer layout, converting the layout to an etree with lxml.etree, and then applying a pyquery wrapper. All three underlying libraries are exposed, so you can use any of their interfaces to get at the data you want.

Here is a minimal Python script to dump the XML tree so you can load it in whatever other language you use and work with it from there.

    #!/usr/bin/env python3

    from pdfquery import PDFQuery

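    # Open the PDF and load all pages into an lxml tree.
    pdf = PDFQuery("some_document.pdf")
    pdf.load()

    # Dump the whole tree as XML so any other language can read it.
    pdf.tree.write("some_document.xml", pretty_print=True, encoding="utf-8")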

It doesn’t work perfectly with every document, but it works well with many. Give it a try:

    pip3 install pdfquery
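
If you want a quick sanity check of the dump before moving to another language, here is a rough sketch that reads the XML back with only the standard library and collects the text it finds. The helper name extract_text_lines is mine, not part of PDFQuery, and I'm assuming the text you care about sits in element text (as it does in the dumps I've looked at); it deliberately doesn't depend on specific tag names.

```python
import xml.etree.ElementTree as ET


def extract_text_lines(xml_source):
    """Return the stripped, non-empty text of every element, in document order.

    xml_source can be a filename or an open file object.
    Hypothetical helper for illustration; not part of PDFQuery.
    """
    tree = ET.parse(xml_source)
    lines = []
    for element in tree.iter():
        # Assumption: the useful text is stored as element text, not tail text.
        text = (element.text or "").strip()
        if text:
            lines.append(text)
    return lines
```

Calling extract_text_lines("some_document.xml") on the file written by the script above should give you a flat list of strings. The elements also carry position attributes, so if you need layout you'll want to look at those rather than just the text.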