by bringtheaction 3045 days ago
I use PDFQuery's pdf.load() and pdf.tree.write() for this.

https://github.com/jcushman/pdfquery

> PDFQuery works by loading a PDF as a pdfminer layout, converting the layout to an etree with lxml.etree, and then applying a pyquery wrapper. All three underlying libraries are exposed, so you can use any of their interfaces to get at the data you want.

Here is a minimal Python script to dump the XML tree so you can load it in whatever other language you use and work with it from there.

    #!/usr/bin/env python3

    from pdfquery import PDFQuery

    pdf = PDFQuery("some_document.pdf")
    pdf.load()
    pdf.tree.write("some_document.xml", pretty_print=True, encoding="utf-8")

It doesn’t work perfectly with all documents, but it works well with many. Give it a try.

    pip3 install pdfquery
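
If you end up staying in Python after all, here is a rough sketch of reading the dumped XML back with only the standard library. The element names and the x0/y0 attributes in the sample are my assumption of what a typical dump looks like (they follow pdfminer's layout class names), and `text_lines` is a made-up helper, so check it against your own some_document.xml before relying on it.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a fragment of some_document.xml.
# Real dumps carry more attributes and deeper nesting.
SAMPLE_XML = """<pdfxml>
  <LTPage page_index="0">
    <LTTextLineHorizontal x0="72.0" y0="700.0" x1="200.0" y1="712.0">Invoice 42 </LTTextLineHorizontal>
    <LTTextLineHorizontal x0="72.0" y0="680.0" x1="180.0" y1="692.0">Total: 10.00 </LTTextLineHorizontal>
  </LTPage>
</pdfxml>"""


def text_lines(root):
    """Return (x0, y0, text) for every element with coordinates and non-empty text."""
    lines = []
    for el in root.iter():
        text = (el.text or "").strip()
        if text and "x0" in el.attrib and "y0" in el.attrib:
            lines.append((float(el.get("x0")), float(el.get("y0")), text))
    # PDF coordinates start at the bottom-left corner, so sort by
    # descending y (top of page first), then by x within a row.
    lines.sort(key=lambda t: (-t[1], t[0]))
    return lines


root = ET.fromstring(SAMPLE_XML)
# For a real dump: root = ET.parse("some_document.xml").getroot()
for x0, y0, text in text_lines(root):
    print(f"{y0:7.1f} {x0:7.1f}  {text}")
```

Sorting by position is only a crude approximation of reading order, so multi-column pages will likely need something smarter.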