by bringtheaction
3045 days ago
I use the pdf.load and pdf.tree.write of PDFQuery. https://github.com/jcushman/pdfquery

> PDFQuery works by loading a PDF as a pdfminer layout, converting the layout to an etree with lxml.etree, and then applying a pyquery wrapper. All three underlying libraries are exposed, so you can use any of their interfaces to get at the data you want.

Here is a minimal Python script to dump the XML tree so you can load it in whatever other language you use and work with it from there.

#!/usr/bin/env python3
from pdfquery import PDFQuery
pdf = PDFQuery("some_document.pdf")
pdf.load()
pdf.tree.write("some_document.xml", pretty_print=True, encoding="utf-8")

It doesn’t work perfectly with all documents but it works well with many. Give it a try.

pip3 install pdfquery
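
If you would rather stay in Python for a first look at the dump, something like the following might help. It is only a rough sketch using the standard library: `summarize_tree` is a name made up for this example, and it makes no assumption about which element tags PDFQuery emits. It just counts whatever tags are present and collects any non-empty text.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_tree(xml_path):
    """Return (tag counts, [(tag, text), ...]) for a dumped XML file.

    Hypothetical helper for illustration; not part of PDFQuery.
    """
    root = ET.parse(xml_path).getroot()

    # Count how often each element tag occurs, root included.
    tag_counts = Counter(el.tag for el in root.iter())

    # Keep only elements that directly contain non-whitespace text.
    texts = [
        (el.tag, el.text.strip())
        for el in root.iter()
        if el.text and el.text.strip()
    ]
    return tag_counts, texts
```

Calling `summarize_tree("some_document.xml")` on the file written by the script above should show which tags dominate the tree and where the text lives, which is usually enough to decide what to query for next.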