Hacker News
by chazeon 1308 days ago
In the field of PDF parsing, I think the most interesting project I've encountered is pdfquery[1], where the PDF is parsed as an XML tree and you can use XPath to query it. You might hit rough edges when you put it into production, but the idea is like “how come I never thought of this”: a PDF already has a tree-like structure, so querying it as a tree should be a straightforward solution.

[1]: https://github.com/jcushman/pdfquery
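
For anyone who wants to see the idea concretely, here is a minimal sketch. It assumes pdfquery is installed and relies on the `PDFQuery(...)`, `load()` and `.tree` usage described in its README. The helper names `load_tree` and `lines_containing` are made up for illustration, and `SAMPLE_TREE` is a hand-written stand-in for a parsed PDF, not real pdfquery output.

```python
import xml.etree.ElementTree as ET


def load_tree(pdf_path):
    """Parse a PDF with pdfquery and return its element tree."""
    import pdfquery  # lazy import: the helper below also works without it

    pdf = pdfquery.PDFQuery(pdf_path)
    pdf.load()
    return pdf.tree


def lines_containing(tree, needle):
    """Return (text, x0, y0) for every horizontal text line containing needle."""
    hits = []
    # ".//Tag" is the XPath subset that both lxml and the stdlib understand.
    for el in tree.findall(".//LTTextLineHorizontal"):
        text = "".join(el.itertext()).strip()
        if needle in text:
            hits.append((text, el.get("x0"), el.get("y0")))
    return hits


# Hand-written stand-in tree (assumed shape, for demonstration only).
SAMPLE_TREE = ET.ElementTree(ET.fromstring(
    '<pdfxml><LTPage page_index="0"><LTTextBoxHorizontal>'
    '<LTTextLineHorizontal x0="72.0" y0="700.0">Your first name and initial '
    '</LTTextLineHorizontal>'
    '<LTTextLineHorizontal x0="72.0" y0="680.0">Last name </LTTextLineHorizontal>'
    '</LTTextBoxHorizontal></LTPage></pdfxml>'
))

print(lines_containing(SAMPLE_TREE, "first name"))
```

With a real file you would call `lines_containing(load_tree("some.pdf"), "first name")` instead. The coordinates that come back are what make it practical to then look for values positioned near a label.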

2 comments

There is a command-line utility (pdf2txt.py, from pdfminer.six) that will also parse the PDF into an XML tree, which you can then query with XPath. I found it works well.

https://pdfminersix.readthedocs.io/en/latest/reference/comma...
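
A rough sketch of that workflow, assuming the XML was produced with something like `pdf2txt.py -t xml -o out.xml input.pdf`. The `SAMPLE_XML` string below is a hand-written approximation of that output (one `<text>` element per character), and `page_lines` is a hypothetical helper name, so treat both as illustrative only.

```python
import xml.etree.ElementTree as ET

# Hand-written approximation of `pdf2txt.py -t xml` output, not a real dump.
SAMPLE_XML = """<pages>
<page id="1" bbox="0.000,0.000,612.000,792.000" rotate="0">
<textbox id="0" bbox="72.000,688.000,120.000,712.000">
<textline bbox="72.000,700.000,84.000,712.000">
<text font="Helvetica" bbox="72.000,700.000,80.000,712.000" size="12.000">H</text>
<text font="Helvetica" bbox="80.000,700.000,84.000,712.000" size="12.000">i</text>
<text>
</text>
</textline>
<textline bbox="72.000,688.000,96.000,700.000">
<text font="Helvetica" bbox="72.000,688.000,80.000,700.000" size="12.000">P</text>
<text font="Helvetica" bbox="80.000,688.000,88.000,700.000" size="12.000">D</text>
<text font="Helvetica" bbox="88.000,688.000,96.000,700.000" size="12.000">F</text>
<text>
</text>
</textline>
</textbox>
</page>
</pages>"""


def page_lines(root, page_id):
    """Join the per-character <text> elements of each <textline> on one page."""
    lines = []
    # XPath: every textline anywhere under the page with the given id.
    for line in root.findall(f".//page[@id='{page_id}']//textline"):
        chars = (t.text or "" for t in line.findall("text"))
        lines.append("".join(chars).strip())
    return lines


root = ET.fromstring(SAMPLE_XML)
print(page_lines(root, "1"))
```

For a file on disk, `ET.parse("out.xml").getroot()` would replace the `ET.fromstring` call. The standard library only supports a subset of XPath, so more elaborate queries would need lxml.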

That makes sense, as "pdfquery" uses pdfminer.six as a dep: https://github.com/jcushman/pdfquery/blob/master/requirement...
This is great, thank you for posting it.