


	by chazeon 1308 days ago
	On the subject of PDF parsing, I think the most interesting project I've encountered is pdfquery[1], which parses the PDF into an XML tree that you can query with XPath. You might hit rough edges when putting it into production, but the idea is one of those “how come I never thought of this” ones: PDF has a tree-like structure, so this should be a straightforward solution. [1]: https://github.com/jcushman/pdfquery
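To make the idea concrete, here is a minimal sketch of querying a PDF-like XML tree by path. It does not call pdfquery itself: the XML below is hand-written and hypothetical, with element names only loosely modeled on what layout-based PDF parsers tend to emit, and `lines_on_page` is an illustrative helper. It uses the limited XPath subset in Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the XML a PDF parser might produce.
# Element and attribute names here are assumptions for illustration.
SAMPLE = """
<pdfxml>
  <LTPage pageid="1">
    <LTTextLineHorizontal x0="72.0" y0="700.0">Invoice 1042</LTTextLineHorizontal>
    <LTTextLineHorizontal x0="72.0" y0="680.0">Total due: 310.00</LTTextLineHorizontal>
  </LTPage>
  <LTPage pageid="2">
    <LTTextLineHorizontal x0="72.0" y0="700.0">Terms and conditions</LTTextLineHorizontal>
  </LTPage>
</pdfxml>
"""


def lines_on_page(root, pageid):
    """Return the text of every text line on the page with the given id."""
    # ElementTree supports [@attr='value'] predicates in its XPath subset.
    path = f".//LTPage[@pageid='{pageid}']/LTTextLineHorizontal"
    return [el.text for el in root.findall(path)]


root = ET.fromstring(SAMPLE.strip())
page_one = lines_on_page(root, "1")
print(page_one)
```

Once the document is a tree, "find the text lines on page 1" becomes a one-line path expression instead of custom parsing code, which is the appeal described above. A real tool's tree will differ in element names and attributes.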

2 comments

There is a command-line utility (pdf2text) that will also parse the PDF into an XML tree, which you can then query with XPath. I found it works well.

This is great, thank you for posting it.