by chazeon
1308 days ago
In the field of PDF parsing, I think the most interesting project I have encountered is pdfquery[1], which parses the PDF into an XML tree that you can then query with XPath. You might hit rough edges when you put it into production, but the idea is one of those “how come I never thought of this” ones: a PDF has a somewhat tree-like structure, so this should be a straightforward solution. [1]: https://github.com/jcushman/pdfquery
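To make the "PDF as an XML tree" idea concrete, here is a rough sketch using only the Python standard library. It does not call pdfquery at all: the XML is a hand-written stand-in, and the element and attribute names (`LTPage`, `LTTextLineHorizontal`, `page_index`, `x0`, `y0`) as well as the helper `text_lines_on_page` are assumptions chosen for illustration, not confirmed pdfquery output.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the kind of tree a PDF-to-XML parser might emit.
# Element and attribute names are illustrative assumptions, not real output.
SAMPLE = """
<pdfxml>
  <LTPage page_index="0">
    <LTTextLineHorizontal x0="72.0" y0="700.0">Invoice #1234</LTTextLineHorizontal>
    <LTTextLineHorizontal x0="72.0" y0="680.0">Total: 99.50</LTTextLineHorizontal>
  </LTPage>
  <LTPage page_index="1">
    <LTTextLineHorizontal x0="72.0" y0="700.0">Terms and conditions</LTTextLineHorizontal>
  </LTPage>
</pdfxml>
"""


def text_lines_on_page(root, page_index):
    """Return the text of every text-line element on the given page.

    Hypothetical helper: uses the XPath subset supported by ElementTree.
    """
    path = f".//LTPage[@page_index='{page_index}']/LTTextLineHorizontal"
    return [el.text for el in root.findall(path)]


root = ET.fromstring(SAMPLE.strip())
first_page = text_lines_on_page(root, 0)
print(first_page)
```

Once the document is a tree, "find the lines on page 1" becomes a one-line path expression instead of a hand-rolled scan, which is the appeal of the approach.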
https://pdfminersix.readthedocs.io/en/latest/reference/comma...