by programmarchy 1308 days ago
I’ve done a bit of PDF wrangling in Python, so figured I’d describe the lay of the land.

PyPDF2 [1] is great for reading and writing PDF files, especially when dealing with pages, but it’s not great for generating paths, shapes, graphics, etc.

reportlab [2], on the other hand, has a great API for generating those things but is lacking in the file I/O and page management department. Fortunately, the content streams it generates can be plugged into PyPDF2 pretty easily.

Finally, there’s pdfplumber, which does an amazing job of parsing tabular data from PDF structures, and pytesseract, which can perform OCR on PDFs that are really just image data rather than structured data (once the pages have been rendered to images).

There’s not really a one-stop shop for PDFs, but there are some pretty good tools that can be combined to get the job done.

Will be curious to see how this project develops!

[1] https://pypi.org/project/PyPDF2/

[2] https://pypi.org/project/reportlab/

2 comments

QPDF is a good C++ library for "content preserving" PDF transformations, and is used by the Python PikePDF library.
I've found out the hard way that boxing/unboxing PDF primitives to and from Python is _really_ expensive, so my workflow with it has, counter-intuitively, been quite a lot slower than with PyPDF2.
I had the same experience. Thanks for the summary; I'll need to reread it next time.