| I’ve done a bit of PDF wrangling in Python, so I figured I’d describe the lay of the land. PyPDF2 [1] is great for reading and writing PDF files, especially when dealing with pages, but it’s not great for generating paths, shapes, graphics, etc. reportlab [2], on the other hand, has a great API for generating those things but is lacking in the file I/O and page management department. The content streams it generates can be plugged into PyPDF2 pretty easily, though. Finally, there’s pdfplumber, which does an amazing job of parsing tabular data out of PDF structures, and pytesseract, which can OCR PDFs that are really just image data rather than structured data (the pages have to be rendered to images first, since pytesseract works on images). There’s not really a one-stop shop for PDFs, but there are some pretty good tools that can be combined to get the job done. I’ll be curious to see how this project develops! [1] https://pypi.org/project/PyPDF2/ [2] https://pypi.org/project/reportlab/ |