Hacker News new | ask | show | jobs
by jerkstate 82 days ago
this is actually a great question - I just extract the text with PyPDF, but did a brief search on the functionality I'd like to have (convert math equations to LaTeX, extract images, reformat in markdown, extract data from charts) and it looks like there are a couple of promising Python libs like Docling and Marker.. I should really improve this part of my workflow.
1 comments

after looking into it for a little while, Docling and Marker work pretty well but are very slow. I haven't found anything else that extracts math suitably. It takes 10+ minutes per pdf, so I'm going to run it on a batch of these papers overnight and create my own little gaussian splatting RAG database. It's really too bad PDF is so terrible.