by folli 452 days ago
Hijacking this thread: what's currently the cheapest way to get structured data out of a PDF?

I assume there's some reasonable tool out there to convert PDFs to Markup and then feed the result to some LLM API at an okay cost (Gemini? DeepSeek?). Any suggestions?
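
To make the second half of that pipeline concrete, here is a rough Python sketch of the "feed it to an LLM" step. It assumes the PDF has already been converted to text or markup by some other tool. `call_llm` is a hypothetical placeholder for whichever API client you end up using (it is not a real library function), and the field names are invented for illustration. Chunking by character count is only a crude stand-in for token limits, which are what actually drive cost.

```python
import json
from typing import Callable


def chunk_text(text: str, max_chars: int = 8000) -> list[str]:
    """Group blank-line-separated paragraphs into chunks of roughly max_chars.

    A single paragraph longer than max_chars is kept whole rather than split.
    """
    chunks, current = [], ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def build_prompt(chunk: str, fields: list[str]) -> str:
    """Ask for one JSON object with exactly the requested keys."""
    return (
        "Extract the following fields from the document below and reply "
        f"with a single JSON object using exactly these keys: {', '.join(fields)}.\n"
        "Use null for anything that is missing.\n\n"
        f"{chunk}"
    )


def extract(
    text: str,
    fields: list[str],
    call_llm: Callable[[str], str],  # hypothetical: prompt in, raw reply out
    max_chars: int = 8000,
) -> list[dict]:
    """Run every chunk through the model and keep only well-formed replies."""
    results = []
    for chunk in chunk_text(text, max_chars):
        reply = call_llm(build_prompt(chunk, fields))
        try:
            record = json.loads(reply)
        except json.JSONDecodeError:
            continue  # skip replies that are not valid JSON
        if isinstance(record, dict):
            # Drop any keys the model added beyond the requested fields.
            results.append({name: record.get(name) for name in fields})
    return results
```

Fewer, larger chunks mean fewer API calls, so `max_chars` is the main cost knob here. Swapping providers only means passing a different `call_llm`.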

2 comments

https://mistral.ai/news/mistral-ocr, a recent release. It's been a step-function improvement for my pipelines.
I’m feeding PDFs directly to Gemini to extract tables, and so far the results are pretty good. There was a post on HN a few days ago about using Gemini for this task.