by edgyquant 745 days ago
Using python to dump the PDF to text, then using llama3 (8B) to parse it
1 comment

The "Using python to dump the PDF to text" step dramatically underestimates how hard this is.

Tables and especially multi-column PDFs often need one-off handling and - worse - you don't know when one is being misparsed until you start getting weird search results. At that point you need to debug your entire search pipeline, which isn't fun!
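
One cheap way to catch the multi-column case before it reaches search is to look for an empty vertical band down the middle of the page. Here is a rough sketch, not a robust detector. It assumes your extractor can give you the horizontal extent (x0, x1) of each word on a page; `find_gutter`, `min_gap`, and `margin_frac` are names and thresholds I made up for illustration, so tune them on your own documents.

```python
from typing import List, Optional, Tuple

# Horizontal extent (x0, x1) of one word, in page units (e.g. PDF points).
# Assumption: your PDF extractor can supply these; this sketch does not read PDFs.
Word = Tuple[float, float]


def find_gutter(
    words: List[Word],
    page_width: float,
    min_gap: float = 15.0,      # hypothetical threshold: narrowest band counted as a gutter
    margin_frac: float = 0.2,   # hypothetical: ignore gaps centered in the outer 20% of the page
) -> Optional[Tuple[float, float]]:
    """Return the widest empty vertical band near the page center, or None."""
    lo = page_width * margin_frac
    hi = page_width * (1.0 - margin_frac)
    best: Optional[Tuple[float, float]] = None
    covered_until: Optional[float] = None

    # Sweep left to right, tracking how far the words seen so far reach.
    for x0, x1 in sorted(words):
        if covered_until is not None and x0 > covered_until:
            width = x0 - covered_until
            center = (x0 + covered_until) / 2.0
            if width >= min_gap and lo <= center <= hi:
                if best is None or width > best[1] - best[0]:
                    best = (covered_until, x0)
        covered_until = x1 if covered_until is None else max(covered_until, x1)
    return best


# Toy data: two columns with a gap between x=280 and x=320 on a 600-unit-wide page.
two_column = [(50.0, 120.0), (125.0, 280.0), (320.0, 400.0), (405.0, 550.0)]
single_column = [(50.0, 300.0), (305.0, 550.0)]

print(find_gutter(two_column, 600.0))     # (280.0, 320.0)
print(find_gutter(single_column, 600.0))  # None
```

A page that returns a gutter is a candidate for one-off handling or at least manual review. Note the obvious blind spot: a single full-width heading or table on a two-column page covers the gap, so this only flags the clean cases. You would probably want to run it per horizontal slice of the page rather than on the whole page at once.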