by nl 745 days ago
The "Using python to dump the PDF to text" part dramatically underestimates how hard this is.

Tables and especially multi-column PDFs often need one-off handling and - worse - you don't know when one is being misparsed until you start getting weird search results. At that point you need to debug your entire search pipeline, which isn't fun!
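
One way to catch some of this at ingestion time rather than through odd search results is a cheap layout check before you trust the extracted text. Below is a rough sketch, not a real fix: it assumes your extractor can give you the horizontal extent (x0, x1) of each word on a page, and it just looks for a wide empty vertical band near the middle of the page, which usually means two columns. The function name, the 15-unit gutter threshold and the 200-bin resolution are all made up for illustration and would need tuning on real documents.

```python
from typing import List, Optional, Tuple

# (x0, x1): horizontal extent of one word, in page units (assumed input format)
Word = Tuple[float, float]


def find_column_gutter(
    words: List[Word],
    page_width: float,
    min_gutter: float = 15.0,  # hypothetical threshold, tune per corpus
    bins: int = 200,           # hypothetical horizontal resolution
) -> Optional[float]:
    """Return the x-center of the widest empty vertical band in the
    middle 60% of the page if it is at least min_gutter wide, else None."""
    if not words or page_width <= 0:
        return None

    bin_w = page_width / bins
    covered = [False] * bins

    # Mark every horizontal bin that any word overlaps.
    for x0, x1 in words:
        lo = max(0, int(x0 / bin_w))
        hi = min(bins - 1, int(x1 / bin_w))
        for i in range(lo, hi + 1):
            covered[i] = True

    # Find the longest run of uncovered bins, ignoring the page margins.
    start, end = int(bins * 0.2), int(bins * 0.8)
    best_len, best_start = 0, 0
    run_len, run_start = 0, 0
    for i in range(start, end):
        if covered[i]:
            run_len = 0
            continue
        if run_len == 0:
            run_start = i
        run_len += 1
        if run_len > best_len:
            best_len, best_start = run_len, run_start

    if best_len * bin_w >= min_gutter:
        return (best_start + best_len / 2) * bin_w
    return None
```

Pages where this returns a value can be routed to column-aware handling or flagged for a manual look. It will not catch tables, and it will miss pages where a full-width heading or figure crosses the gutter, so treat it as a tripwire and not as a parser.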