by giovannibonetti 1230 days ago
Since you are working with raw text, it shouldn't take much effort. There are a bunch of open source tools to extract text from PDFs.

The hard part would be parsing tables and other layout-dependent semantics. You usually start with text coordinates (like HTML elements with absolute positioning) and have to work backwards from that. I worked for some years on a project for a client that was full of edge cases, because whenever the input PDF (from a government agency) had a slight layout change, the parser would break. It took multiple iterations to make it robust enough.
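
To give a rough idea of what "working backwards from coordinates" looks like, here is a minimal sketch in Python. It assumes some extractor has already given you words with positions. The `Word` class, the function names, and the `y_tol` and `gap` thresholds are all made up for illustration and are not taken from any particular library or from the project I mentioned.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical positioned word, as a PDF text extractor might report it."""
    text: str
    x: float      # left edge
    y: float      # vertical position of the line
    width: float  # horizontal extent


def group_rows(words, y_tol=2.0):
    """Cluster words into rows: words whose y is within y_tol of the
    first word in the current row are treated as being on the same line."""
    rows = []
    for w in sorted(words, key=lambda w: (w.y, w.x)):
        if rows and abs(rows[-1][0].y - w.y) <= y_tol:
            rows[-1].append(w)
        else:
            rows.append([w])
    # Restore left-to-right reading order inside each row.
    return [sorted(r, key=lambda w: w.x) for r in rows]


def split_cells(row, gap=10.0):
    """Split a row into cells wherever the horizontal gap between
    consecutive words exceeds the (assumed) column gap threshold."""
    cells = [[row[0]]]
    for prev, cur in zip(row, row[1:]):
        if cur.x - (prev.x + prev.width) > gap:
            cells.append([cur])
        else:
            cells[-1].append(cur)
    return [" ".join(w.text for w in c) for c in cells]


def extract_table(words, y_tol=2.0, gap=10.0):
    return [split_cells(r, gap) for r in group_rows(words, y_tol)]


if __name__ == "__main__":
    sample = [
        Word("Name", 10, 100, 30), Word("Amount", 120, 100.5, 40),
        Word("John", 10, 115, 25), Word("Smith", 38, 115.4, 30),
        Word("42.00", 120, 114.8, 30),
    ]
    print(extract_table(sample))
    # [['Name', 'Amount'], ['John Smith', '42.00']]
```

The fixed thresholds are exactly the kind of thing that breaks when the layout shifts slightly: a different font size or column spacing changes which gaps count as cell boundaries, so in practice they have to be tuned or derived from the page itself.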