by giovannibonetti
1230 days ago
Since you are working with raw text, it shouldn't take too much effort. There are plenty of open source tools for extracting text from PDFs. The hard part is parsing tables and other layout-dependent semantics. You usually start with text coordinates (like HTML elements with absolute positioning) and have to work backwards from that. I worked for some years on a project for a client that was full of edge cases: whenever the input PDF (from a government agency) had a slight layout change, the parser would break. It took multiple iterations to make it robust enough.
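To give a rough idea of what "working backwards" from coordinates looks like, here is a minimal sketch in Python. It assumes you already have words with x/y positions from whatever extraction tool you use; the `Word` type, the `group_into_rows` helper, and the `y_tolerance` value are all hypothetical names made up for illustration, not part of any particular library.

```python
from typing import NamedTuple


class Word(NamedTuple):
    """A piece of text with its position on the page (hypothetical structure)."""
    text: str
    x: float  # horizontal position of the left edge
    y: float  # vertical position of the line the word sits on


def group_into_rows(words, y_tolerance=2.0):
    """Cluster words into rows by vertical position, then order each row left to right.

    Words whose y differs from the first word of the current row by at most
    y_tolerance are treated as being on the same row. The tolerance is an
    assumption that has to be tuned per document.
    """
    rows = []
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if rows and abs(word.y - rows[-1][0].y) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w.text for w in sorted(row, key=lambda w: w.x)] for row in rows]


if __name__ == "__main__":
    # Made-up sample: two table rows whose cells are slightly misaligned vertically.
    sample = [
        Word("Amount", 200.0, 100.5),
        Word("Name", 10.0, 100.0),
        Word("42.00", 200.0, 119.4),
        Word("Alice", 10.0, 120.0),
    ]
    print(group_into_rows(sample))
```

This is also where the fragility comes from: if a layout change shifts the rows by more than the tolerance, or adds a wrapped cell, the grouping silently produces different rows, so a real parser needs more checks than this.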