by willvarfar
2339 days ago
I've worked with several companies that try to parse PDF documents, extracting tables, paragraphs, and so on. This is genuinely challenging because a PDF is essentially a large bag of words and word fragments, each with an x/y position. One particularly popular word processor even emits individual characters. Just determining that two fragments are part of the same word is hard, as is detecting bullet points and similar structure. The AI approaches are definitely still worse than human-written rules. From the quality of the text and table extraction, I can infer (and I've chatted with the devs to confirm) whether a company is using a modern NN approach or whether someone has sat down and handwritten some simple rules that understand indents, baselines, etc.
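To give a feel for what "simple rules that understand baselines" can look like, here is a minimal sketch. It is not any particular company's code: the `Fragment` type, the function names, and both tolerance values are hypothetical, and it assumes PDF-style coordinates where y grows upward. Fragments are first clustered into lines by baseline, then neighbours on a line are merged into one word when the horizontal gap is small relative to the font size.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical text fragment as it might come out of a PDF."""
    text: str
    x0: float        # left edge
    x1: float        # right edge
    baseline: float  # y of the text baseline (y grows upward, as in PDF)
    size: float      # font size in points


def group_into_lines(fragments, baseline_tol=0.2):
    """Cluster fragments into lines; baseline_tol is a guessed fraction of font size."""
    lines = []
    # Highest baseline first, so lines come out in top-to-bottom reading order.
    for frag in sorted(fragments, key=lambda f: -f.baseline):
        if lines and abs(lines[-1][0].baseline - frag.baseline) <= baseline_tol * frag.size:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Within each line, order fragments left to right.
    return [sorted(line, key=lambda f: f.x0) for line in lines]


def merge_words(line, gap_tol=0.15):
    """Join adjacent fragments whose gap is below gap_tol * font size (a guess)."""
    words = []
    prev = None
    for frag in line:
        if prev is not None and frag.x0 - prev.x1 <= gap_tol * frag.size:
            words[-1] += frag.text   # same word: tiny or no gap
        else:
            words.append(frag.text)  # new word: first fragment or a real space
        prev = frag
    return words


def extract_words(fragments):
    return [merge_words(line) for line in group_into_lines(fragments)]


# Fragments deliberately out of order, with one word split in two.
sample = [
    Fragment("world", 39.0, 65.0, 700.05, 12.0),
    Fragment("item", 20.0, 40.0, 686.0, 12.0),
    Fragment("H", 10.0, 17.0, 700.0, 12.0),
    Fragment("•", 10.0, 14.0, 686.0, 12.0),
    Fragment("ello", 17.2, 35.0, 700.0, 12.0),
]
print(extract_words(sample))
```

Real documents need far more than this (kerning, superscripts, rotated text, multiple columns), which is exactly why the handwritten rules take effort to get right.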
[1] https://edinburghhacklab.com/2013/09/probabalistic-scraping-...