by ivansavz 2339 days ago
Also in the extracting-structured-data-from-PDFs solution space, there is Parsr, which was recently posted on HN: https://github.com/axa-group/Parsr (see https://news.ycombinator.com/item?id=22035258). It's based on a pipeline of various JS modules and pluggable backends (e.g. Tesseract, Google Cloud Vision, ABBYY API).

For tables with numbers in them, it worked pretty well, but I have yet to find a tool that can parse/understand documents where the entire page is a table layout with lots of merged cells. I think even for humans it's hard to understand the structure in those cases...