Hacker News
by Animats 2344 days ago
Table extraction has been a feature of better OCR programs for at least a decade. It's easier than the OCR part. Look up "OCR table" for examples, products, code, papers, etc.
6 comments

You'd think that until you try them on tables with empty cells that still need to be recognized, or on tables that span multiple pages. I wouldn't say this has been solved for a decade.
I wish it were, but it isn't. There are various kinds of tables: they may or may not have delimiting lines, or they may consist of unaligned cells, each showing a key and a value... If you actually have in mind some solution that works well (a paper, a GitHub project, or a commercial product), I'd be eager to know.
You're wrong. Robust and easy-to-use table extraction might be solvable, but from a business perspective it isn't solved.
Did you try https://extracttable.com?

The mentioned service is not perfect either. There are always limitations; minimizing them is the key.

P.S. I work with the team at extracttable.

It does not work reliably, and the quality is not something you can only sell as an add-on feature. This is what Abbyy does, for example.
Gridded tables are not too hard, but once you remove the grid lines, even a portion of them, it becomes a complete crapshoot.
Then go ahead, make a table extractor for PDF and get very rich. A lot of people have tried.