HN Mirror


	by Animats 2344 days ago
	Table extraction has been a feature of better OCR programs for at least a decade. It's easier than the OCR part. Look up "OCR table" for examples, products, code, papers, etc.

6 comments

curiousgal 2344 days ago

You'd think that until you try them on tables that contain empty cells you still need recognized, or on tables that span multiple pages. I wouldn't say this has been solved for a decade.

pathsjs 2344 days ago

I wish it were, but it isn't. There are various kinds of tables: they may or may not have ruled lines, or they may consist of unaligned cells, each showing a key and a value... If you actually have in mind some solution that works well (a paper, a GitHub project, or a commercial product), I'd be eager to know.

m1sta_ 2344 days ago

You're wrong. Robust and easy-to-use table extraction might be solvable, but from a business perspective it isn't solved.

saradhi 2344 days ago

Did you try https://extracttable.com?

The mentioned service is not perfect either. There are always limitations; minimizing them is the key.

P.S.: I work with the team at extracttable

tastyminerals 2344 days ago

It does not work reliably, and at that quality it is something you can only sell as an add-on feature. This is what Abbyy does, for example.

tensor 2344 days ago

Gridded tables are not too hard, but once you remove the grid lines, even a portion of them, it becomes a complete crapshoot.

Ididntdothis 2344 days ago

Then go ahead, make a table extractor for PDF and get very rich. A lot of people have tried.