by raunakchowdhuri 598 days ago

Love the PubTables work! It's a really useful dataset. Their data comes from existing annotations in scientific papers, so in our experience it misses many of the hardest cases that current methods fail on. The annotations are computer-generated rather than manually labeled, so you don't get things like scanned or rotated images, or much diversity in languages.

I'd encourage you to take a look at some of our data points to compare for yourself! Link: huggingface.co/spaces/reducto/rd_table_bench

In terms of the overall importance of table extraction, we've found it to be a key bottleneck for folks looking to do document parsing. It's up there among the hardest problems in the space, alongside complex form region parsing. I don't have the exact statistics handy, but I'd estimate that ~25% of the pages we parse have some hairy tables in them!