Hacker News
by gharman 1148 days ago
Indeed! I built a system just last year with - count ’em - three parsers to deal with PDF table extraction, including one built on TableTransformer. And then when GPT-4 came out I just copy-pasted a PDF into it as-is, and darned if it didn’t do at least as good a job.

Now I can’t do this in earnest because of document privacy issues, but I’ve been diving down the rabbit hole of how small we can go and still get decent results. Spoiler: GPT-2 is too small. :-)

1 comment

If you were asked to extract lists or tables from HTML pages only, how would you go about it?

I was thinking: a) use the metric from TableTransformer to detect the structured data; b) use the MarkupLM model, maybe combined with TableTransformer; c) find a way to work directly with GPT-4.
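
For comparison with a), b) and c), a plain-parser baseline is cheap to try first. This is only a minimal sketch, assuming the data sits in real `<table>` markup with explicit closing tags; it will miss div-based layouts and gets confused by nested tables, which is where a model would have to earn its keep. `TableExtractor` and `extract_tables` are made-up names for illustration; only `html.parser` from the Python standard library is real.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell strings.

    Hypothetical helper: assumes well-formed, non-nested tables.
    """

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables
        self._table = None  # rows of the table currently open, if any
        self._row = None    # cells of the row currently open, if any
        self._cell = None   # text fragments of the cell currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = []
        elif tag == "tr" and self._table is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._table.append(self._row)
            self._row = None
        elif tag == "table" and self._table is not None:
            self.tables.append(self._table)
            # Reset all state so stray closing tags cannot touch a closed table.
            self._table = self._row = self._cell = None


def extract_tables(html):
    """Return all tables found in an HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


if __name__ == "__main__":
    page = (
        "<p>intro</p>"
        "<table><tr><th>Name</th><th>Qty</th></tr>"
        "<tr><td>apple</td><td> 3 </td></tr></table>"
    )
    for table in extract_tables(page):
        for row in table:
            print(row)
```

Whatever this baseline fails on (tables faked with CSS, lists that are tabular in spirit) is a reasonable test set for deciding whether MarkupLM or GPT-4 is worth the extra cost.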