| HN Mirror



	by gharman 1148 days ago
	Indeed! I built a system just last year with (count 'em) three parsers to deal with PDF table extraction, including one built on TableTransformer. And then when GPT-4 came out I just copy-pasted a PDF into it as-is and darned if it didn't do at least as good a job. Now, I can't do this in earnest because of document privacy issues, but I've been diving down the rabbit hole of how small we can go and still get decent results. Spoiler: GPT-2 is too small. :-)

1 comment

ekabod 1139 days ago

If you were asked to extract lists or tables from HTML pages only, how would you go about it?

I was thinking: a) use the metric used in TableTransformer to detect the structured data; b) use the MarkupLM model, maybe mixed with TableTransformer; c) find a way to work directly with GPT-4.
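For pages where the data already sits in explicit <table> markup, a plain parser may be a useful baseline to compare any of the three options against. Below is a minimal sketch using only Python's standard library; the names TableExtractor and extract_tables are made up for illustration, and it assumes simple, non-nested tables (it does nothing for lists or for tables built out of styled <div>s, which is where the model-based approaches would come in).

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each <table> as a list of rows; each row is a list of cell strings.

    Hypothetical helper for illustration. Nested tables are not handled.
    """

    def __init__(self):
        super().__init__()
        self.tables = []    # finished tables
        self._rows = None   # rows of the table being read, or None outside a table
        self._row = None    # cells of the row being read, or None outside a row
        self._cell = None   # text fragments of the cell being read, or None

    def _flush_cell(self):
        # Join the fragments and collapse runs of whitespace into single spaces.
        if self._cell is not None and self._row is not None:
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table" and self._rows is None:
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._flush_cell()  # tolerate a missing closing cell tag
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._flush_cell()
        elif tag == "tr" and self._row is not None:
            self._flush_cell()
            self._rows.append(self._row)
            self._row = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None


def extract_tables(html):
    """Return every table found in the HTML string as nested lists of strings."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


if __name__ == "__main__":
    page = (
        "<table><tr><th>Name</th><th>Qty</th></tr>"
        "<tr><td>Apple</td><td> 3 </td></tr></table>"
    )
    for table in extract_tables(page):
        for row in table:
            print(row)
```

The rows that come back could also be serialized and handed to an LLM as plain text, which avoids pasting raw HTML into the prompt.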
