| HN Mirror



anonymouse008 1173 days ago

> text extracted using tesseract

You're saying that 'the text', without normalizing the rows and columns (basically the tab-, space-, or newline-delimited text with sporadic lines per row), was all you needed to send? I still have to normalize my tables even for GPT-4, I guess because I have weird merged rows and columns that try to layer grouping info on top of the table data itself.


celestialcheese 1173 days ago

Exactly. I just sent the raw tesseract output, with no formatting or "fix the OCR text" step. So the data looked like:

```
col1col2col3\nrow label\tdatapoint1\tdatapoint2...
```

Very messy.

I don't think this is generalizable with the same 100% accuracy across any OCR output (they can be _really_ bad). I'm still planning on doing a first pass with a better table OCR system like Textract, DocumentAI, PaddlePaddle Table, etc., which should improve accuracy.
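For illustration only, here is a minimal sketch of the "send the raw OCR text as-is" idea. The function name `build_table_prompt`, the prompt wording, and the sample string are all assumptions made for this example; the thread does not say what prompt was actually used. The point is simply that the OCR output goes in untouched, with no normalization step.

```python
def build_table_prompt(raw_ocr_text: str, question: str) -> str:
    """Wrap unmodified OCR output in a prompt (hypothetical wording).

    No cleanup is applied: run-together headers and tab-separated
    cells are passed through exactly as the OCR engine produced them.
    """
    return (
        "The following is raw OCR output of a table. "
        "Columns may be run together or separated by tabs.\n"
        "--- OCR TEXT START ---\n"
        f"{raw_ocr_text}\n"
        "--- OCR TEXT END ---\n"
        f"Question: {question}"
    )


# Hypothetical sample shaped like the messy output described above.
raw = "col1col2col3\nrow label\tdatapoint1\tdatapoint2"
prompt = build_table_prompt(raw, "What is the first datapoint for 'row label'?")
```

The resulting `prompt` string would then be passed to whichever model client is in use; that call is left out here because it depends on the provider.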


anonymouse008 1173 days ago

That’s still super cool!

Yeah, my use cases are in the really bad category. I’ve been building parsers for a while, and I’ve basically given up and fallen back to manually stating the rows of interest, with "if present" logic for each. Camelot got so close, but I ended up building my own control layer on top of pdfminer.six to accommodate my tables (I’d recommend Camelot if you’re still exploring). It absolutely sucks needing to be so specific out of the gate, but at least the context rarely changes.
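A rough sketch of what "rows of interest, if present" logic might look like, assuming the extracted text is one row per line with tab-separated cells. The helper `pick_rows` and the sample labels are hypothetical and are not the commenter's actual code; a real layer over pdfminer.six would work from layout objects rather than a plain string.

```python
from typing import Dict, List, Optional


def pick_rows(text: str, labels: List[str]) -> Dict[str, Optional[List[str]]]:
    """Return the cells that follow each label of interest, or None if absent."""
    found: Dict[str, Optional[List[str]]] = {label: None for label in labels}
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.split("\t")]
        label = cells[0]
        # Keep only the first occurrence of each requested label.
        if label in found and found[label] is None:
            found[label] = cells[1:]
    return found


# Hypothetical extracted text: one junk line, two data rows.
sample = "Header junk\nrow A\t120\t135\nrow B\t900\t950"
rows = pick_rows(sample, ["row A", "row C"])
```

Labels that never appear stay `None`, so downstream code can tell "row missing" apart from "row present but empty".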


pplante 1173 days ago

What is the source of these nasty docs? I am also working on a layer above pdfminer.six to parse tables. It seems like this task is never done. LLMs have had mixed results for me too. I am focused on documents containing invoices, income statements, etc from the real estate industry.

My email is in my profile if you want to reach out and compare notes!


swyx 1173 days ago

Better: you can do it by copy-pasting from a PDF to GPT on your phone! https://twitter.com/swyx/status/1610247438958481408


anonymouse008 1173 days ago

Definitely tried that way too, and it didn’t work. My tables are pretty dang dumb: merged cells, confidence intervals, and weird characters in the cell field that change based on the row values, which messes up a simple regex test. It’s really a billion-dollar-company solution, but I’m about to punt it to the moon because it’s never fully done.
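To make the regex problem concrete, here is a small sketch of a cell parser that tolerates an optional confidence interval after the value. The cell format `value (low-high)` is an assumption chosen for illustration, as are the names `CELL` and `parse_cell`; the actual tables in question are not shown in the thread and are likely messier than this.

```python
import re
from typing import Optional, Tuple

# Assumed cell shape: "12.3" or "12.3 (10.1-14.5)"; separators "-", "," or "to".
CELL = re.compile(
    r"^\s*(-?\d+(?:\.\d+)?)"          # point estimate
    r"(?:\s*\(\s*(-?\d+(?:\.\d+)?)"   # optional lower bound
    r"\s*(?:-|,|to)\s*"               # separator between the bounds
    r"(-?\d+(?:\.\d+)?)\s*\))?\s*$"   # optional upper bound
)


def parse_cell(cell: str) -> Optional[Tuple[float, Optional[float], Optional[float]]]:
    """Return (value, low, high), or None when the cell does not match."""
    match = CELL.match(cell)
    if match is None:
        return None
    value, low, high = match.groups()
    return (
        float(value),
        float(low) if low is not None else None,
        float(high) if high is not None else None,
    )
```

A plain `float(cell)` would reject every interval cell, while this version returns `None` only for cells that fit neither shape, which makes the leftovers easy to log and review by hand.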
