| HN Mirror


by celestialcheese 1171 days ago

Exactly. I just sent the raw Tesseract output, with no formatting or "fix the OCR text" step. So the data looked like:

```
col1col2col3\nrow label\tdatapoint1\tdatapoint2...
```

Very messy.
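To make the messiness concrete, here is a minimal sketch of what naive splitting does to text of that shape. The sample string and the `naive_split` helper are hypothetical, made up to mimic the snippet above; they are not the actual data or pipeline.

```python
# Hypothetical sample mimicking the shape described above; not real data.
raw = "col1col2col3\nrow label\tdatapoint1\tdatapoint2"


def naive_split(raw_text):
    """Split raw OCR text into rows on newlines and into cells on tabs."""
    return [line.split("\t") for line in raw_text.splitlines() if line.strip()]


rows = naive_split(raw)
header, body = rows[0], rows[1:]

# The header comes back as one fused cell while the data row has three cells,
# so the column boundaries cannot be recovered from the header alone.
print(len(header), len(body[0]))
```

The column names run together with no delimiter, which is the kind of thing a cleanup step or a table-aware OCR pass would normally have to untangle.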

I don't think this generalizes with the same 100% accuracy to arbitrary OCR output (it can be _really_ bad). I'm still planning on doing a first pass with a better table OCR system like Textract, DocumentAI, PaddlePaddle Table, etc., which should improve accuracy.


anonymouse008 1171 days ago

That’s still super cool!

Yeah, my use cases are in the really bad category. I’ve been building parsers for a while, and I’ve basically given up and resorted to manually stating the rows of interest with “if present” logic. Camelot got so close, but I ended up building my own control layer on top of pdfminer.six to accommodate (I’d recommend Camelot if you’re still exploring). It absolutely sucks needing to be so specific out of the gate, but at least the context rarely changes.
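For anyone wondering what "rows of interest, if present" might look like, here is a rough sketch. It is not the actual control layer: the labels, the `extract_rows` helper, and the number pattern are all hypothetical, and it assumes the page text has already been pulled into plain lines (for example with pdfminer.six's `extract_text`).

```python
import re

# Hypothetical row labels; in practice these are stated by hand per document type.
ROWS_OF_INTEREST = ["Total", "Subtotal"]

# Loose pattern for figures such as 1,200 or -45.50 (an assumption, not exhaustive).
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_rows(lines, labels=ROWS_OF_INTEREST):
    """Return {label: [numbers]} for each label that is present in the lines."""
    found = {}
    for line in lines:
        # Collapse tabs and repeated spaces so label matching is predictable.
        text = " ".join(line.split())
        for label in labels:
            if label not in found and text.lower().startswith(label.lower()):
                found[label] = NUMBER.findall(text[len(label):])
    return found


sample = ["Header junk", "Total   1,200\t1,350", "Notes"]
result = extract_rows(sample)
print(result)
```

Labels that never appear are simply left out of the result, which is the "if present" part. The prefix match is deliberately crude (it would also catch a line starting with "Totally"), so a real version would need tighter matching.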


pplante 1171 days ago

What is the source of these nasty docs? I am also working on a layer above pdfminer.six to parse tables. It seems like this task is never done. LLMs have had mixed results for me too. I am focused on documents containing invoices, income statements, etc., from the real estate industry.

My email is in my profile if you want to reach out and compare notes!
