by celestialcheese
1124 days ago
Exactly. I just sent the raw Tesseract output, with no formatting or "fix the OCR text" step, so the data looked like:
```
col1col2col3\nrow label\tdatapoint1\tdatapoint2...
```
Very messy. I don't think this generalizes with the same 100% accuracy to arbitrary OCR output (it can be _really_ bad). I'm still planning to do a first pass with a better table OCR system like Textract, DocumentAI, PaddlePaddle Table, etc., which should improve accuracy.
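
To make the "messy" part concrete, here is a minimal sketch of what that kind of raw text looks like once split on newlines and tabs. The function name `split_raw_ocr` and the sample string are hypothetical, modeled only on the snippet above; this is not the actual pipeline.

```python
def split_raw_ocr(text: str) -> list[list[str]]:
    """Split raw OCR text into rows of cells using newlines and tabs."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        rows.append([cell.strip() for cell in line.split("\t")])
    return rows


# Hypothetical sample shaped like the snippet above: the header cells are
# fused together, while the data row is tab-separated.
sample = "col1col2col3\nrow label\tdatapoint1\tdatapoint2"
rows = split_raw_ocr(sample)
```

The header comes back as a single fused cell, so column names cannot be recovered by splitting alone, which is part of why a table-aware OCR pass should help.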
|
Yeah, my use cases are in the really bad category. I've been building parsers for a while, and I've basically given up and fallen back to manually specifying rows of interest with "if present" logic. Camelot got so close, but I ended up building my own control layer on top of pdfminer.six to accommodate them (I'd recommend Camelot if you're still exploring). It absolutely sucks needing to be so specific out of the gate, but at least the context rarely changes.
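
A rough sketch of the "rows of interest, if present" idea, assuming the text has already been extracted into lines. The labels, the function name `extract_rows_of_interest`, and the sample lines are all hypothetical; this is not the pdfminer.six control layer itself.

```python
# Hypothetical row labels stated up front; a real list would be document-specific.
ROWS_OF_INTEREST = ["Total revenue", "Net income"]


def extract_rows_of_interest(lines, labels):
    """Return {label: [values]} for each label found at the start of a line."""
    found = {}
    for line in lines:
        collapsed = " ".join(line.split())  # normalize tabs and repeated spaces
        for label in labels:
            if label in found:
                continue  # keep only the first occurrence of each label
            if collapsed.lower().startswith(label.lower()):
                found[label] = collapsed[len(label):].split()
    return found


# Hypothetical extracted lines; "Net income" is deliberately absent.
lines = ["Header junk", "Total revenue\t1,200\t1,350", "Other row 5 6"]
result = extract_rows_of_interest(lines, ROWS_OF_INTEREST)
```

Labels that never appear are simply missing from the result, so the caller decides what an absent row means.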