by anonymouse008
1126 days ago
> text extracted using tesseract

You're saying 'the text', without normalizing the rows and columns (basically the tab-, space-, or newline-delimited text with sporadic lines per row), was all you needed to send? I still have to normalize my tables even for GPT-4, I guess because I have weird merged rows and columns that try to encode grouping info on top of the table data itself.
|
```
col1col2col3\nrow label\tdatapoint1\tdatapoint2...
```
Very messy.
I don't think this generalizes with the same 100% accuracy across arbitrary OCR output (it can be _really_ bad). I'm still planning on doing a first pass with a better table OCR system like Textract, DocumentAI, PaddlePaddle Table, etc., which should improve accuracy.
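
For anyone wondering what "normalizing" could look like at its simplest, here is a rough sketch, not what either of us actually ran. It assumes the OCR output is tab-delimited with one table row per line; `normalize_rows` and the `fill` parameter are made-up names for illustration. It only pads ragged rows to a common width, so it does nothing about fused headers or merged cells.

```python
def normalize_rows(ocr_text: str, fill: str = "") -> list[list[str]]:
    """Split tab/newline-delimited OCR text into rows padded to equal width.

    Hypothetical helper: assumes tabs separate cells and newlines separate rows.
    """
    rows = []
    for line in ocr_text.splitlines():
        if not line.strip():
            continue  # drop blank lines that OCR sprinkles between rows
        rows.append([cell.strip() for cell in line.split("\t")])

    # Pad every row to the widest row so downstream code sees a rectangle.
    width = max((len(row) for row in rows), default=0)
    return [row + [fill] * (width - len(row)) for row in rows]


# Sample shaped like the messy output above (header cells fused together).
sample = "col1col2col3\nrow label\tdatapoint1\tdatapoint2"
table = normalize_rows(sample)
```

Note the fused header still comes out as a single cell, which is exactly the kind of thing a dedicated table OCR pass would have to fix.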