by Oras 565 days ago
Thank you. This is a mix of OCR and an LLM; I was wondering whether there is a library that avoids that combination.

A better approach would be to use Textract, since it preserves the flow of the document, for example when a table spans multiple pages.

Btw, Tesseract is not very accurate at extracting data from tables. Use it with caution, especially in a financial context.

I have made an open source tool that shows the data missed by Tesseract and EasyOCR: https://github.com/orasik/parsevision/
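
To give a rough idea of what "missing data" means here, below is a minimal sketch that compares a known-good reference text against OCR output and lists the tokens that were dropped. This is only an illustration and not how parsevision itself works; the function name `missing_tokens` and the sample strings are made up for the example.

```python
from collections import Counter


def missing_tokens(reference: str, ocr_output: str) -> list[str]:
    """Return tokens found in the reference text but missing from the OCR output.

    Hypothetical helper for illustration only: tokens are split on whitespace,
    and repeated tokens are counted, so a value that appears twice in the
    reference but once in the OCR output is reported once.
    """
    ref_counts = Counter(reference.split())
    ocr_counts = Counter(ocr_output.split())
    # Counter subtraction keeps only the tokens with a positive remaining count.
    missing = ref_counts - ocr_counts
    return sorted(missing.elements())


# Made-up sample data: the OCR output has lost the last figure of a table row.
reference = "Revenue 1,250.00 Cost 830.50 Profit 419.50"
ocr_output = "Revenue 1,250.00 Cost 830.50 Profit"
print(missing_tokens(reference, ocr_output))  # prints ['419.50']
```

A whitespace token comparison like this only catches dropped or altered values; it says nothing about values that end up in the wrong table cell, which is a separate problem.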

1 comment

Nice, I really liked it!