Hacker News
by vikp 842 days ago
You may want to try https://github.com/VikParuchuri/surya (I'm the author). I've only benchmarked against tesseract, but it outperforms it by a lot (benchmarks in repo). Happy to discuss.

You could also try https://github.com/VikParuchuri/marker for general PDF parsing (I'm also the author) - it seems like you're more focused on tables.

How does surya compare to AWS Textract? A previous employer went through a bunch of different OCRs and ended up using Textract because they found it to be the most accurate overall.
I unfortunately haven't had time to benchmark against more than tesseract.
That’s my experience as well. I am still looking for alternatives, but Textract is now the baseline.
Thanks for sharing.