


	by vikp 890 days ago
	You may want to try https://github.com/VikParuchuri/surya (I'm the author). I've only benchmarked it against Tesseract, but surya outperforms it by a wide margin (benchmarks are in the repo). Happy to discuss. You could also try https://github.com/VikParuchuri/marker for general PDF parsing (I'm also the author), although it seems like you're more focused on tables.

2 comments

raffraffraff 890 days ago

How does surya compare to AWS Textract? A previous employer went through a bunch of different OCRs and ended up using Textract because they found it to be the most accurate overall.

vikp 890 days ago

Unfortunately, I haven't had time to benchmark against anything other than Tesseract.

kergonath 890 days ago

That’s my experience as well. I am still looking for alternatives, but Textract is now the baseline.

pryelluw 890 days ago

Thanks for sharing.
