|
|
|
|
|
by aliosm
676 days ago
|
|
I'm working on Arabic OCR for a massive collection of books and pages (over 13 million pages so far). I've tried multiple open-source models and projects, including Tesseract, Surya, and a Nougat small model fine-tuned for Arabic. However, none of them matched the latency and accuracy of Google OCR. As a result, I developed a Python package called tahweel (https://github.com/ieasybooks/tahweel), which leverages Google Cloud Platform's Service Accounts to run OCR and provides page-level output. With the default settings, it can process a page per second. Although it's not open-source, it outperforms the other solutions by a significant margin. For example, OCRing a PDF file using Surya on a machine with a 3060 GPU takes about the same amount of time as using the tool I mentioned, but it consumes more power and hardware resources while delivering worse results. This has been my experience with Arabic OCR specifically; I'm not sure if English OCR faces the same challenges. |
|