by vikp
807 days ago
This looks great! You might be interested in surya - https://github.com/VikParuchuri/surya (I'm the author). It does OCR (much more accurate than tesseract), layout analysis, and text detection. The OCR is slow on CPU (working on it), but on GPU it is faster than tesseract, which only runs on CPU. You could probably replace pymupdf, tesseract, and some layout heuristics with this. Happy to discuss more, feel free to email me (address in my profile).