by perturbation
2803 days ago
It would be nice to benchmark the text extraction against a baseline method, say Apache Tika (https://tika.apache.org/). I would expect the deep learning approach to outperform traditional approaches on accuracy, but it would be good to see accuracy vs. CPU and memory usage, etc.
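For what it's worth, a comparison like that could be sketched roughly as below. This is only a minimal harness under my own assumptions: `extract` is a hypothetical callable wrapping whichever extractor is being tested, accuracy is measured as character error rate (edit distance divided by reference length), and `tracemalloc` only sees Python-level allocations, so memory used by an external process or native library would need a different measurement.

```python
import time
import tracemalloc


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def char_error_rate(predicted: str, reference: str) -> float:
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        return 0.0 if not predicted else 1.0
    return edit_distance(predicted, reference) / len(reference)


def benchmark(extract, samples):
    """Run `extract` (hypothetical: document -> text) over (document, reference) pairs.

    Returns mean character error rate, wall-clock seconds, and peak
    Python-level memory in bytes as reported by tracemalloc.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    errors = []
    tracemalloc.start()
    start = time.perf_counter()
    for document, reference in samples:
        errors.append(char_error_rate(extract(document), reference))
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "mean_cer": sum(errors) / len(errors),
        "seconds": elapsed,
        "peak_bytes": peak,
    }
```

Running the same `samples` through each extractor and comparing the returned dictionaries would give the accuracy vs. resource picture, at least to a first approximation.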
A better comparison would be against Tesseract or ABBYY FineReader.
EDIT: I wasn't aware that Tika now embeds Tesseract.[1] Still, it's a simple wrapper, so the real comparison is against Tesseract.
[1] https://wiki.apache.org/tika/TikaOCR