by Zuiii 744 days ago
Tesseract's true value is that it's one apt-get command away (i.e. open source). Does Debian host any more modern OCR systems in its repos?
2 comments

Tesseract the tool is one apt-get away, but the trained models are not, and I've found that the stock models are a starting point, not a final destination. You still have to do more training on top of them for anything that isn't black text on a crisp white background.
Big mistake on my part; I should clarify that I fine-tuned both PaddleOCR and TrOCR on large amounts of data specific to my domain. I can't speak to the best out-of-the-box “ready to go” solutions (besides cloud ones, which were quite good with the right pre- and post-processing).