by gingerlime
2757 days ago
I have a personal flow using tesseract to scan docs into searchable PDFs, but it’s not that accurate. One of the main problems is that some (now most?) of the documents are in German, since I live in Germany, but some are in English. There’s a way to choose the language, but nothing to auto-detect it as far as I’m aware. I was hoping for some cloud AI service with superior OCR and simple integration or a CLI (push a PDF and download one with OCR embedded). Google seems to be too complicated, unfortunately... Any tips?
Edit: Hell, even average word length is probably going to be a good indicator since German is so agglutinative. Collect some factors like this and I think you'll be able to build a pretty good classifier.
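
For what it's worth, here is a rough sketch of that idea in Python. It combines two factors: hits against a few very common function words, and average word length as a tie-breaker. The word lists, the 5.5 threshold, and the names `features` and `guess_language` are all my own assumptions for illustration and are not tuned on real data. The return values are chosen to match tesseract's `deu` and `eng` language codes, so the result could be passed to `-l` after a quick first OCR pass.

```python
import re

# Hypothetical hint lists: a handful of very common function words.
# These are illustrative only and not tuned on real documents.
GERMAN_HINTS = {"der", "die", "das", "und", "ist", "nicht",
                "ein", "eine", "mit", "für", "sie", "ich"}
ENGLISH_HINTS = {"the", "and", "is", "not", "a", "an",
                 "with", "for", "of", "to", "you", "this"}

# Runs of Unicode letters (no digits, no underscores).
WORD_RE = re.compile(r"[^\W\d_]+")


def features(text):
    """Collect a few cheap signals from a block of OCR text."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    if not words:
        return {"avg_len": 0.0, "de_hits": 0, "en_hits": 0}
    return {
        "avg_len": sum(len(w) for w in words) / len(words),
        "de_hits": sum(w in GERMAN_HINTS for w in words),
        "en_hits": sum(w in ENGLISH_HINTS for w in words),
    }


def guess_language(text, avg_len_threshold=5.5):
    """Return 'deu' or 'eng'. The threshold is an assumed value."""
    f = features(text)
    # Function-word hits decide when they disagree.
    if f["de_hits"] != f["en_hits"]:
        return "deu" if f["de_hits"] > f["en_hits"] else "eng"
    # Otherwise fall back to average word length.
    return "deu" if f["avg_len"] > avg_len_threshold else "eng"


if __name__ == "__main__":
    print(guess_language("Die Krankenversicherung ist nicht mit der Rechnung einverstanden"))
    print(guess_language("The insurance company did not agree with the invoice"))
```

A real classifier would want more factors and some labelled samples to set the threshold, but even something this crude might be enough to pick the language per document.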