by voiper1 1961 days ago
I looked into OCR a while ago for some hundreds of thousands of pages of PDF. All hosted offerings would end up costing quite a bit.

After looking at the options and running a few tests, I figured I'd use OCRmyPDF (https://github.com/jbarlow83/OCRmyPDF). It converts the PDF to an image for Tesseract and then recreates the PDF with the text copy-able.
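For anyone wanting to try the same thing on a folder of PDFs, here is a minimal sketch that drives the `ocrmypdf` command line from Python. It assumes `ocrmypdf` and Tesseract are installed and on the PATH; the function names and the directory layout are made up for illustration, and this is not necessarily how the project above was run.

```python
import subprocess
from pathlib import Path


def build_ocrmypdf_command(src, dst, language="eng", skip_text=True):
    """Build the argument list for one ocrmypdf run (hypothetical helper)."""
    cmd = ["ocrmypdf", "-l", language]
    if skip_text:
        # --skip-text leaves pages that already contain text untouched
        cmd.append("--skip-text")
    cmd += [str(src), str(dst)]
    return cmd


def ocr_directory(in_dir, out_dir, dry_run=True):
    """OCR every PDF in in_dir into out_dir, skipping files already done.

    With dry_run=True the commands are only collected, not executed.
    """
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    commands = []
    for src in sorted(in_dir.glob("*.pdf")):
        dst = out_dir / src.name
        if dst.exists():
            continue  # makes the batch resumable after an interruption
        cmd = build_ocrmypdf_command(src, dst)
        commands.append(cmd)
        if not dry_run:
            out_dir.mkdir(parents=True, exist_ok=True)
            subprocess.run(cmd, check=True)
    return commands
```

Calling `ocr_directory("scans", "searchable", dry_run=False)` would process the folder one file at a time; skipping existing outputs matters when a run over hundreds of thousands of pages gets interrupted.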

It won't identify the address part of a driver's license, but that wasn't necessary for this project.

2 comments

A few years back I worked with someone to build an Android OCR app.

At the time there were not many apps out there, and we partnered with a third-party service that did the OCR outside the app, so our conversion quality was (at the time) close to state of the art for a mobile app, once people got comfortable with this method (which of course not everyone did).

We made some decent money as a side project from it but I also started to appreciate the sheer complexity of OCR.

We spent a lot of time fine-tuning the pre-processing before hitting the OCR engine (e.g. orientation, shading); small changes here made a huge impact on performance. We also built various prompts to guide the user on how to take the photo. Managing expectations was something we were very conscious of, and it was tough.
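The comment doesn't say which pre-processing steps the app actually used beyond orientation and shading. Purely as an illustration of that kind of step, here is a small pure-Python sketch of Otsu's global threshold, a common way to binarize a grayscale photo before OCR. It assumes the image is already a flat list of 0-255 grayscale values, and the function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0..255) that maximizes between-class variance.

    Pixels <= t are treated as one class and pixels > t as the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of their intensities
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Proportional to the between-class variance
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(pixels):
    """Map each pixel to 0 or 255 using the Otsu threshold."""
    t = otsu_threshold(pixels)
    return [255 if p > t else 0 for p in pixels]
```

A single global threshold tends to struggle with uneven lighting across a phone photo, which may be part of why shading needed so much tuning; locally adaptive thresholds are the usual next step.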

The unexpected (but rewarding) use case was when we found that blind people had started to use the app to help with their daily lives. There were only a few, but it was making a real impact for them, so we prioritized a few features for this segment, knowing we were drifting away from maximizing revenue. We were cool with this, as it was not a primary income source.

In the end we all moved on to other things, more apps and services came on the market, and Google Lens became a thing, so we decided to sunset the product and did our best to manage customers through this process.

A rewarding experience overall: lots of lessons were learned that I have used elsewhere in my life since, and I ticked 'Build an app that made thousands of $' off my bucket list (which, yeah, I should probably review!).

Interesting!

I've been thinking of running OCR on video frames. I'd also like to do speech-to-text extraction for searching my archives later (I have about 4TB of video to trawl through, and want text-based search). It's an interesting space to explore, but everything's been moving to web services with cost-prohibitive pricing.

Should be able to use ffmpeg[0] to extract a single frame each second/keyframe (doubtful it's worth doing every single frame) and then pass it to tesseract.
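A rough sketch of that pipeline, assuming the `ffmpeg` and `tesseract` binaries are on the PATH. It uses ffmpeg's `fps` filter to sample one frame per interval (not the keyframe variant) and asks tesseract to print its text to stdout. The function names and the `frame_%06d.png` naming are just choices made for this example.

```python
import subprocess
from pathlib import Path


def build_frame_extract_command(video, out_dir, seconds_per_frame=1.0):
    """ffmpeg command that writes one PNG per `seconds_per_frame` of video."""
    fps = 1.0 / seconds_per_frame
    pattern = str(Path(out_dir) / "frame_%06d.png")
    return ["ffmpeg", "-i", str(video), "-vf", f"fps={fps:g}", pattern]


def build_tesseract_command(image, language="eng"):
    """tesseract command; the output base 'stdout' prints the recognized text."""
    return ["tesseract", str(image), "stdout", "-l", language]


def ocr_video(video, out_dir, seconds_per_frame=1.0):
    """Extract frames, OCR each one, and return (approx_seconds, text) pairs."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        build_frame_extract_command(video, out_dir, seconds_per_frame),
        check=True,
    )
    results = []
    for index, frame in enumerate(sorted(out_dir.glob("frame_*.png"))):
        done = subprocess.run(
            build_tesseract_command(frame),
            check=True, capture_output=True, text=True,
        )
        # The timestamp is approximate: frame N is roughly N intervals in.
        results.append((index * seconds_per_frame, done.stdout.strip()))
    return results
```

Keeping the approximate timestamp next to each piece of text is what makes the output useful for search later, since a hit can point back to a position in the video.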

For speech-to-text: if it's English, try Mozilla's DeepSpeech? https://github.com/mozilla/DeepSpeech
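One practical detail: DeepSpeech's released English models expect 16 kHz, mono, 16-bit PCM audio, so the audio track usually has to be pulled out of the video and resampled first. A minimal sketch of that step with ffmpeg (assumed to be on the PATH; the function names are hypothetical, and the transcription call itself is left out):

```python
import subprocess


def build_audio_extract_command(video, wav_path, sample_rate=16000):
    """ffmpeg command that writes the audio track as mono 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-i", str(video),
        "-vn",                   # drop the video stream
        "-ac", "1",              # downmix to mono
        "-ar", str(sample_rate), # resample, 16 kHz by default
        "-c:a", "pcm_s16le",     # 16-bit little-endian PCM
        str(wav_path),
    ]


def extract_audio(video, wav_path):
    """Run the extraction; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_audio_extract_command(video, wav_path), check=True)
```

The resulting WAV file can then be handed to whichever speech-to-text engine ends up being used.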

Might be fun to try.

[0] https://stackoverflow.com/questions/27568254/how-to-extract-...

Yup, was planning to use ffmpeg (or, more likely, OpenCV), and a subset of the frames.

Thanks so much for the tip on DeepSpeech!

@Darkphibre: we are happy to provide you with an AI that takes in a video and outputs OCR and speech-to-text. With Base64.ai, you don't have to worry about the implementation details and can focus on your projects. Let's have a meeting to discuss more? https://base64.ai/meeting

For speech-to-text extraction you can try Silero [1].

Free software (AGPL-3.0 License), fast, highly accurate and extremely simple to deploy (I have no affiliation with them).

[1] https://github.com/snakers4/silero-models

Thanks for the heads up! Will definitely check it out.

If you're looking to index/process video, maybe we can help. Check out Vidrovr (https://vidrovr.com)

Full disclosure: I'm one of the founders.