Ask HN: How Do I Get Started with OCR (Optical Character Recognition)?
8 points by muralimadhu 2499 days ago
I have no background in machine learning or computer vision. What I do have is a problem statement: I want to parse structured text out of financial documents like W-2s and pay stubs, for example the company name, salary, etc. from a W-2. Off-the-shelf solutions like AWS Textract don't work very well. So far I have only been treating OCR as a black box. If I were to build an OCR service myself for a specialized set of financial documents, what theory and tools would I need to learn, assuming I have a CS background but not an ML background? Thanks in advance.
2 comments

What is the reason that you want to roll your own?

Is it because you want to own the IP or for learning purposes?

I would continue searching for an off-the-shelf solution, particularly if you're able to throw a small amount of money at the problem. DocParser is probably what I'd try first.

How complicated it is to roll your own really depends on how much variation there is in your documents (in a variety of ways, from "What size, shape, and resolution is the image?" to "Is what you're trying to extract always in the same relative location?").

The data is not always in the same location, and the images/documents are user submitted, so no guarantees about resolution. I'll continue to look for off-the-shelf solutions. If I were to invest in doing this myself, what would you recommend? Are there any books/courses that'll help me with the foundations?
I strongly caution you against doing it yourself, but if you're determined, here are some things you'll need to deal with at a high level.

1) Image standardization: All user-submitted images should be the same resolution and the same size on at least one axis.

2) Pre-processing. Rotation, skew, and binarization, mainly. You could do this yourself or let an OCR API handle it for you (many do, both local and cloud).
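
To make the two steps above concrete, here is a minimal sketch in plain Python. It assumes the image has already been decoded into a flat list of 0-255 grayscale values; the function names and the 1700-pixel target width are illustrative choices, not something any particular OCR API prescribes. It covers only resizing math and binarization (via Otsu's method, one common choice); rotation and skew correction are not shown.

```python
def scaled_size(width, height, target_width=1700):
    """Return (w, h) with the width fixed at target_width, keeping aspect ratio.

    target_width=1700 is an assumed value (roughly a letter page at 200 DPI).
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    scale = target_width / width
    return target_width, max(1, round(height * scale))


def otsu_threshold(pixels):
    """Pick a global threshold with Otsu's method.

    pixels: iterable of ints in 0..255 (grayscale).
    Returns the threshold t that maximizes between-class variance, where
    values <= t are treated as one class and values > t as the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * count for i, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        if w_dark == 0:
            continue  # nothing in the dark class yet
        w_light = total - w_dark
        if w_light == 0:
            break  # everything is in the dark class; no split left to try
        sum_dark += t * hist[t]
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between_var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 0 (ink) or 255 (paper) around the threshold."""
    return [0 if p <= threshold else 255 for p in pixels]
```

For example, scaled_size(850, 1100) gives (1700, 2200), and a page of dark text on a light background should end up with the text at 0 and the paper at 255. A single global threshold struggles with uneven lighting in phone photos, which is one reason to let an OCR API do this step for you.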

Then there are basically two paths in a rudimentary solution. You could do full-page OCR, store the coordinates of extracted n-grams, and then inspect the area where you think or know the text you want to extract will be. Or, you could crop the image to a rectangle where you think or know the text will be, OCR the rectangle, and see if the text is there. The latter is computationally cheaper, but the rectangle isn't guaranteed to contain the text of interest (assuming it's there at all), and you won't easily be able to define things in relation to each other.
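
A rough sketch of the first path, assuming the OCR engine hands back each word with a bounding box. The Word type and its fields are hypothetical and do not match any specific engine's output format; you would adapt them to whatever your OCR library returns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One OCR'd word with its bounding box (hypothetical structure)."""
    text: str
    x: int  # left edge, in pixels
    y: int  # top edge, in pixels
    w: int  # box width
    h: int  # box height


def words_in_region(words, left, top, right, bottom):
    """Return the words whose box center falls inside the rectangle,
    sorted top-to-bottom, then left-to-right."""
    hits = []
    for word in words:
        cx = word.x + word.w / 2
        cy = word.y + word.h / 2
        if left <= cx <= right and top <= cy <= bottom:
            hits.append(word)
    return sorted(hits, key=lambda wd: (wd.y, wd.x))


# Full-page OCR result (made-up values), then a look at one zone of the page.
page = [
    Word("Wages", 100, 50, 60, 12),
    Word("52000.00", 100, 80, 80, 12),
    Word("Employer", 600, 400, 90, 12),
]
zone = words_in_region(page, left=0, top=0, right=300, bottom=100)
print([wd.text for wd in zone])
```

Because every word keeps its coordinates, you can also ask questions like "what is directly below this label?", which the crop-first path can't answer easily.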

The top commercial products are a lot more complicated than this, though. A "template" for extracting Box 1 from a W-2 would do something like this:

- Crop the document to just the top third and OCR it

- Match the string "1 Wages, tips, other compensation", which is itself a cascade of fallbacks, something like: "Try to find that exact string. No? Look for a 1 or something that resembles a 1. Is there an n-gram to the right of it beginning with the word Wages? No? OK, how about wages in lower case? No? OK, restrict the pattern match to just one line. Does that work? OK, good. Does something like the word 'tips' appear in the n-gram? OK, that's probably it."

- Underneath that line, offset x to the right, look for a string of digits matching this regex.
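
A toy version of the template steps above, using only the standard library. It assumes OCR has already produced lines as (text, x, y) tuples; that structure, the 0.8 similarity cutoff, the pixel offsets, and the amount regex are all illustrative assumptions. Commercial products use far more elaborate rules than a single fuzzy ratio.

```python
import re
from difflib import SequenceMatcher

# Assumed pattern for a dollar amount such as 52,340.18 or $7812.00.
AMOUNT = re.compile(r"^\$?\d[\d,]*(?:\.\d{2})?$")


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_label(lines, label, min_ratio=0.8):
    """Find the line that best matches label: exact match first, then fuzzy.

    lines: list of (text, x, y) tuples. Returns a tuple or None.
    """
    for line in lines:
        if line[0] == label:
            return line
    best = max(lines, key=lambda ln: similarity(ln[0], label), default=None)
    if best is not None and similarity(best[0], label) >= min_ratio:
        return best
    return None


def find_amount_below(lines, label_line, max_dx=150, max_dy=60):
    """Return the nearest amount-like line below the label and at most
    max_dx pixels to its right, or None if there is no such line."""
    _, lx, ly = label_line
    candidates = [
        ln for ln in lines
        if 0 < ln[2] - ly <= max_dy
        and 0 <= ln[1] - lx <= max_dx
        and AMOUNT.match(ln[0])
    ]
    return min(candidates, key=lambda ln: ln[2] - ly, default=None)


# Made-up OCR output; note the "1" misread as a lowercase "l".
ocr_lines = [
    ("l Wages, tips, other compensation", 40, 100),
    ("52,340.18", 60, 130),
    ("2 Federal income tax withheld", 400, 100),
    ("7,812.00", 420, 130),
]
label_line = find_label(ocr_lines, "1 Wages, tips, other compensation")
box1 = find_amount_below(ocr_lines, label_line)
print(box1)
```

The fuzzy match is what lets the misread "l" still count as the Box 1 label, and the offset limits keep the Box 2 amount from being picked up by mistake.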

The string matching is done probabilistically in branches based on the specified rules (like "there's this n-gram over here that seems like what you're looking for but you said it should be in this other location, so we'll mark that as a possibility but it's probably not what you're looking for") and with the assistance of a really powerful dictionary.
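
One very simplified way to picture that kind of scoring: give every candidate a text-similarity score and discount it by how far it sits from where the template expected it. The exponential falloff, the tolerance value, and the function names are my own assumptions for illustration, not how any particular product does it.

```python
import math
from difflib import SequenceMatcher


def score_candidate(text, pos, label, expected_pos, tolerance=100.0):
    """Combine text similarity with a location penalty.

    pos and expected_pos are (x, y) pixel coordinates. tolerance is an
    assumed distance scale: a candidate that far away keeps about 37%
    of its text score.
    """
    text_score = SequenceMatcher(None, text.lower(), label.lower()).ratio()
    distance = math.dist(pos, expected_pos)
    location_score = math.exp(-distance / tolerance)
    return text_score * location_score


def rank_candidates(candidates, label, expected_pos, tolerance=100.0):
    """Return (score, text, pos) tuples sorted from best to worst.

    Low-scoring candidates stay in the list as unlikely possibilities
    instead of being thrown away.
    """
    scored = [
        (score_candidate(text, pos, label, expected_pos, tolerance), text, pos)
        for text, pos in candidates
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)


# An exact match in the wrong place vs. a truncated match in the right place.
candidates = [
    ("1 Wages, tips, other compensation", (900, 900)),
    ("1 Wages, tips, other comp.", (45, 105)),
]
ranked = rank_candidates(
    candidates, "1 Wages, tips, other compensation", expected_pos=(40, 100)
)
for score, text, pos in ranked:
    print(f"{score:.3f}  {text!r} at {pos}")
```

Here the truncated string near the expected location outranks the perfect string far away from it, which is the "possible, but probably not what you're looking for" behavior described above.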

If you don't want to totally roll your own and you happen to write Java, check out OpenKM, which I believe has some of the necessary abstractions built in for zonal OCR.

Thanks for the detailed response