Ask HN: How Do I Get Started with OCR (Optical Character Recognition)?

I strongly caution you against doing it yourself, but if you're determined, here are some things you'll need to deal with at a high level.

1) Image standardization: All user-submitted images should be the same resolution and the same size on at least one axis.
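As a rough sketch of what that normalization looks like, here is the arithmetic for scaling every image to a fixed width while keeping the aspect ratio. The function name `standard_size` and the 1700 px target (8.5 inches at 200 DPI) are my own assumptions for illustration; pick whatever suits your documents, and do the actual resampling with the image library of your choice.

```python
def standard_size(width, height, target_width=1700):
    """Return (new_width, new_height) so that every image ends up
    exactly target_width pixels wide with its aspect ratio preserved.

    target_width=1700 is an assumed default (8.5 in at 200 DPI).
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    scale = target_width / width
    return target_width, round(height * scale)


# A 100 DPI scan and a 400 DPI scan of the same letter-size page
# both come out at the same pixel dimensions.
low_res = standard_size(850, 1100)
high_res = standard_size(3400, 4400)
```

Once everything shares one axis length, pixel coordinates from one document become comparable to coordinates from another, which the zone-based approaches further down rely on.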

2) Pre-processing: mainly rotation, skew correction, and binarization. You could do this yourself or let an OCR API handle it for you (many do, both local and cloud).
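If you do want to try binarization yourself, one common choice is Otsu's method, which picks the threshold that best separates dark pixels from light ones. To be clear, that's my pick for illustration, not something any particular OCR product is known to use. This sketch assumes the image has already been converted to a flat list of 8-bit grayscale values; `otsu_threshold` and `binarize` are hypothetical helper names.

```python
def otsu_threshold(pixels):
    """Return the 0-255 threshold maximizing between-class variance
    (Otsu's method) for a flat list of 8-bit grayscale values."""
    if not pixels:
        raise ValueError("no pixels given")
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(i * count for i, count in enumerate(hist))
    weight_bg = 0      # number of pixels at or below the candidate threshold
    sum_bg = 0         # sum of their intensities
    best_t, best_var = 0, -1.0

    for t in range(256):
        weight_bg += hist[t]
        sum_bg += t * hist[t]
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        between_var = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 0 (dark, likely ink) or 1 (light, likely paper)."""
    return [1 if p > threshold else 0 for p in pixels]
```

A single global threshold like this tends to struggle with uneven lighting (phone photos, shadows), which is one more reason to let an OCR engine's own pre-processing do the work if it offers it.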

Then there are basically two paths in a rudimentary solution. You could do full-page OCR, store the coordinates of extracted n-grams, and then inspect the area where you think or know the text you want to extract will be. Or, you could crop the image to a rectangle where you think or know the text will be, OCR the rectangle, and see if the text is there. The latter is computationally cheaper, but the rectangle isn't guaranteed to contain the text of interest (assuming it's there at all), and you won't easily be able to define things in relation to each other.
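To make the first path concrete, here is a minimal sketch of the "inspect the area" step, assuming your OCR engine has already given you words with bounding boxes. The `Word` and `Zone` types and the `words_in_zone` function are made up for this example; real engines return similar data in their own formats.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: int   # left edge, in pixels
    y: int   # top edge, in pixels
    w: int   # width
    h: int   # height


class Zone(NamedTuple):
    x: int
    y: int
    w: int
    h: int


def center_in_zone(word, zone):
    """True if the center of the word's box falls inside the zone."""
    cx = word.x + word.w / 2
    cy = word.y + word.h / 2
    return (zone.x <= cx <= zone.x + zone.w
            and zone.y <= cy <= zone.y + zone.h)


def words_in_zone(words, zone):
    """Return the words inside the zone in reading order (top, then left)."""
    hits = [w for w in words if center_in_zone(w, zone)]
    return sorted(hits, key=lambda w: (w.y, w.x))
```

Because the full-page result keeps every word's coordinates, you can widen or move the zone and query again without re-running OCR, which is what you give up with the crop-first path.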

The top commercial products are a lot more complicated than this, though. A "template" for extracting Box 1 from a W-2 would do something like this:

- Crop the document to just the top third and OCR it

- Match the string "1 Wages, tips, other compensation", which is itself a procedure, something like "Try to find that exact string. No? Look for a 1 or something that resembles a 1. Is there an n-gram to the right of it beginning with the word Wages? No? OK, how about wages in lower case. No? OK, restrict the pattern match to just 1 line. Does that work? OK, good. Does something like the word 'tips' appear in the n-gram? OK, that's probably it."

- Underneath that line, offset by some distance x to the right, look for a string of digits matching a given regex.
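A toy version of those template steps might look like the following. Everything here is an assumption for illustration: the `Line` type, the pixel offsets, the set of characters treated as "resembles a 1", and the amount regex. The label is the Box 1 wording printed on the form. Commercial products do far more than this.

```python
import re
from typing import NamedTuple


class Line(NamedTuple):
    text: str
    x: int   # left edge of the line, in pixels
    y: int   # top edge of the line, in pixels


LABEL = "1 Wages, tips, other compensation"
ONE_LIKE = r"[1lI|]"                       # characters OCR confuses with "1"
AMOUNT = re.compile(r"\d[\d,]*\.\d{2}")    # e.g. 48,500.00 or 48500.00


def find_label(lines):
    """Try the exact label first, then progressively looser patterns."""
    for line in lines:
        if LABEL in line.text:
            return line
    # Looser: something like a 1, then "Wages" (case-sensitive first,
    # then any case), and "tips" somewhere on the same line.
    for flags in (0, re.IGNORECASE):
        pattern = re.compile(ONE_LIKE + r"\s+Wages\b", flags)
        for line in lines:
            if pattern.search(line.text) and "tips" in line.text.lower():
                return line
    return None


def find_amount_below(lines, label, max_dx=150, max_dy=80):
    """Look below the label, offset to the right, for an amount."""
    candidates = [
        l for l in lines
        if 0 < l.y - label.y <= max_dy and 0 <= l.x - label.x <= max_dx
    ]
    for line in sorted(candidates, key=lambda l: l.y):
        match = AMOUNT.search(line.text)
        if match:
            return match.group()
    return None


ocr_lines = [
    Line("l wages, tips, other compensation", 100, 50),   # garbled label
    Line("48,500.00", 120, 80),
    Line("2 Federal income tax withheld", 400, 50),
    Line("6,200.00", 420, 80),
]
label = find_label(ocr_lines)
box1 = find_amount_below(ocr_lines, label) if label else None
```

Note that the Box 2 amount sits at the same height as the Box 1 amount, so it's the horizontal limit that keeps it out. That is the kind of relative positioning the crop-first approach makes hard.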

The string matching is done probabilistically in branches based on the specified rules (like "there's this n-gram over here that seems like what you're looking for but you said it should be in this other location, so we'll mark that as a possibility but it's probably not what you're looking for") and with the assistance of a really powerful dictionary.
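I don't know how any vendor actually weights these things, but the flavor of it can be sketched as a score that blends text similarity with distance from the expected location, so an off-location match stays on the list as a possibility without winning. The 50/50 weights, the `max_dist` cutoff, and the `Candidate` type are all invented for this example, and the dictionary part is left out entirely.

```python
import math
from difflib import SequenceMatcher
from typing import NamedTuple


class Candidate(NamedTuple):
    text: str
    x: int
    y: int


def score(candidate, expected_text, expected_xy, max_dist=300.0):
    """Blend text similarity and location closeness into a 0..1 score."""
    text_sim = SequenceMatcher(
        None, candidate.text.lower(), expected_text.lower()
    ).ratio()
    dist = math.hypot(candidate.x - expected_xy[0],
                      candidate.y - expected_xy[1])
    location = max(0.0, 1.0 - dist / max_dist)
    return 0.5 * text_sim + 0.5 * location   # assumed equal weighting


def rank(candidates, expected_text, expected_xy):
    """Return (score, candidate) pairs, best first."""
    scored = [(score(c, expected_text, expected_xy), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


expected = "1 Wages, tips, other compensation"
candidates = [
    # Perfect text, but nowhere near where the template says it should be.
    Candidate("1 Wages, tips, other compensation", 900, 900),
    # Slightly garbled text, right where it was expected.
    Candidate("l wages, tips, other compensaton", 105, 52),
]
ranked = rank(candidates, expected, expected_xy=(100, 50))
```

Here the garbled-but-well-placed candidate ranks first, while the exact-but-misplaced one is kept with a lower score instead of being thrown away.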

If you don't want to totally roll your own and you happen to write Java, check out OpenKM, which I believe has some of the necessary abstractions built in for zonal OCR.