by jbondeson 3605 days ago
Worked on this problem exactly 2-3 years ago (developed automated document processing in the accounts receivable and accounts payable sector for a decade plus). It's a fun iceberg problem that looks simple on the surface but tends to have some real thorns the deeper down you go.

Document identification like this is unfortunately the "easy" part (and it's not particularly easy to do in real time). The next two steps involve 3D de-deformation, since unlike with a flatbed scanner you cannot assume the paper is actually completely flat -- imagine a previously folded page, etc.

I love this stuff as it is at a crossroads of a half dozen different disciplines. Lots of money to be had if this can be done in a really robust manner.

Edit:

A couple examples of why this gets really hairy really fast:

* You'll notice that all the documents are shown on a high contrast background (dark wood grain) without a lot of stark lighting. One of your first steps in edge detection and line identification is image segmentation to remove background from foreground and then start removing noise. If you have a white piece of paper on a white table, or a large lighting contrast (say from an open window casting daylight on half the page) it really wreaks havoc with the algorithms.

* Imagine you're trying to recognize a page from a textbook in the middle of the book. The way the page lies, you end up with non-rectangular pages (they curve due to the spine), which kills the Hough line transformation (there are also Hough circle algorithms, but you get the point) and the rectangle selection.
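
To illustrate the second point, here is a minimal sketch of a Hough line accumulator in plain NumPy. It is not the pipeline from the post; the function names, the 1-degree angle step, and the synthetic 40x40 edge maps are all assumptions made for the example. A straight edge concentrates its votes in one (rho, theta) bin, while a curved edge of the same length spreads them across many bins, so its strongest peak is weaker.

```python
import numpy as np

def hough_accumulator(edges, n_theta=180):
    """Vote for lines rho = x*cos(theta) + y*sin(theta) through each edge pixel."""
    edges = np.asarray(edges, dtype=bool)
    h, w = edges.shape
    diag = int(np.ceil(np.hypot(h, w)))  # largest possible |rho|
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    ys, xs = np.nonzero(edges)
    # One rho per (edge pixel, theta) pair, rounded to the nearest integer bin.
    rhos = xs[:, None] * np.cos(thetas) + ys[:, None] * np.sin(thetas)
    rho_idx = np.rint(rhos).astype(int) + diag  # shift so indices are >= 0
    theta_idx = np.broadcast_to(np.arange(n_theta), rho_idx.shape)
    acc = np.zeros((2 * diag + 1, n_theta), dtype=int)
    np.add.at(acc, (rho_idx, theta_idx), 1)  # unbuffered add handles repeats
    return acc, thetas, diag

def strongest_line(edges):
    """Return (votes, rho, theta in degrees) of the highest accumulator peak."""
    acc, thetas, diag = hough_accumulator(edges)
    r, t = np.unravel_index(acc.argmax(), acc.shape)
    return int(acc[r, t]), int(r - diag), float(np.degrees(thetas[t]))

# Synthetic edge maps (assumed data): a flat page edge and a spine-curved one.
straight = np.zeros((40, 40), dtype=bool)
straight[10, 5:35] = True

curved = np.zeros((40, 40), dtype=bool)
cols = np.arange(5, 35)
curved[np.rint(10 + 0.02 * (cols - 20) ** 2).astype(int), cols] = True

print("straight:", strongest_line(straight))
print("curved:  ", strongest_line(curved))
```

With a weaker, smeared peak, any fixed vote threshold tuned for flat pages starts missing the curved edge, and the rectangle selection downstream has nothing to work with.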


I remember this SO question from the high-contrast background point you brought up -- http://stackoverflow.com/questions/36982736/how-to-crop-bigg...
Thanks for sharing this, really helpful!
Since I am working on a similar problem at the moment myself, it'd be great if you could share some insights on fixing the 3D deformation -- I imagine fitting a polygon followed by a warp transformation could be an "ideal" process?

For the contrast problem you mention there, I found (in the few samples I tested with) that adaptive thresholding seems to be sufficiently good [0].

[0] I am using ``skimage.filters.threshold_adaptive`` for this.

On the contrast topic: adaptive thresholding can be very helpful (I believe Bradley Local Thresholding was one I had particular success with); however, most of these algorithms work in a grayscale domain, which means they depend on which color->grayscale transformation is used[1]. I spent a long time researching full color algorithms but never got to a truly successful end result with them. And even if you get a good image with huge contrast, you will still end up with the actual light/dark transition looking like an edge.
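
For anyone who hasn't seen it, here is a rough sketch of Bradley-style local thresholding using an integral image, in plain NumPy. The function name, the 15-pixel window, and the 15% margin are assumptions picked for illustration, not values from any particular implementation. Each pixel is marked dark if it falls more than a fraction `t` below the mean of its surrounding window.

```python
import numpy as np

def bradley_threshold(gray, window=15, t=0.15):
    """Return a boolean mask that is True where a pixel is darker than
    (1 - t) times the mean of the window centred on it."""
    gray = np.asarray(gray, dtype=np.float64)
    h, w = gray.shape
    # Integral image with a zero row and column prepended, so any window sum
    # is four lookups with no special cases at the top/left border.
    integral = np.zeros((h + 1, w + 1))
    integral[1:, 1:] = gray.cumsum(axis=0).cumsum(axis=1)

    half = window // 2
    ys, xs = np.arange(h), np.arange(w)
    # Window bounds, clipped at the image border (windows shrink near edges).
    y0 = np.clip(ys - half, 0, h)[:, None]
    y1 = np.clip(ys + half + 1, 0, h)[:, None]
    x0 = np.clip(xs - half, 0, w)[None, :]
    x1 = np.clip(xs + half + 1, 0, w)[None, :]

    area = (y1 - y0) * (x1 - x0)
    window_sum = (integral[y1, x1] - integral[y0, x1]
                  - integral[y1, x0] + integral[y0, x0])
    return gray < (window_sum / area) * (1.0 - t)

# Synthetic page (assumed data): bright left half, shadowed right half,
# with one "ink" pixel in each half at 50% of the local brightness.
page = np.full((40, 80), 200.0)
page[:, 40:] = 80.0
page[10, 10] = 100.0
page[10, 70] = 40.0

mask = bradley_threshold(page)
print("ink found in bright half:", mask[10, 10])
print("ink found in shadow half:", mask[10, 70])
print("shadow boundary flagged: ", mask[20, 41])
```

The last line is the catch mentioned above: the ink is recovered on both sides of the shadow, but pixels just on the dark side of the lighting boundary are also flagged, so the transition itself still looks like an edge.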

On 3D deformation, you're officially in academic research land. Nearly all algorithms require you to have a solid guess as to what the aspect ratio of the target object is. Other algorithms use heuristics based upon what you expect to find on a page. One particularly fun algorithm used the baseline of text (I believe for that paper it was Arabic) and fit a high-order curve to it, which was then reversed. Unfortunately I haven't seen a truly generic approach that doesn't require an implementation-specific input.

[1] Frankly my feeling is that RGB to grayscale is a mistake and holding back many of these algorithms
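
A small sketch of what can go wrong, assuming the common Rec. 601 luma weights for the grayscale conversion (the helper names and the two-colour test image are made up for the example). A boundary between pure red and a mid gray is obvious in colour but nearly vanishes after conversion, because both sides have almost the same luma.

```python
import numpy as np

# Rec. 601 luma weights, one common RGB -> grayscale choice.
LUMA = np.array([0.299, 0.587, 0.114])

def edge_strength_gray(rgb):
    """Absolute horizontal difference after converting to grayscale."""
    gray = np.asarray(rgb, dtype=np.float64) @ LUMA
    return np.abs(np.diff(gray, axis=1))

def edge_strength_color(rgb):
    """Largest absolute horizontal difference over the three channels."""
    rgb = np.asarray(rgb, dtype=np.float64)
    return np.abs(np.diff(rgb, axis=1)).max(axis=2)

# Assumed test image: left half pure red, right half gray of similar luma.
img = np.zeros((4, 8, 3))
img[:, :4] = (255, 0, 0)
img[:, 4:] = (76, 76, 76)

print("strongest edge, grayscale: ", edge_strength_gray(img).max())
print("strongest edge, per-channel:", edge_strength_color(img).max())
```

A per-channel maximum is just the simplest colour-aware measure for the demo, not a recommendation; the point is only that the grayscale step can discard an edge before any thresholding algorithm gets to see it.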

Yep, we turned RGB into LUV space before extracting edges, which helps a lot on contrast and keeps essential edge information that could've been lost if converted to grayscale.

Agreed that 3D deformation is a difficult open problem, and we haven't gotten into that yet. Currently we assume the document is a flat rectangle, which maps to a quadrilateral in image space. A homography is then applied to rectify it, and it seems to work quite well even when the paper is slightly curved or folded.
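
For readers following along, the quadrilateral-to-rectangle step can be sketched as below. This is a generic four-point homography solve in NumPy, not our actual code; the corner coordinates and the 850x1100 output size (i.e. an assumed aspect ratio) are invented for the example.

```python
import numpy as np

def homography_from_points(src, dst):
    """Solve for the 3x3 homography H (with H[2, 2] = 1) that maps four
    source points onto four destination points."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        # From u = (h0*x + h1*y + h2) / (h6*x + h7*y + 1), and likewise for v.
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.extend([u, v])
    h = np.linalg.solve(np.array(rows, dtype=np.float64),
                        np.array(rhs, dtype=np.float64))
    return np.append(h, 1.0).reshape(3, 3)

def apply_homography(H, points):
    """Map an (N, 2) array of points through H, dividing out the scale."""
    pts = np.asarray(points, dtype=np.float64)
    mapped = np.column_stack([pts, np.ones(len(pts))]) @ H.T
    return mapped[:, :2] / mapped[:, 2:3]

# Detected page corners in the photo (assumed values), in the order
# top-left, top-right, bottom-right, bottom-left.
quad = [(120, 80), (520, 110), (560, 620), (90, 580)]
# Target rectangle; the 850x1100 size is an assumed page aspect ratio.
page = [(0, 0), (850, 0), (850, 1100), (0, 1100)]

H = homography_from_points(quad, page)
print(np.round(apply_homography(H, quad), 6))
```

To rectify an actual image you would then resample every output pixel through the inverse of `H`. Note that the output size has to be chosen up front, which is the aspect-ratio guess mentioned earlier in the thread.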

Excellent. It's a little funny how, when you start on problems like these, you end up becoming an expert in fields you never thought you'd have to play in, like color spaces, color perception theory, etc.

Great work, and I look forward to seeing future posts on the solutions you've been able to come up with!

Yeah, I got a serious education doing this for mail items. And I had it easier as I was able to control the background and lighting and camera and everything.

Well, I couldn't control the autofocus very well. Going from a $500 DSLR to a $1200 DSLR made HUGE gains since it'd have far, far more autofocus points.

I was really interested in the text output of the OCR that I later did (which was a treat in itself since mail has so many different fonts, even on the same item!). I learned a lot about a lot of things too.

I have found colorspace transformation to be an important factor as well. My current problem does not require fixing 3D deformation, but I find it a really interesting thing that I'd like to work on in the future.

Thanks for this additional information, much appreciated!

For the 3D deformation take a look at this part of OpenCV:

http://docs.opencv.org/2.4/doc/tutorials/features2d/feature_...