HN Mirror


by ishi 3106 days ago

A few years ago I was involved with a startup that built a document management system for consumers, and we actually got pretty good results with OCR + automatic tagging based on a very simple database that maps keywords to tags.

Let's say you want to auto-tag bills and other documents from your ISP. You add the ISP's name, phone number, website address, and so on to the database: any uniquely identifying keywords that typically appear on the documents they send. Now any document that contains one of these keywords gets tagged as "ISP", making it very easy to find in the future.

Even if the OCR quality isn't perfect, at least one of these keywords will most likely get matched.
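To make the idea concrete, here is a minimal Python sketch of that kind of keyword-to-tag lookup. It is not the startup's actual code: the `KEYWORD_DB` contents, the `auto_tag` and `normalize` names, and the choice of plain case-insensitive substring matching are all assumptions for illustration. It shows why imperfect OCR is tolerable: a tag is applied as soon as any one of its keywords survives.

```python
import re

# Hypothetical keyword database: each tag maps to strings that
# typically appear on documents belonging to that tag.
KEYWORD_DB = {
    "ISP": ["Example Telecom", "555-0100", "example-telecom.example"],
    "Susan": ["Susan"],
}


def normalize(text):
    """Lowercase and collapse whitespace so OCR line breaks don't block a match."""
    return re.sub(r"\s+", " ", text.lower())


def auto_tag(ocr_text, keyword_db=KEYWORD_DB):
    """Return the sorted tags that have at least one keyword in the text."""
    haystack = normalize(ocr_text)
    tags = set()
    for tag, keywords in keyword_db.items():
        # One surviving keyword is enough, so a misread name can be
        # rescued by a correctly read phone number or URL.
        if any(normalize(keyword) in haystack for keyword in keywords):
            tags.add(tag)
    return sorted(tags)


# The OCR misread "Example" as "Examp1e", but the phone number still matches.
scanned = "Examp1e Telecom\nInvoice for Susan Smith\nQuestions? Call 555-0100"
print(auto_tag(scanned))  # ['ISP', 'Susan']
```

Substring matching is the simplest thing that works here, but it also means "Susan" would match "Susanna"; a real system might want word-boundary matching for short keywords.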

Another example: you could add the names of your family members as keywords, making it easy to find all documents related to Jenny or Susan.

You could argue that full-text search would achieve the same result, but uploading documents into the system and having them auto-tagged as "ISP", "car-payments", "Walmart", "Susan" and so on feels a little bit like magic, as if the system is actively helping you organize your papers.

The keyword approach is also very easy to understand and tweak, unlike more rigorous but opaque methods such as clustering documents on tf-idf vectors.

2 comments

myaso 3105 days ago

Out of curiosity, what is the state of the art today for extracting text or other data from scanned documents (forms, legal docs, receipts, etc.)?


matt_the_bass 3105 days ago

I don't have an exact answer but can tell you that Expensify still resorts to human parsing sometimes. How often "sometimes" is, I have no idea. I would guess a lot.


Spearchucker 3106 days ago

Everything you say is true, and the value, I think, is clear. The part I don't like is that I have to create the database manually. Granted, the results will save me time, as I won't have to manually tag routine documents.

Food for thought.
