| A few years ago I was involved with a startup that built a document management system for consumers, and we got pretty good results from OCR plus automatic tagging based on a very simple database that maps keywords to tags. Let's say you want to auto-tag bills and other documents from your ISP. You add the ISP's name, phone number, website address, etc. to the database - any uniquely identifying keywords that typically appear on the documents they send. Now any document that contains one of these keywords gets tagged as "ISP", making it very easy to find in the future. Even if the OCR quality isn't perfect, at least one of the keywords will most likely still match. Another example - you could add the names of your family members as keywords, making it easy to find all documents related to Jenny or Susan. You could argue that full-text search would achieve the same result, but uploading documents into the system and having them auto-tagged as "ISP", "car-payments", "Walmart", "Susan" and so on feels a little bit like magic, as if the system is actively helping you organize your papers. The keyword approach is also very easy to understand and tweak, unlike more rigorous but opaque methods such as document clustering over tf-idf vectors. |
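A minimal sketch of the keyword-to-tag lookup described above, in Python. It is an illustration only, not the startup's actual code: the table contents and the names `KEYWORD_TAGS`, `normalize`, and `auto_tag` are all hypothetical, and it assumes the OCR step has already produced plain text.

```python
import re

# Hypothetical keyword-to-tag table; every entry is invented for illustration.
# Several keywords map to the same tag so that one OCR miss doesn't lose the tag.
KEYWORD_TAGS = {
    "acme broadband": "ISP",
    "www.acmebroadband.example": "ISP",
    "555-0100": "ISP",
    "walmart": "Walmart",
    "susan": "Susan",
    "jenny": "Jenny",
}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so OCR line breaks don't split a keyword."""
    return re.sub(r"\s+", " ", text.lower())


def auto_tag(ocr_text: str, keyword_tags: dict[str, str] = KEYWORD_TAGS) -> set[str]:
    """Return every tag whose keyword occurs as a substring of the OCR text."""
    haystack = normalize(ocr_text)
    return {
        tag
        for keyword, tag in keyword_tags.items()
        if normalize(keyword) in haystack
    }


# The phone number is misread ("O" instead of "0"), but the ISP name still matches.
scanned = "ACME  Broadband\nInvoice for Susan Smith\nCall 555-O100"
print(auto_tag(scanned))
```

Plain substring matching keeps the table easy to read and tweak, at the cost of occasional false positives (for example, "susan" also matches "Susanna"). Word-boundary or fuzzy matching would be a natural refinement if that becomes a problem.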