by pierrefar
6635 days ago
I'm playing with some code for this just for fun. There are two ways to classify a document:

1. You already have a set of keywords (categories) that the document can belong to. The objective is to match a document to its category. This is true classification.
2. You want to extract relevant keywords from a document. This is not classification in the true sense but keyword extraction.

Each calls for different approaches, but they are similar problems. (As an example of the similarity: you have a set of categories, each defined by a tag cloud. You extract keywords from a document and see which tag cloud it matches best.)

So how do you do each one?

Classification: I'm not well versed in this area and I'm interested in learning; it's next on my to-do list.

Keyword Extraction: Yahoo! has an API to do that, but honestly, it's rubbish. I don't know how it "works", but it doesn't really. Open Calais is really good but has a noticeable error rate (I didn't quantify it, but after trying many documents with it, I regularly noticed minor mistakes).

Hope this helps.
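
For what it's worth, the tag-cloud matching idea can be sketched in a few lines of Python. This is only a toy under stated assumptions, not how Yahoo!'s API or Open Calais work: keywords are just the most frequent non-stopword terms, the match score is plain Jaccard overlap, and the names `extract_keywords`, `classify`, `STOPWORDS`, and the sample tag clouds are all made up for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "for", "on", "with", "that", "this"}


def extract_keywords(text, top_n=5):
    """Return the top_n most frequent non-stopword terms in text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]


def classify(text, tag_clouds, top_n=5):
    """Pick the category whose tag cloud overlaps most with the keywords."""
    keywords = set(extract_keywords(text, top_n))

    def jaccard(tags):
        # Jaccard overlap: shared terms divided by all distinct terms.
        tags = set(tags)
        union = keywords | tags
        return len(keywords & tags) / len(union) if union else 0.0

    return max(tag_clouds, key=lambda name: jaccard(tag_clouds[name]))


# Hypothetical categories, each defined by a small tag cloud.
tag_clouds = {
    "sports": ["football", "goal", "team", "match"],
    "cooking": ["recipe", "oven", "flour", "bake"],
}
doc = ("Bake the flour mixture in the oven. "
       "This recipe needs a hot oven and good flour.")

print(extract_keywords(doc))
print(classify(doc, tag_clouds))
```

Raw term frequency is about the crudest keyword extractor there is; something like TF-IDF over a larger corpus would probably pick better keywords, but the matching step would stay the same.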