by pierrefar 6635 days ago
I'm playing with some code for this just for fun. There are two things people mean by classifying a document:

1. You already have a set of keywords (categories) that the document can belong to. The objective is to match a document to its category. This is true classification.

2. You want to extract relevant keywords from a document. This is not classification in the true sense but keyword extraction.

Each one calls for different approaches, but the two problems are closely related. (As an example of how they connect: say you have a set of categories, each defined by a tag cloud. You extract keywords from a document and see which tag cloud it matches best.)
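To make the tag cloud idea concrete, here is a rough Python sketch. It is only an illustration under my own assumptions, not how any real service does it: the stopword list, the tag clouds and their weights, and the function names (extract_keywords, overlap_score, best_category) are all made up, keyword extraction is plain word counting, and the match score is a simple weighted overlap.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it",
             "for", "on", "with", "by"}


def extract_keywords(text, top_n=5):
    """Return the top_n most frequent non-stopword terms as {term: count}."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return dict(counts.most_common(top_n))


def overlap_score(keywords, tag_cloud):
    """Sum of (keyword count * tag weight) over terms present in both."""
    return sum(count * tag_cloud[term]
               for term, count in keywords.items()
               if term in tag_cloud)


def best_category(text, tag_clouds):
    """Pick the category whose tag cloud overlaps most with the keywords."""
    keywords = extract_keywords(text)
    return max(tag_clouds,
               key=lambda name: overlap_score(keywords, tag_clouds[name]))


# Hypothetical categories, each defined by a weighted tag cloud.
TAG_CLOUDS = {
    "programming": {"code": 3, "python": 2, "compiler": 2, "bug": 1},
    "cooking": {"recipe": 3, "oven": 2, "flour": 2, "bake": 1},
}

doc = ("Mix the flour, preheat the oven, and follow the recipe. "
       "The recipe says to bake for an hour.")
print(best_category(doc, TAG_CLOUDS))
```

Raw counts like this favour long documents and common words, so anything serious would probably want some weighting (TF-IDF or similar) and a normalised similarity measure, but the overall shape stays the same: extract keywords, score each tag cloud, take the best.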

So how do you do each one?

Classification: I'm not well versed in this area and I'm interested in learning - it's next on my to-do list.

Keyword Extraction: Yahoo! has an API to do that, but honestly, it's rubbish. I don't know how it "works" but it doesn't really. Open Calais is really good but has a noticeable error rate (I didn't quantify it but after trying many documents with it, I regularly noticed minor mistakes).

Hope this helps.