Hacker News
by kaddar 5262 days ago
To stick with good first-step approaches, look at n-grams and n-words.

Basically, you need a reasonable feature to match similarity on. N-words are pretty easy to construct: a 2-gram would be every pair of adjacent words in a document.
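A rough sketch of building those word pairs, in Python. The function name `word_ngrams` and the lowercase-and-split-on-whitespace tokenization are just assumptions for illustration; real text would probably want punctuation handling too.

```python
def word_ngrams(text, n=2):
    """Return every run of n adjacent words in text, as tuples."""
    # Assumption: lowercase and split on whitespace; no punctuation stripping.
    words = text.lower().split()
    # Slide a window of size n across the word list.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


bigrams = word_ngrams("the quick brown fox jumps")
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps')]
```

A document shorter than n words simply yields an empty list.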

Tf-idf is a good weighting to use with that kind of feature, because it corrects for the bias toward frequent words like "the".
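To make that concrete, here is a minimal sketch of tf-idf weighting plus cosine similarity in plain Python. The names `tfidf_vectors` and `cosine` are made up for this example, and the unsmoothed idf, log(N / df), is one common variant chosen as an assumption; libraries often use smoothed forms. Single words are used as features to keep it short, but tuples of adjacent words would work the same way.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of feature lists. Returns one {feature: weight} dict per doc."""
    n_docs = len(docs)
    # Document frequency: the number of docs each feature appears in.
    df = Counter(feature for doc in docs for feature in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumption: tf is the relative frequency, idf is log(N / df), unsmoothed.
        vectors.append({
            f: (count / len(doc)) * math.log(n_docs / df[f])
            for f, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {feature: weight} vectors."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]
vectors = tfidf_vectors(docs)
sim_01 = cosine(vectors[0], vectors[1])  # share "sat", so above zero
sim_02 = cosine(vectors[0], vectors[2])  # share only "the", which weighs nothing
```

Since "the" shows up in every document, its idf is log(1) = 0, so it contributes nothing to the similarity; that is the bias correction in action.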