by kaddar 5308 days ago

To stick with good first-step approaches, look at n-grams and n-words.

Basically, you need a reasonable feature to match similarity on. N-words are pretty easy to construct: a 2-gram would be every pair of adjacent words used in a document.
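As a rough sketch of what that looks like, here is one way to pull word n-grams out of a string in Python. The function name `word_ngrams` and the lowercase-and-split-on-whitespace tokenization are my own assumptions for illustration; real text would probably want punctuation handling too.

```python
def word_ngrams(text, n=2):
    """Return the list of n-word sequences (as tuples) in text.

    Tokenization here is just lowercase + whitespace split, which is
    an assumption made to keep the sketch short.
    """
    words = text.lower().split()
    # Slide a window of n words across the document.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


bigrams = word_ngrams("The cat sat on the mat", 2)
# bigrams is [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
```

A document shorter than n words simply yields an empty list.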

Tf-idf is a good weighting to use with that kind of feature, because it discounts frequent words like "the" that show up in nearly every document and would otherwise dominate the similarity score.
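To make the weighting concrete, here is a minimal plain-Python sketch of tf-idf vectors plus cosine similarity. The names `tfidf_vectors` and `cosine` are hypothetical, and I am assuming the simplest variant (relative term frequency times log(N / document frequency)); libraries usually add smoothing and normalization on top of this.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs is a list of feature lists (words or n-gram tuples).

    Returns one {feature: weight} dict per document, using the
    unsmoothed form tf * log(N / df), which is an assumption here.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each feature appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            feat: (count / len(doc)) * math.log(n_docs / df[feat])
            for feat, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {feature: weight} vectors."""
    dot = sum(w * b.get(feat, 0.0) for feat, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


docs = [["the", "cat"], ["the", "dog"], ["the", "cat", "sat"]]
vecs = tfidf_vectors(docs)
# "the" is in every document, so its idf is log(3/3) = 0 and it drops out.
```

With these toy documents, the first and third share "cat" and get a positive similarity, while the first and second share only "the" and score zero.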