by kaddar
5262 days ago
To stick with good first-step approaches, look at n-grams and n-words. Basically, you need a reasonable feature to match similarity on. N-words are pretty easy to construct: the 2-grams of a document are all of its pairs of adjacent words. Tf-idf is a good weighting for that kind of feature, because it down-weights very frequent terms like "the" that would otherwise dominate the comparison.
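To make that concrete, here is a rough sketch in plain Python of word 2-grams weighted by tf-idf and compared with cosine similarity. The function names (`word_ngrams`, `tfidf_vectors`, `cosine`) are made up for illustration, the tokenizer is just a lowercase whitespace split, and the idf used is the plain log(N / df) variant. Real libraries usually smooth it, so treat this as one possible setup rather than the canonical one.

```python
import math
from collections import Counter


def word_ngrams(text, n=2):
    """Return the n-word tuples formed by adjacent words in text."""
    words = text.lower().split()  # naive tokenizer, an assumption of this sketch
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def tfidf_vectors(docs, n=2):
    """Build one sparse tf-idf vector (a dict) per document over word n-grams."""
    counts = [Counter(word_ngrams(doc, n)) for doc in docs]

    # Document frequency: in how many documents does each n-gram occur?
    df = Counter()
    for c in counts:
        df.update(c.keys())

    num_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1  # avoid dividing by zero on empty docs
        # tf = relative frequency in the document, idf = log(N / df).
        # An n-gram present in every document gets idf = log(1) = 0.
        vectors.append({
            gram: (k / total) * math.log(num_docs / df[gram])
            for gram, k in c.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(gram, 0.0) for gram, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Calling `tfidf_vectors` on a list of strings and then `cosine` on any two of the resulting vectors gives a similarity score. N-grams that show up in every document end up with weight zero, which is the "frequent words" effect described above.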