by wodenokoto 3825 days ago

> Reducing the sparsity brought that down to about 3,100 unique words [from 30,600 unique words]

What does that mean? Does he remove words that are only said once or twice?

Can anyone point me to a text explaining the difference between identifying characteristic words using log-likelihood and using tf-idf?

1 comment

Relevant line in code:

   # remove sparse terms
   all.tdm.75 <- removeSparseTerms(all.tdm, 0.75) # 3117 / 728215

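I believe it is related to the document-frequency side of tf-idf rather than to tf-idf weighting itself: removeSparseTerms drops terms that are absent from too large a share of the documents, so it is not simply removing words that occur once or twice overall.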
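
To make the 0.75 concrete, here is a minimal base-R sketch of the filtering rule as I understand it from the tm documentation: a term is kept only if it appears in more than (1 - sparse) of the documents, so 0.75 keeps terms found in more than 25% of them. The toy matrix and the helper name `remove_sparse_rows` are made up for illustration. This is not tm's actual code, and it works on a plain matrix rather than a real TermDocumentMatrix.

```r
# Toy term-document matrix: rows are terms, columns are documents.
# (Hypothetical data, only to illustrate the thresholding rule.)
tdm <- matrix(
  c(3, 1, 2, 4,   # "the":  present in 4 of 4 documents
    0, 2, 0, 1,   # "word": present in 2 of 4 documents
    0, 0, 5, 0),  # "rare": present in 1 of 4 documents
  nrow = 3, byrow = TRUE,
  dimnames = list(c("the", "word", "rare"), paste0("doc", 1:4))
)

# Hypothetical helper mimicking the documented behaviour of
# tm::removeSparseTerms on a plain matrix: keep a term only if the
# number of documents containing it exceeds n_docs * (1 - sparse).
remove_sparse_rows <- function(m, sparse) {
  doc_count <- rowSums(m > 0)                 # documents containing each term
  keep <- doc_count > ncol(m) * (1 - sparse)  # document-frequency cutoff
  m[keep, , drop = FALSE]
}

kept <- remove_sparse_rows(tdm, 0.75)
print(rownames(kept))
```

With four documents the cutoff is one document, so "rare" is dropped while "the" and "word" survive. Note that the rule looks only at how many documents contain a term, not at how often the term occurs in total, which is why it behaves like a document-frequency filter rather than a tf-idf score.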