by myffical 5787 days ago
You need to massage your data to get more meaningful results.

It might be interesting to compare your word counts with those from a general-purpose word corpus, then pick out the words that appear more frequently by a statistically significant amount. Something like Amazon's statistically improbable phrases algorithm.
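
One possible reading of "statistically significant" here is a two-proportion z-test on each word's frequency in the two corpora. The sketch below assumes that reading; it is not Amazon's algorithm, and the function name and toy counts are made up for illustration.

```python
import math

def overuse_z_score(narrow_count, narrow_total, general_count, general_total):
    """Two-proportion z-score: positive when the word is relatively more
    frequent in the narrow corpus than in the general-purpose one."""
    p_narrow = narrow_count / narrow_total
    p_general = general_count / general_total
    # Pooled proportion under the null hypothesis of equal frequency.
    pooled = (narrow_count + general_count) / (narrow_total + general_total)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / narrow_total + 1 / general_total))
    if std_err == 0:
        return 0.0  # word absent from both corpora, or present everywhere
    return (p_narrow - p_general) / std_err

# Hypothetical toy counts: (count in HN titles, count in general corpus).
z_startup = overuse_z_score(30, 100, 5, 1000)
z_the = overuse_z_score(50, 100, 900, 1000)
```

A word would then count as overrepresented when its z-score clears a chosen one-sided threshold (1.645 for roughly the 5% level). With real data you would also want to account for testing thousands of words at once.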

1 comments

As a simple heuristic for ranking words by improbability/relevance, I'd suggest each word's contribution to the K-L divergence from the frequencies in the general-purpose word corpus:

P ln(P/Q)

where P is the frequency of the word in the narrow corpus (HN titles)

and Q is the frequency of the word in the general-purpose corpus

(The formula breaks down if Q is ever zero. That won't happen if the broader corpus includes the narrower one, as it should, but as a practical matter just set Q := (1-a)*Q + a*P for a small positive a, which simulates merging the smaller corpus into the larger.)

http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_diverg...

Anybody with more time than I have at the moment want to code this up?
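
A minimal sketch of the heuristic above, assuming the word counts are already available as dictionaries. The function name, the default a=0.01, and the toy counts are illustrative assumptions, not anything from the thread.

```python
import math
from collections import Counter

def kl_contributions(narrow_counts, general_counts, a=0.01):
    """Score each narrow-corpus word by P * ln(P / Q'),
    where Q' = (1 - a) * Q + a * P is the smoothed general frequency."""
    narrow_total = sum(narrow_counts.values())
    general_total = sum(general_counts.values())
    scores = {}
    for word, count in narrow_counts.items():
        if count <= 0:
            continue  # P = 0 contributes nothing to the divergence
        p = count / narrow_total
        q = general_counts.get(word, 0) / general_total
        # Smoothing keeps the ratio finite when the word is missing from
        # the general corpus (Q = 0).
        q_smoothed = (1 - a) * q + a * p
        scores[word] = p * math.log(p / q_smoothed)
    return scores

# Hypothetical toy counts standing in for HN titles and a general corpus.
hn_titles = Counter({"startup": 30, "the": 50, "python": 20})
general = Counter({"the": 900, "startup": 5, "cat": 95})

scores = kl_contributions(hn_titles, general)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Words that are rarer in the narrow corpus than in the general one get a negative score, so they sink to the bottom of the ranking. Note that a also caps the score of a word missing from the general corpus at P * ln(1/a), so it controls how heavily such words are rewarded.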