by waldrews 5788 days ago
I'd suggest, as a simple heuristic for ranking words by improbability/relevance, each word's contribution to the K-L divergence of the narrow corpus from the frequencies in a general-purpose word corpus:

P * ln(P/Q)

where P is the frequency of the word in the narrow corpus (HN titles)

and Q is the frequency of the word in the general-purpose corpus

(The formula doesn't work if Q is ever zero. This won't happen if the broader corpus includes the narrower one, as it should, but as a practical matter just set Q := (1-a)*Q + a*P for some small positive a, which simulates merging the smaller corpus into the larger.)
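
For anyone who wants a starting point, here is a rough Python sketch of the scoring described above. It assumes both corpora are already tokenized into lists of words; the function names (kl_contributions, rank_words) and the default a=0.01 are my own placeholders, not anything standard, so tune them to taste.

```python
import math
from collections import Counter


def kl_contributions(narrow_words, general_words, a=0.01):
    """Score each word in the narrow corpus by P * ln(P / Q).

    P is the word's relative frequency in the narrow corpus, Q its
    relative frequency in the general corpus, smoothed as
    Q := (1 - a) * Q + a * P so that Q is never zero.
    The default a=0.01 is an arbitrary assumption.
    """
    if not 0 < a < 1:
        raise ValueError("a must be strictly between 0 and 1")

    narrow_counts = Counter(narrow_words)
    general_counts = Counter(general_words)
    narrow_total = sum(narrow_counts.values())
    general_total = sum(general_counts.values())

    scores = {}
    for word, count in narrow_counts.items():
        p = count / narrow_total
        # Counter returns 0 for words missing from the general corpus.
        q = general_counts[word] / general_total if general_total else 0.0
        q = (1 - a) * q + a * p  # simulate merging the narrow corpus in
        scores[word] = p * math.log(p / q)
    return scores


def rank_words(narrow_words, general_words, a=0.01):
    """Return narrow-corpus words sorted from highest to lowest score."""
    scores = kl_contributions(narrow_words, general_words, a)
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Toy data only; real input would be HN titles and a large word list.
    titles = "show hn my startup uses rust ask hn is rust the future".split()
    general = "the of and is a the my future of the and a is the".split()
    for word in rank_words(titles, general)[:5]:
        print(word)
```

Words that are common in the narrow corpus but rare in the general one float to the top, while words over-represented in the general corpus get negative scores and sink to the bottom.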

http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_diverg...

Anybody with more time than I have at the moment want to code this up?