by waldrews
5788 days ago
As a simple heuristic for ranking words by improbability/relevance, I'd suggest each word's contribution to the K-L divergence from the frequencies in a general-purpose word corpus: P * ln(P/Q), where P is the frequency of the word in the narrow corpus (HN titles) and Q is its frequency in the general-purpose corpus.

The formula breaks down if Q is ever zero. That can't happen if the broader corpus includes the narrower one, as it should, but as a practical matter you can just set Q := (1-a)*Q + a*P for a small positive a, which simulates merging the smaller corpus into the larger one.

http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_diverg...

Anybody with more time than I have at the moment want to code this up?
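A minimal sketch of the heuristic described above, in Python. This is only one way to read it: the function name `kl_contributions`, the word-list inputs, and the default `a=0.01` are assumptions made for illustration, and tokenizing the titles is left out entirely.

```python
import math
from collections import Counter


def kl_contributions(narrow_words, general_words, a=0.01):
    """Rank words from the narrow corpus by P * ln(P / Q).

    narrow_words / general_words are lists of tokens (hypothetical inputs).
    a is the small positive smoothing weight from Q := (1-a)*Q + a*P;
    the default of 0.01 is an arbitrary choice, not a recommendation.
    """
    narrow = Counter(narrow_words)
    general = Counter(general_words)
    n_total = sum(narrow.values())
    g_total = sum(general.values())

    scores = {}
    for word, count in narrow.items():
        p = count / n_total
        # Counter returns 0 for words missing from the general corpus.
        q = general[word] / g_total if g_total else 0.0
        # Smooth Q toward P so that Q is never zero when P > 0.
        q = (1 - a) * q + a * p
        scores[word] = p * math.log(p / q)

    # Highest contribution first: most over-represented in the narrow corpus.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    titles = ["startup", "startup", "the", "lisp"]
    general = ["the"] * 8 + ["startup", "a"]
    for word, score in kl_contributions(titles, general):
        print(f"{word}\t{score:.3f}")
```

With this smoothing, a word that never appears in the general corpus gets Q = a*P, so its score is P * ln(1/a). That means the choice of a directly sets how heavily unseen words are rewarded. Words that are rarer in the narrow corpus than in the general one come out with negative scores and sink to the bottom of the ranking.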