by barneso 4132 days ago
I would still expect it to tend towards the normal distribution across a large set of documents. If you model the positive and negative word counts as binomial, the score is the difference of two samples from different binomial distributions, which would still tend towards normal (I think, though I'm not 100% sure; it's certainly true in my experience). A logarithm would skew away from positive to negative sentiment and is undefined for negative values.
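A quick simulation can sanity-check the claim. This is only a sketch under stated assumptions: the positive and negative counts are treated as independent binomials, and the document length and hit probabilities (`WORDS`, `P_POS`, `P_NEG`) are made-up illustrative values, not anything measured from real text.

```python
import random
import statistics

# Illustrative assumptions, not taken from any real corpus.
DOCS = 2000    # number of simulated documents
WORDS = 200    # words per document
P_POS = 0.06   # chance a word matches the positive list
P_NEG = 0.04   # chance a word matches the negative list


def binomial(rng, n, p):
    """Draw one Binomial(n, p) sample as a sum of n Bernoulli trials."""
    return sum(rng.random() < p for _ in range(n))


def simulate_differences(docs, words, p_pos, p_neg, seed=0):
    """Return (positive count - negative count) for each simulated document."""
    rng = random.Random(seed)
    return [
        binomial(rng, words, p_pos) - binomial(rng, words, p_neg)
        for _ in range(docs)
    ]


diffs = simulate_differences(DOCS, WORDS, P_POS, P_NEG)
mean = statistics.mean(diffs)
stdev = statistics.pstdev(diffs)

# Moments of the difference of two independent binomials.
expected_mean = WORDS * P_POS - WORDS * P_NEG
expected_sd = (WORDS * P_POS * (1 - P_POS) + WORDS * P_NEG * (1 - P_NEG)) ** 0.5

# For a roughly normal sample, about two thirds of values lie within one
# standard deviation of the mean (the match is loose because scores are integers).
within_one_sd = sum(abs(d - mean) <= stdev for d in diffs) / len(diffs)

print(f"mean={mean:.2f} (theory {expected_mean:.2f})")
print(f"sd={stdev:.2f} (theory {expected_sd:.2f})")
print(f"share within 1 sd={within_one_sd:.2f}")
```

Note that this only checks the idealised binomial model. It says nothing about whether the word-list counts are a good estimate of the document's actual sentiment.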

It only tends to a Normal distribution if you estimate P(negative|matches in -ve list) & P(positive|matches in +ve list) with an unbiased, consistent estimator.

A simple 1-gram model like the one in the question misses many complexities of natural language, e.g. negation ("not bad" != "bad"), so you would expect your estimator to over-represent whichever dictionary contains more words that are equivalent to an adverb-adjusted phrase. E.g. "not bad" can be described as 'terrible' more readily than 'very good' can be described as 'excellent', since people assign a hyperbolic weighting to their own happiness (utility theory 101).
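To make the negation problem concrete, here is a minimal sketch of a 1-gram bag-of-words scorer. The word lists and the `unigram_score` function are hypothetical stand-ins for whatever dictionaries the question used.

```python
# Hypothetical toy dictionaries, for illustration only.
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "awful", "terrible"}


def unigram_score(text):
    """Score = positive-list matches minus negative-list matches, word by word."""
    tokens = text.lower().split()
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives


# Each word is scored in isolation, so modifiers such as "not" and "very"
# are ignored and these pairs receive identical scores.
for phrase in ("bad", "not bad", "good", "very good"):
    print(f"{phrase!r}: {unigram_score(phrase)}")
```

Since "not bad" and "bad" get the same score, any systematic pattern in how people use negation or intensifiers shows up as bias in the estimator rather than as noise that averages out.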

The sentiment would only tend to a normal distribution if we had perfect estimators for document sentiment, which would require advanced POS tagging and models more complex than a 1-gram bag-of-words aggregation :)