Hacker News new | ask | show | jobs
by SergeyHack 4132 days ago
I like your "2." suggestion more, because the initial sentiment score distribution can be not normal.

So there is an option to try making it normal by taking logarithm for example and calculating mean, etc. after that.

1 comments

I would still expect it to tend towards the normal distribution across a large set of documents. If you model positive and negative word counts as a binomial distribution, you have the the difference of two samples from different binomial distributions which would still tend towards normal (I think, though I'm not 100% sure, certainly it's true within my experience). A logarithm would skew away from positive to negative sentiment and is undefined for negative values.
It only tends to a Normal distribution if you estimate P(negative|matches in -ve list) & P(positive|matches in +ve list) with an unbiased, consistent estimator.

A simple 1-gram model like in the question does not model many complexities of natural language e.g. negation ("not bad" != "bad") so you would expect your estimator to over-represent the dictionary with more words that are equal to their adverb-adjusted equivalent. e.g. "not bad" can be described as 'terrible' more readily than 'very good' can be described as excellent since people assign a hyperbolic weighting to their own happiness (utility theory 101)

The sentiment would only tend to a normal distribution if we had perfect estimators for document sentiment which requires advanced POS tagging and models more complex than a 1-gram bag of words aggregation :)