by barneso 4132 days ago
You are right; this does just shift the bias, which is sometimes all you need (you have a simple algorithm, presumably for a reason).

I had misunderstood: you don't have a training set, just a list of positive and negative words. You could still apply a similar idea.

You could test your hypothesis that the score is biased by looking at the average number of positive and negative words per document, and then slightly modify your factors. For example, if you found that the average document had 6 negative words and 4 positive words, but you think that the average sentiment is neutral across your documents, you could multiply the positive word count by 1.5 (6 / 4). It's a less brutal way to accomplish a similar outcome without increasing the complexity of your algorithm.
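A minimal sketch of that reweighting, assuming documents are already tokenized. The word lists and the names `count_hits`, `positive_weight` and `score` are made up for illustration; they are not from the original question.

```python
# Hypothetical word lists; substitute your own.
POSITIVE = {"good", "great", "happy"}
NEGATIVE = {"bad", "poor", "sad", "awful"}

def count_hits(tokens):
    """Return (positive_count, negative_count) for one tokenized document."""
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    return pos, neg

def positive_weight(docs):
    """Factor that equalizes total positive and negative hits across the corpus.

    Assumes the corpus as a whole is neutral on average.
    """
    total_pos = total_neg = 0
    for tokens in docs:
        pos, neg = count_hits(tokens)
        total_pos += pos
        total_neg += neg
    if total_pos == 0:
        raise ValueError("no positive words found; cannot compute a weight")
    return total_neg / total_pos

def score(tokens, weight):
    """Reweighted score: positive hits are scaled, negative hits are not."""
    pos, neg = count_hits(tokens)
    return weight * pos - neg

docs = [
    ["good", "great", "bad", "poor", "sad"],
    ["happy", "good", "awful", "bad", "sad"],
]
w = positive_weight(docs)  # 6 negative hits / 4 positive hits
```

With 4 positive and 6 negative hits in total, the weight comes out at 1.5, and a document with 2 positive and 3 negative words scores 0 instead of -1.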

Otherwise, you will need to use an algorithm that has more discrimination power, and this will likely mean you need a training set. You can go very deep down that rabbit hole, but I would consider starting with Naive Bayes, which essentially learns a weight per positive and negative word and combines them in a similar manner to how you're doing so now. It has the advantage of being a simple algorithm.
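To make the "weight per word" point concrete, here is a rough sketch of multinomial Naive Bayes with add-one smoothing in plain Python. The function names and the toy training data are invented for illustration, and a real project would more likely use an existing library.

```python
import math
from collections import Counter

def train(labeled_docs, alpha=1.0):
    """labeled_docs: list of (tokens, label). Returns (log_prior, log_like)."""
    word_counts = {}
    doc_counts = Counter()
    vocab = set()
    for tokens, label in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    total_docs = sum(doc_counts.values())
    log_prior = {c: math.log(n / total_docs) for c, n in doc_counts.items()}
    log_like = {}
    for c, counts in word_counts.items():
        # Laplace smoothing so unseen (word, class) pairs get nonzero probability.
        denom = sum(counts.values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((counts[w] + alpha) / denom) for w in vocab}
    return log_prior, log_like

def classify(tokens, log_prior, log_like):
    """Pick the class with the highest log posterior; unknown words are skipped."""
    scores = {
        c: log_prior[c] + sum(log_like[c][w] for w in tokens if w in log_like[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy labeled data (hypothetical).
training = [
    (["good", "great"], "pos"),
    (["good", "film"], "pos"),
    (["bad", "awful"], "neg"),
    (["bad", "film"], "neg"),
]
log_prior, log_like = train(training)

# The learned per-word weight is the log-likelihood ratio between classes.
weight_good = log_like["pos"]["good"] - log_like["neg"]["good"]
```

Summing these per-word weights over a document is the same kind of aggregation as counting list hits, except that the weights come from labeled data rather than being fixed at +1 and -1.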


Reweighting sentiment by looking at the number of occurrences of positive and negative words in my assumed neutral corpus is a great idea :)

Will implement and report back

I've looked into using Naive Bayes, but my understanding is that you need labeled training documents, and then I face the problem of scoring documents, which introduces subjectivity compared to just counting the 'sentiment words'.

I understand complexity is needed to deal with negation ('not bad' != 'bad'), but I'd imagine that the sentiment scoring process would be the same regardless of algorithm, which brings us back to the problem of how to correct bias in 'word list' asymmetries.

I like your "2." suggestion more, because the initial sentiment score distribution may not be normal.

So there is an option to try making it normal, for example by taking the logarithm and calculating the mean, etc. after that.

I would still expect it to tend towards the normal distribution across a large set of documents. If you model positive and negative word counts as binomial distributions, you have the difference of two samples from different binomial distributions, which would still tend towards normal (I think, though I'm not 100% sure; certainly it's true within my experience). A logarithm would skew away from positive to negative sentiment and is undefined for negative values.
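A quick simulation can sanity-check the binomial argument. This is only a sketch under simplifying assumptions: positive and negative counts are drawn independently with the same number of words per document, and the probabilities 0.04 and 0.06 are arbitrary illustrative values.

```python
import random
import statistics

def simulate_scores(n_docs, n_words, p_pos, p_neg, seed=0):
    """Score = (#positive hits) - (#negative hits), each count binomial."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_docs):
        pos = sum(rng.random() < p_pos for _ in range(n_words))
        neg = sum(rng.random() < p_neg for _ in range(n_words))
        scores.append(pos - neg)
    return scores

def skewness(xs):
    """Sample skewness (third standardized moment); near 0 for a normal shape."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

scores = simulate_scores(n_docs=2000, n_words=200, p_pos=0.04, p_neg=0.06)
mean_score = statistics.fmean(scores)  # expected near 200 * (0.04 - 0.06) = -4
skew = skewness(scores)                # expected close to 0
```

The mean sits near -4 rather than 0, which is the word-list bias being discussed, while the skewness stays small, so the shape is roughly symmetric around a shifted center.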

It only tends to a Normal distribution if you estimate P(negative|matches in -ve list) and P(positive|matches in +ve list) with an unbiased, consistent estimator.

A simple 1-gram model like the one in the question does not model many complexities of natural language, e.g. negation ("not bad" != "bad"), so you would expect your estimator to over-represent the dictionary that has more words equal to their adverb-adjusted equivalents. For example, "not bad" can be described as 'terrible' more readily than 'very good' can be described as 'excellent', since people assign a hyperbolic weighting to their own happiness (utility theory 101).
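The negation problem is easy to reproduce with a tiny word-count scorer. The word lists and the `unigram_score` helper below are hypothetical stand-ins for the approach in the question.

```python
# Hypothetical word lists for illustration only.
POSITIVE = {"good", "excellent"}
NEGATIVE = {"bad", "terrible"}

def unigram_score(text):
    """Count positive hits minus negative hits, ignoring word order."""
    tokens = text.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return pos - neg

plain = unigram_score("bad")
negated = unigram_score("not bad")
```

Both phrases get the same score, because "not" is not in either list and a bag of words has no notion of what it modifies.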

The sentiment would only tend to a normal distribution if we had perfect estimators for document sentiment, which requires advanced POS tagging and models more complex than a 1-gram bag-of-words aggregation :)