by barneso 4132 days ago
Two simple things you could do:

1. Insert each negative example six times into your training set (or weight negative examples accordingly, i.e. use #positive matches - 6 * #negative matches / (2 * positive word count) as your score).

2. Take your distribution of sentiment scores as calculated over held out data (or the training set itself, but be warned that this will skew your results), and calculate the mean and standard deviation. Normalize your results by subtracting the mean and dividing by the standard deviation. You can then say that positive sentiment is > 0 and negative sentiment < 0, with the absolute value being the strength of the classification.
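A minimal sketch of suggestion 2, assuming you already have raw sentiment scores for a held-out set of documents. The function names and the sample scores are made up for illustration; only the subtract-the-mean, divide-by-the-standard-deviation step comes from the comment above.

```python
from statistics import mean, pstdev

def fit_normalizer(held_out_scores):
    """Estimate the mean and standard deviation of raw scores on held-out data."""
    mu = mean(held_out_scores)
    sigma = pstdev(held_out_scores)
    if sigma == 0:
        raise ValueError("scores have zero variance; cannot normalize")
    return mu, sigma

def normalize(score, mu, sigma):
    """Standardize a raw score: above 0 reads as positive, below 0 as negative."""
    return (score - mu) / sigma

# Hypothetical raw scores that are all biased towards positive.
held_out = [0.9, 0.7, 0.8, 0.6, 1.0]
mu, sigma = fit_normalizer(held_out)
z = [normalize(s, mu, sigma) for s in held_out]
```

After normalization the lowest raw score (0.6) comes out negative and the highest (1.0) positive, even though every raw score was above zero.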


I have a list of positive and negative words and a set of documents which I want to score so not sure if I have a 'training set'.

I think you mean to upweight my positive list by 6 (since it is 1/6 of the size of the negative list) but the problem with this is the same as my reply to the other comment where you just shift the bias:

Consider the sentence: 'there are strong and weak divisions in company X's Europe operations'

The only word matches in your word lists are 'strong' on your positive list and 'weak' on your negative list.

If you weight these counts as you describe, your sentiment for this sentence will be -0.44 + 1 = 0.66 even though the sentence is clearly 'neutral' and should have a score of 0.

You are right; this does just shift the bias, which is sometimes all you need (you have a simple algorithm, presumably for a reason).

I did misunderstand that you don't have a training set, just a list of positive and negative words. You could still apply a similar idea.

You could test your hypothesis that the score is biased by looking at the average number of positive and negative words per document, and slightly modify your factors. For example if you found that the average document had 6 negative words and 4 positive words, but you think that the average sentiment is neutral across your documents, you could multiply the positive word count by 1.5. It's a less brutal way to accomplish a similar outcome without increasing the complexity of your algorithm.
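Roughly, that reweighting could look like the sketch below. It assumes documents are already tokenized into lists of lowercase words and that the corpus is neutral on average; `positive_weight`, `score`, and the toy word lists are hypothetical names, not anything from the original setup.

```python
def count_hits(tokens, word_list):
    """Number of tokens that appear in the given word list."""
    return sum(tok in word_list for tok in tokens)

def positive_weight(docs, positive_words, negative_words):
    """Factor that makes the corpus-wide positive and negative counts balance,
    under the assumption that the corpus is neutral on average."""
    pos_total = sum(count_hits(doc, positive_words) for doc in docs)
    neg_total = sum(count_hits(doc, negative_words) for doc in docs)
    if pos_total == 0:
        raise ValueError("no positive words found in the corpus")
    return neg_total / pos_total

def score(tokens, positive_words, negative_words, pos_weight):
    """Weighted count difference: positive hits are scaled, negative hits are not."""
    pos = count_hits(tokens, positive_words)
    neg = count_hits(tokens, negative_words)
    return pos_weight * pos - neg

# Toy corpus matching the example: 4 positive and 6 negative words per document.
positive_words = {"good"}
negative_words = {"bad"}
doc = ["good"] * 4 + ["bad"] * 6 + ["the"]
w = positive_weight([doc], positive_words, negative_words)  # 6 / 4 = 1.5
```

With that factor the "average" document scores 1.5 * 4 - 6 = 0, which is the neutral baseline you were assuming.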

Otherwise, you will need to use an algorithm that has more discriminative power, and this will likely mean you need a training set. You can go very deep down that rabbit hole, but I would consider starting with Naive Bayes, which essentially learns a weight per positive and negative word and combines them in a similar manner to how you're doing so now. It has the advantage of being a simple algorithm.
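To make the "weight per word" point concrete, here is a small pure-Python sketch of a multinomial Naive Bayes with add-one smoothing. It assumes you have labeled, tokenized documents with both classes present; the tiny training set, the function names, and the choice to ignore unseen words at scoring time are all illustrative assumptions.

```python
import math
from collections import Counter

def train_nb(labeled_docs):
    """labeled_docs: list of (tokens, label), label in {'pos', 'neg'}.
    Returns per-word log-odds weights and the prior log-odds.
    Assumes at least one document of each class."""
    counts = {"pos": Counter(), "neg": Counter()}
    n_docs = Counter()
    for tokens, label in labeled_docs:
        counts[label].update(tokens)
        n_docs[label] += 1
    vocab = set(counts["pos"]) | set(counts["neg"])
    # Add-one (Laplace) smoothing: every vocabulary word gets one extra count.
    totals = {c: sum(counts[c].values()) + len(vocab) for c in counts}
    weights = {
        w: math.log((counts["pos"][w] + 1) / totals["pos"])
           - math.log((counts["neg"][w] + 1) / totals["neg"])
        for w in vocab
    }
    prior = math.log(n_docs["pos"] / n_docs["neg"])
    return weights, prior

def nb_score(tokens, weights, prior):
    """Log-odds of positive vs negative. Words not seen in training are
    skipped here, which is a simplification."""
    return prior + sum(weights.get(t, 0.0) for t in tokens)

# Hypothetical labeled examples.
train = [
    ("strong growth strong profit".split(), "pos"),
    ("weak sales weak outlook".split(), "neg"),
    ("strong results".split(), "pos"),
    ("weak demand".split(), "neg"),
]
weights, prior = train_nb(train)
```

The score is still a sum of per-word contributions, like the word-count approach, except the weights are learned from the labeled data instead of being fixed at +1 and -1.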

Reweighting sentiment by looking at the number of occurrences of positive and negative words in my assumed neutral corpus is a great idea :)

Will implement and report back

I've looked into using Naive Bayes, but my understanding is that you need labeled training documents, and then I face the problem of scoring documents, which introduces subjectivity compared to just counting the 'sentiment words'.

I understand complexity is needed to deal with negation ('not bad' != 'bad'), but I'd imagine that the sentiment scoring process would be the same regardless of algorithm, which brings us back to the problem of how to correct bias in 'word list' asymmetries.

I like your "2." suggestion more, because the initial sentiment score distribution may not be normal.

So there is an option to try making it normal, for example by taking the logarithm and calculating the mean, etc. after that.

I would still expect it to tend towards the normal distribution across a large set of documents. If you model positive and negative word counts as binomial distributions, you have the difference of two samples from different binomial distributions, which would still tend towards normal (I think, though I'm not 100% sure; certainly it's true within my experience). A logarithm would skew away from positive to negative sentiment and is undefined for negative values.

It only tends to a Normal distribution if you estimate P(negative|matches in -ve list) & P(positive|matches in +ve list) with an unbiased, consistent estimator.

A simple 1-gram model like the one in the question does not model many complexities of natural language, e.g. negation ("not bad" != "bad"), so you would expect your estimator to over-represent the dictionary with more words that are equal to their adverb-adjusted equivalent. For example, "not bad" can be described as 'terrible' more readily than 'very good' can be described as 'excellent', since people assign a hyperbolic weighting to their own happiness (utility theory 101).

The sentiment would only tend to a normal distribution if we had perfect estimators for document sentiment which requires advanced POS tagging and models more complex than a 1-gram bag of words aggregation :)

I meant -0.167 + 1 = 0.83 > 0 therefore positive sentiment :)
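Spelling out that corrected arithmetic as a quick check: the weights below (1 for a positive match, 1/6 for a negative match) are an assumption inferred from the -0.167 and +1 in the figure above, and the one-word lists only contain the two matches from the example sentence.

```python
POS_WEIGHT = 1.0
NEG_WEIGHT = 1.0 / 6.0  # assumption: inferred from the -0.167 term

# Only the two matching words from the example; real lists would be far longer.
positive_words = {"strong"}
negative_words = {"weak"}

sentence = "there are strong and weak divisions in company X's Europe operations"
tokens = sentence.lower().split()

pos = sum(t in positive_words for t in tokens)  # 1 match: 'strong'
neg = sum(t in negative_words for t in tokens)  # 1 match: 'weak'

# 1 * 1 - (1/6) * 1, i.e. about 0.83, so the sentence is scored as positive.
weighted = POS_WEIGHT * pos - NEG_WEIGHT * neg
```

One positive and one negative match no longer cancel once the two lists carry different weights, which is the bias shift being described.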