by barneso
4132 days ago
You are right; this just shifts the bias, which is sometimes all you need (you have a simple algorithm, presumably for a reason). I misunderstood: you don't have a training set, just a list of positive and negative words. You could still apply a similar idea. Test your hypothesis that the score is biased by looking at the average number of positive and negative words per document, then adjust your factors slightly. For example, if you found that the average document had 6 negative words and 4 positive words, but you think the average sentiment across your documents is neutral, you could multiply the positive word count by 1.5. It's a less brutal way to get a similar outcome without increasing the complexity of your algorithm.

Otherwise, you will need an algorithm with more discriminative power, and that will likely mean you need a training set. You can go very deep down that rabbit hole, but I would consider starting with Naive Bayes, which essentially learns a weight for each positive and negative word and combines them in much the same way you are doing now. It has the advantage of being a simple algorithm.
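To make the 6-versus-4 example concrete, here is a minimal Python sketch of the rescaling. It assumes documents are already tokenized into lists of lowercase words and that the score is simply positive count minus negative count; the function names (`word_counts`, `balance_factor`, `score`) and the tiny word lists are made up for illustration and are not from your code.

```python
def word_counts(tokens, positive, negative):
    """Return (positive hits, negative hits) for one tokenized document."""
    pos = sum(1 for t in tokens if t in positive)
    neg = sum(1 for t in tokens if t in negative)
    return pos, neg


def balance_factor(docs, positive, negative):
    """Factor for the positive count that makes the corpus-wide score zero.

    Assumption: the true average sentiment of the corpus is neutral.
    """
    total_pos = total_neg = 0
    for tokens in docs:
        pos, neg = word_counts(tokens, positive, negative)
        total_pos += pos
        total_neg += neg
    if total_pos == 0:
        raise ValueError("no positive words found; cannot compute a factor")
    return total_neg / total_pos


def score(tokens, positive, negative, factor=1.0):
    """Word-count score with the positive count rescaled by `factor`."""
    pos, neg = word_counts(tokens, positive, negative)
    return factor * pos - neg


# Hypothetical word lists and corpus: 4 positive hits, 6 negative hits in total.
positive = {"good", "great"}
negative = {"bad", "poor", "awful"}
docs = [
    ["good", "bad", "bad", "poor"],
    ["great", "good", "awful", "bad"],
    ["good", "poor"],
]

factor = balance_factor(docs, positive, negative)
scores = [score(d, positive, negative, factor) for d in docs]
```

With these toy documents the factor comes out to 6 / 4 = 1.5, and the rescaled scores sum to zero across the corpus, which is exactly the neutrality assumption being imposed. If that assumption is wrong for your documents, the factor just moves the bias somewhere else.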
Will implement and report back.
I've looked into using Naive Bayes, but my understanding is that you need labeled training documents, and then I face the problem of scoring those documents by hand, which introduces subjectivity compared to just counting the 'sentiment words'.
I understand that more complexity is needed to deal with negation ('not bad' != 'bad'), but I'd imagine the sentiment scoring process would be the same regardless of algorithm, which brings us back to the problem of how to correct bias caused by 'word list' asymmetries.