by markovbling
4132 days ago
I have a list of positive and negative words and a set of documents I want to score, so I'm not sure I have a 'training set'.
I think you mean I should upweight my positive list by 6 (since it is 1/6 the size of the negative list), but this has the same problem I described in my reply to the other comment: it just shifts the bias.
Consider the sentence: 'there are strong and weak divisions in company X's Europe operations'
The only matches in the word lists are 'strong' on the positive list and 'weak' on the negative list. If you weight these counts as you describe, the sentiment for this sentence will be -0.44 + 1 = 0.56, even though the sentence is clearly 'neutral' and should have a score of 0.
|
I misunderstood: I didn't realize that you don't have a training set, just a list of positive and negative words. You could still apply a similar idea.
You could test your hypothesis that the score is biased by comparing the average number of positive and negative words per document, and then adjust your weights slightly. For example, if you found that the average document had 6 negative words and 4 positive words, but you believe the average sentiment across your documents is neutral, you could multiply the positive word count by 1.5. It's a less brutal way to achieve a similar outcome without increasing the complexity of your algorithm.
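Here is a rough sketch of that rebalancing in Python, in case it helps. The function names (`count_matches`, `balance_factor`, `score`), the tiny word lists, and the two toy documents are all made up for illustration, and it assumes your documents are already tokenized into lowercase words.

```python
def count_matches(tokens, lexicon):
    """Count how many tokens appear in the given word list."""
    return sum(1 for t in tokens if t in lexicon)


def balance_factor(docs, positive, negative):
    """Ratio of negative to positive matches over the whole corpus.

    With the same number of documents on both sides, this equals the
    ratio of the per-document averages (e.g. 6 / 4 = 1.5).
    """
    pos_total = sum(count_matches(d, positive) for d in docs)
    neg_total = sum(count_matches(d, negative) for d in docs)
    if pos_total == 0:
        raise ValueError("no positive-list matches in the corpus")
    return neg_total / pos_total


def score(tokens, positive, negative, factor):
    """Weighted positive count minus negative count."""
    return factor * count_matches(tokens, positive) - count_matches(tokens, negative)


# Hypothetical word lists and corpus: 4 positive and 6 negative matches in total.
positive = {"strong", "good"}
negative = {"weak", "bad"}
docs = [
    ["strong", "weak", "weak", "bad"],
    ["good", "good", "strong", "bad", "bad", "weak"],
]

factor = balance_factor(docs, positive, negative)
mixed = score(["strong", "and", "weak"], positive, negative, factor)
```

Note that a sentence with exactly one match from each list still scores `factor - 1` rather than 0, so this only corrects the corpus-wide average, not individual mixed sentences.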
Otherwise, you will need an algorithm with more discriminating power, and that will likely mean you need a training set. You can go very deep down that rabbit hole, but I would consider starting with Naive Bayes, which essentially learns a weight for each positive and negative word and combines them in much the same way you are doing now. It has the advantage of being a simple algorithm.
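To make the "weight per word" idea concrete, here is a minimal sketch of a two-class multinomial Naive Bayes with add-one smoothing, written so that each word ends up with a single log-odds weight. Everything here is an assumption for illustration: the names `train_naive_bayes` and `nb_score`, the 'pos'/'neg' labels, and the four toy training documents. It also assumes both classes appear in the training data and simply ignores words it has never seen.

```python
import math
from collections import Counter


def train_naive_bayes(labeled_docs, alpha=1.0):
    """Learn one log-odds weight per word from (tokens, label) pairs.

    label must be 'pos' or 'neg'; alpha is the add-alpha smoothing constant.
    """
    word_counts = {"pos": Counter(), "neg": Counter()}
    doc_counts = Counter()
    for tokens, label in labeled_docs:
        word_counts[label].update(tokens)
        doc_counts[label] += 1

    vocab = set(word_counts["pos"]) | set(word_counts["neg"])
    totals = {
        c: sum(word_counts[c].values()) + alpha * len(vocab)
        for c in word_counts
    }
    # weight(w) = log P(w | pos) - log P(w | neg), both smoothed
    weights = {
        w: math.log((word_counts["pos"][w] + alpha) / totals["pos"])
        - math.log((word_counts["neg"][w] + alpha) / totals["neg"])
        for w in vocab
    }
    # Prior log-odds; requires at least one document of each class.
    bias = math.log(doc_counts["pos"] / doc_counts["neg"])
    return weights, bias


def nb_score(tokens, weights, bias):
    """Sum of per-word weights plus the prior; > 0 leans positive."""
    return bias + sum(weights.get(t, 0.0) for t in tokens)


# Hypothetical labeled documents, just to exercise the functions.
train = [
    (["strong", "growth"], "pos"),
    (["strong", "profit"], "pos"),
    (["weak", "sales"], "neg"),
    (["weak", "loss"], "neg"),
]
weights, bias = train_naive_bayes(train)
```

Scoring is then the same kind of weighted sum as before (`nb_score(tokens, weights, bias)`), except that the weights come from labeled data instead of being set by hand.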