Hacker News
by ai_maker 4135 days ago
Do you have gold-standard labels for your dataset? Can you ensure that the number of pos/neg labels is balanced?

You can heuristically tune the weights of your lexicon to fit your intuition, but you need labeled evidence to make real progress.

If you find that the number of examples per class is unbalanced, use an effectiveness score that is robust to imbalance, like the F-measure, to get a fair estimate of your system's performance.
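To illustrate why the F-measure is fairer than plain accuracy on skewed data, here is a minimal sketch. The labels, the 9-to-1 split, and the "always negative" system are made up for the example; the helper names are hypothetical, not from any library.

```python
def precision_recall_f1(gold, predicted, positive="pos"):
    """Precision, recall and F1 for one class, computed from label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def accuracy(gold, predicted):
    """Fraction of documents whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Hypothetical unbalanced test set: 9 negative documents, 1 positive.
gold = ["neg"] * 9 + ["pos"]
# A degenerate system that answers "neg" for everything.
always_neg = ["neg"] * 10

acc = accuracy(gold, always_neg)
precision, recall, f1 = precision_recall_f1(gold, always_neg, positive="pos")
print(f"accuracy={acc:.2f}  F1(pos)={f1:.2f}")
```

Accuracy rewards the degenerate system for the skew in the data, while the F-measure on the minority class exposes that it never finds a single positive document.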

1 comments

This is part of my problem - I don't have a labeled dataset outside of my 'positive words' / 'negative words' lists.

I don't think asymmetrical test sets would be a problem if I had training data for documents, since you can reweight to compensate. My problem seems to be that the bigger 'negative word list' over-represents the universe of possible negative matches, which is introducing bias, and I'm not sure how to solve that.

Please see my reply on reweighting in this thread (if you reweight positive words to normalize the over-represented negative word count, then a neutral sentence will have a positive sentiment score).
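A small sketch of that reweighting problem, under stated assumptions: the two word lists are toy examples, the scoring rule (weighted positive hits minus weighted negative hits) is only one plausible way to score with a lexicon, and treating a sentence with one positive and one negative word as "neutral" is my own simplification.

```python
# Toy lexicon (hypothetical): the negative list is twice as long as the positive one.
POSITIVE = {"good", "great"}
NEGATIVE = {"bad", "awful", "poor", "terrible"}


def score(tokens, pos_weight=1.0, neg_weight=1.0):
    """Weighted count of positive hits minus weighted count of negative hits."""
    pos_hits = sum(1 for t in tokens if t in POSITIVE)
    neg_hits = sum(1 for t in tokens if t in NEGATIVE)
    return pos_weight * pos_hits - neg_weight * neg_hits


# Reweight positive words so both lists carry the same total weight.
balanced_pos_weight = len(NEGATIVE) / len(POSITIVE)

# One positive word and one negative word: neutral under this simplification.
mixed = "the food was good but the service was bad".split()

plain = score(mixed)
reweighted = score(mixed, pos_weight=balanced_pos_weight)
print(f"unweighted={plain}  reweighted={reweighted}")
```

With the lists normalized this way, the sentence that scored zero now comes out positive, which is the bias described above: the correction for list size leaks into sentences that matched both lists equally.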