| I see you mentioned TF-IDF as something you are planning to try. That should be interesting. The way I see it (and I may very well be slightly off point):
You have a corpus of 2000 docs and two lists, [Wpos] and [Wneg], with count[Wneg] a factor larger than count[Wpos].
If you compute a [0-1] normalized tf-idf score for each term in [Wpos] and [Wneg], and sum the scores over all words in each of those two lists, you get a score proportional to the count of positive words and of negative words.
Normalized here means using relative frequencies rather than absolute frequencies (I prefer calling the latter term counts). This takes document_word_count based normalization out of the picture and makes it implicit in the tf-idf step.
Now you have two numbers, Sum(positive normalized TF-IDFs) and Sum(negative normalized TF-IDFs), which you can individually normalize for your list sizes and then use for sentiment classification.
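Roughly, the TF-IDF sums could look like the Python below. It is only a sketch under my own assumptions: docs are already tokenized, the idf is a smoothed variant, and "[0-1] normalized" is taken to mean dividing by the largest tf-idf weight in the document. The name `tfidf_sentiment_scores` is made up for illustration.

```python
import math
from collections import Counter

def tfidf_sentiment_scores(docs, pos_words, neg_words):
    """Return one (positive_score, negative_score) pair per tokenized doc."""
    n_docs = len(docs)
    # Document frequency: number of docs in which each term appears.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    def idf(term):
        # Smoothed idf; one common variant, chosen here as an assumption.
        return math.log((1 + n_docs) / (1 + df[term])) + 1.0

    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens) or 1
        # Relative frequency (count / doc length) times idf.
        weights = {t: (c / total) * idf(t) for t, c in counts.items()}
        # Assumption: scale into [0, 1] by the largest weight in this doc.
        peak = max(weights.values(), default=0.0) or 1.0

        def list_score(words):
            # Sum over the word list, then normalize for the list size.
            summed = sum(weights.get(w, 0.0) / peak for w in words)
            return summed / max(len(words), 1)

        scores.append((list_score(pos_words), list_score(neg_words)))
    return scores
```

Dividing by the list size is what keeps the longer [Wneg] list from winning just because it has more words in it.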
The TF-IDF route is a dirty hack, and somewhat inefficient if you don't maintain a reverse (inverted) index. A second approach could be this:
Use your word lists, both positive and negative, to do an Okapi BM25 scoring against your docs, using each list as the query set.
So you would get a BM25 score for your docs, and you can use that to define sentiments.
Corpus = D
Di = document in the corpus you want to classify
Query1 = {set of positive words}
Query2 = {set of negative words}
PositiveScore = BM25(Query1, Di)
NegativeScore = BM25(Query2, Di)
Some combination to do classification:
if PositiveScore > NegativeScore:
    call it positive!
Just a thought.
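The BM25 idea above can be sketched in plain Python as follows. Treat it as an illustration, not a reference implementation: the function names `bm25_scores` and `classify` are made up, docs are assumed to be pre-tokenized, the defaults k1=1.5 and b=0.75 are just commonly used starting values, and the idf uses the variant with +1 inside the log so it never goes negative.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized doc against a bag of query terms."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0
    # Document frequency: number of docs in which each term appears.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    scores = []
    for tokens in docs:
        tf = Counter(tokens)
        score = 0.0
        for term in set(query_terms):
            if tf[term] == 0:
                continue
            # Assumption: idf variant with +1 inside the log (non-negative).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # b controls length normalization: 0 turns it off, 1 is full.
            norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * norm)
        scores.append(score)
    return scores

def classify(docs, pos_words, neg_words, **bm25_params):
    """Label each doc by comparing its two BM25 scores."""
    pos = bm25_scores(pos_words, docs, **bm25_params)
    neg = bm25_scores(neg_words, docs, **bm25_params)
    # Ties fall to "negative" here, which is an arbitrary choice.
    return ["positive" if p > n else "negative" for p, n in zip(pos, neg)]
```

Something like `classify(docs, pos_words, neg_words, b=0.5)` would then soften the length normalization. A plain greater-than comparison is the simplest combination; a margin or a ratio of the two scores may work better when the lists differ a lot in size.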
BM25 has some flexibility in tuning for length normalization (its b parameter); check the footnote [1].
PS: There is the British National Corpus too, for word frequencies :)
[1] BM25 and normalizations:
http://nlp.stanford.edu/IR-book/html/htmledition/okapi-bm25-... |