by wiresurfer 4143 days ago
I see you have mentioned TF-IDF as something which you are planning to try. That should be interesting.

The way I see it (and I may very well be slightly off point), you have a corpus of 2000 docs and 2 word lists, [Wpos] and [Wneg], with count[Wneg] a factor larger than count[Wpos].

If you compute a [0-1] normalized tf-idf score for each term in [Wpos] and [Wneg] and sum the scores within each set, you get two scores proportional to the counts of positive and negative words. Normalized here means using relative frequencies rather than absolute frequencies. [I prefer calling the latter term counts]

This takes document_word_count based normalization out of the picture and makes it implicit in the tf-idf step.

Now you have two numbers, Sum(Positive Normalized TF-IDFs) and Sum(Negative Normalized TF-IDFs), which you can individually normalize for your list sizes and then use for sentiment classification. It's a dirty hack, and somewhat inefficient if you don't maintain a reverse index.
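A rough sketch of that first approach, in case it helps. Everything here is my own assumption rather than a reference implementation: the function name `tfidf_sentiment_sums`, whitespace tokenization, and the particular smoothed idf divided by its maximum so each term score lands in [0, 1].

```python
import math
from collections import Counter

def tfidf_sentiment_sums(docs, pos_words, neg_words):
    """Return one (pos_score, neg_score) pair per document.

    Each score is the sum of normalized tf-idf values over a word list,
    divided by the size of that list.
    """
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)

    # Document frequency: in how many docs does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    def idf(term):
        # Smoothed idf, divided by its largest possible value so it stays in [0, 1].
        return math.log((1 + n_docs) / (1 + df[term])) / math.log(1 + n_docs)

    results = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on empty docs
        # Relative frequency (not raw term count) times idf, summed per list.
        pos = sum(counts[w] / total * idf(w) for w in pos_words)
        neg = sum(counts[w] / total * idf(w) for w in neg_words)
        # Normalize each sum by the size of its word list.
        results.append((pos / len(pos_words), neg / len(neg_words)))
    return results
```

This recomputes everything on every call. With a reverse index you would only touch the postings for the words in the two lists.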

A second approach could be this. Use your word lists, both positive and negative, as query sets for Okapi BM25 scoring against your docs. You would get one BM25 score per list for each doc, and you can use those to define sentiment.

Corpus = D
Di = document in the corpus you want to classify
Query1 = {set of positive words}
Query2 = {set of negative words}

PositiveScore = BM25(Query1, Di)
NegativeScore = BM25(Query2, Di)

Then use some combination of the two to classify: if PositiveScore > NegativeScore, call it positive!
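Here is a minimal sketch of the BM25 idea, under several assumptions of mine: the helper names `bm25_score` and `classify`, whitespace tokenization, the defaults k1=1.5 and b=0.75, a non-negative idf variant, and a "neutral" label for ties (the rule above only says what to do when the positive score wins).

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_tokens, doc_freq, n_docs, avg_len, k1=1.5, b=0.75):
    """Okapi BM25 score of one document against a bag of query terms."""
    counts = Counter(doc_tokens)
    # Length normalization: b=0 ignores document length, b=1 applies it fully.
    norm = k1 * (1 - b + b * len(doc_tokens) / avg_len)
    score = 0.0
    for term in query_terms:
        tf = counts[term]
        if tf == 0:
            continue  # terms absent from the doc contribute nothing
        # Non-negative idf variant (assumption); plain log(N / df) is another option.
        idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
        score += idf * tf * (k1 + 1) / (tf + norm)
    return score

def classify(docs, pos_words, neg_words):
    """Label each doc by comparing BM25(Query1, Di) with BM25(Query2, Di)."""
    tokenized = [d.lower().split() for d in docs]  # naive tokenizer (assumption)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)

    labels = []
    for tokens in tokenized:
        pos = bm25_score(pos_words, tokens, doc_freq, len(docs), avg_len)
        neg = bm25_score(neg_words, tokens, doc_freq, len(docs), avg_len)
        if pos > neg:
            labels.append("positive")
        elif neg > pos:
            labels.append("negative")
        else:
            labels.append("neutral")  # tie handling is my own choice
    return labels
```

Since your negative list is much larger than the positive one, you may want to divide each score by its list size before comparing, same as in the tf-idf version.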

Just a thought. BM25 has some flexibility in tuning its length normalization. See footnote [1].
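For reference, this is the basic BM25 form as I understand it from the IR book chapter in the footnote, written for a word list q used as the query and a document d. N is the number of docs, df_t the number of docs containing term t, tf_{t,d} the count of t in d, L_d the length of d, and L_ave the average doc length. The tuning knobs are k_1 and b.

```latex
\mathrm{BM25}(q, d) = \sum_{t \in q} \log\!\left(\frac{N}{\mathrm{df}_t}\right)
\cdot \frac{(k_1 + 1)\,\mathrm{tf}_{t,d}}
{k_1\left((1 - b) + b\,\dfrac{L_d}{L_{\mathrm{ave}}}\right) + \mathrm{tf}_{t,d}}
```

Setting b = 0 switches length normalization off and b = 1 scales term frequency fully by document length, so b is the parameter to play with if long reviews end up dominating either score.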

PS: There is the British National Corpus too for word frequencies :)

[1] BM25 and normalizations. http://nlp.stanford.edu/IR-book/html/htmledition/okapi-bm25-...

1 comments

Wow, thank you so much for pointing out BM25 - hadn't heard of it but looks very cool. Implementing it ASAP.