by mystique 4132 days ago
We had to do something similar in a real-world project where we had labels but were not sure how accurate they were.

We used a technique similar to LSA.

Our first step was to build a bag of words and construct a scaled TF (term frequency) matrix from it. Then we verified the labels for about 10% of the data and used that as our training set. Using cosine similarity (which we calculated by matrix multiplication of the TF matrices), we found the top n labeled documents most similar to each document in question and used them to decide the labels of the remaining 90% of the documents.

Once we had this dataset we also ran logistic regression on it, training on the same 10% and using the model to predict the labels of the rest. Interestingly, document similarity was only slightly better than logistic regression, and logistic regression was 10 times faster.
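
For comparison, here is a minimal sketch of the logistic regression side, written with plain NumPy gradient descent. This is not our actual setup: the tiny count matrix, the learning rate, the number of epochs and the function names are all made up for illustration.

```python
import numpy as np


def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit binary logistic regression with batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probability of label 1
        grad = p - y                             # gradient of the log loss w.r.t. the logits
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def predict(X, w, b):
    """Return 1 where the model favours label 1, else 0."""
    return (X @ w + b > 0).astype(int)


# Hypothetical term counts over the vocabulary ["buy", "cheap", "meeting", "monday"].
X_train = np.array([[1, 1, 0, 0],
                    [1, 0, 0, 0],
                    [0, 0, 1, 1],
                    [0, 0, 0, 1]], dtype=float)
y_train = np.array([1, 1, 0, 0], dtype=float)   # the verified 10%
X_rest = np.array([[0, 1, 0, 0],
                   [0, 0, 1, 0]], dtype=float)   # the remaining documents

w, b = train_logistic(X_train, y_train)
rest_labels = predict(X_rest, w, b)
print(rest_labels)
```

Once trained, scoring a document is one dot product with the weight vector instead of a comparison against every labeled document, which is one plausible reason for the kind of speed gap we saw.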

I think this approach worked for us because the two labels had somewhat mutually exclusive sets of words. This may not work in sentiment analysis, where the same word can have different meanings depending on the surrounding words. Building N-grams and then computing TF on them might help in that case.
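
To make the N-gram idea concrete, here is a small sketch using word bigrams. We did not try this ourselves; the function names and the choice to divide counts by the number of N-grams are assumptions for illustration.

```python
from collections import Counter


def word_ngrams(text, n=2):
    """Split text into overlapping word n-grams, e.g. bigrams for n=2."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_tf(text, n=2):
    """Relative frequency of each n-gram in the text (assumed TF scaling)."""
    grams = word_ngrams(text, n)
    total = len(grams) or 1  # avoid dividing by zero on very short texts
    return {gram: count / total for gram, count in Counter(grams).items()}


negative = ngram_tf("this is not good")
positive = ngram_tf("this is very good")
print(sorted(negative))
print(sorted(positive))
```

Both sentences contain the word "good", so unigram TF sees them as nearly identical, while the bigram features "not good" and "very good" keep them apart.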