by mystique
4132 days ago
We had to do something similar on a real-world project where we had labels but weren't sure they were accurate. We used a technique similar to LSA. Our first step was to build a bag of words and construct a scaled TF matrix from it. Then we manually verified the labels for about 10% of the data and used that as our training set. Using cosine similarity (which we computed as a matrix multiplication of the TF matrices), we found the top n labeled documents most similar to the document in question and used them to decide the label for each of the remaining 90% of documents. Once we had this dataset we also ran it through logistic regression, training on the same 10% and predicting labels for the rest. Interestingly, document similarity was only slightly better than logistic regression, and logistic regression was 10 times faster. I think this approach worked for us because each label had a more or less mutually exclusive set of words. It may not work for sentiment analysis, where the same word can have different meanings depending on the surrounding words. Building n-grams and then computing TF on those might help in that case.
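For anyone who wants to try the similarity step, here is a rough sketch of it in plain Python. It is not our actual code: the whitespace tokenizer, the L2 scaling of the TF rows, the majority vote over the top n matches, and the toy documents are all assumptions made for illustration. Logistic regression is left out.

```python
import math
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to a column index (the bag of words)."""
    vocab = {}
    for doc in docs:
        for tok in doc.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def scaled_tf(doc, vocab):
    """TF row scaled to unit L2 length, so a dot product is a cosine similarity."""
    counts = Counter(tok for tok in doc.lower().split() if tok in vocab)
    row = [0.0] * len(vocab)
    for tok, c in counts.items():
        row[vocab[tok]] = float(c)
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row

def similarity_matrix(a, b):
    """A times B-transpose for unit-length rows: entry [i][j] is cos(a_i, b_j)."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]

def predict_top_n(labeled_docs, labels, unlabeled_docs, n=3):
    """Label each unlabeled doc by majority vote of its n most similar labeled docs."""
    vocab = build_vocab(labeled_docs + unlabeled_docs)
    train = [scaled_tf(d, vocab) for d in labeled_docs]
    test = [scaled_tf(d, vocab) for d in unlabeled_docs]
    preds = []
    for row in similarity_matrix(test, train):
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:n]
        votes = Counter(labels[j] for j in top)
        preds.append(votes.most_common(1)[0][0])
    return preds

# Hypothetical toy data standing in for the verified 10% and the remaining 90%.
labeled = [
    "refund invoice payment overdue",
    "invoice payment receipt",
    "payment refund charge",
    "server crash error log",
    "error timeout server restart",
    "crash log server trace",
]
labels = ["billing", "billing", "billing", "tech", "tech", "tech"]
unlabeled = ["refund for overdue invoice", "server error after restart"]

predictions = predict_top_n(labeled, labels, unlabeled, n=3)
print(predictions)  # ['billing', 'tech']
```

On real data you would want sparse matrices and a proper tokenizer. Swapping the whitespace split for n-gram tokens would be one way to try the n-gram idea.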