by minimaxir 3353 days ago
A quick note about using natural language/sentiment APIs: trained machine learning models must be used apples-to-apples on similar datasets. For example, you can't accurately perform sentiment analysis on a Twitter dataset using a model trained on professional movie reviews, since tweets do not follow AP Style guidelines. (For some reason, training Python's NLTK on the IMDb movie review dataset to predict the sentiment of Donald Trump's tweets is an oddly common Hello World, even though the results are misleading and may cause confirmation bias.)
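
One rough way to see the mismatch before trusting any model is to check how much of the target text's vocabulary the training text even covers. The snippet below is only a minimal sketch of that idea: the `tokenize` and `oov_rate` helpers are hypothetical names, and the example sentences are made up for illustration, not drawn from IMDb or Twitter.

```python
import re

def tokenize(text):
    """Lowercase and split into word-like tokens, keeping #hashtags and @mentions."""
    return re.findall(r"[#@]?\w+", text.lower())

def oov_rate(train_texts, target_texts):
    """Fraction of target tokens that never appear in the training texts."""
    vocab = {tok for text in train_texts for tok in tokenize(text)}
    target_tokens = [tok for text in target_texts for tok in tokenize(text)]
    if not target_tokens:
        return 0.0
    unseen = sum(1 for tok in target_tokens if tok not in vocab)
    return unseen / len(target_tokens)

# Toy, invented examples standing in for the two domains (assumption, not real data).
reviews = [
    "The film is a beautifully shot and well acted drama.",
    "A dull, overlong film with a weak script.",
]
tweets = [
    "lol this is sooo bad smh #fail",
    "omg best thing ever!!! @friend u need 2 see this",
]

print(f"in-domain OOV rate:    {oov_rate(reviews, reviews):.2f}")
print(f"cross-domain OOV rate: {oov_rate(reviews, tweets):.2f}")
```

A high cross-domain rate does not prove the model will fail, but it is a cheap warning sign that most of what the model learned has nothing to grab onto in the new text.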

Reddit comments are very idiosyncratic, and in this particular case, even more so than usual. As a result, I am skeptical of treating the output of such APIs as gospel, even from one trained on massive datasets. (However, training a model on a Reddit-only dataset might be interesting, and is an idea I have in the pipeline.)

Last year, spaCy trained a model, sense2vec, on the Reddit dataset and got interesting results: https://explosion.ai/blog/sense2vec-with-spacy

2 comments

This is a complicated problem, and I think it is best thought of as a type of overfitting rather than a complete mistake. The dependent (output) variable, sentiment, does have an obvious generalisation from movies to politicians, unlike, for example, cinematography quality or trustworthiness. You are also overfitting in this sense when you test movie sentiment in the 2010s with a model trained on reviews from the 90s, as the concept of sentiment might have shifted if you look at it in that much detail.

(I don't disagree with anything you wrote, just expanding.)

The good news is that as long as the training data is known to be accurate (basically human-prepared), you can use a relatively tiny amount of it for very good results on huge datasets.
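
To make "a tiny amount of human-prepared data" concrete, here is a minimal sketch of a from-scratch multinomial Naive Bayes classifier trained on four hand-labeled, in-domain lines. `TinyNaiveBayes` is a hypothetical name and the labeled lines are invented; four toy examples say nothing about real-world accuracy, they only show the shape of the workflow.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into word-like tokens, keeping #hashtags and @mentions."""
    return re.findall(r"[#@]?\w+", text.lower())

class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)  # log prior
            for tok in tokenize(text):
                if tok in self.vocab:  # skip words never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hand-labeled, in-domain toy examples (invented for illustration).
labeled = [
    ("lol this is great, love it", "pos"),
    ("best thread ever, so good", "pos"),
    ("smh this is awful", "neg"),
    ("worst take ever, so bad", "neg"),
]
model = TinyNaiveBayes().fit(*zip(*labeled))

print(model.predict("love this, so great"))
print(model.predict("awful, just bad"))
```

In practice you would hold out some of the labeled lines to measure accuracy before running the model over the full dataset.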