by minimaxir
3353 days ago
A quick note about using natural language/sentiment APIs: a trained machine learning model only works apples-to-apples, on data similar to what it was trained on. For example, you can't accurately perform Twitter sentiment analysis using a model trained on professional movie reviews, since tweets do not follow AP Style guidelines. (For some reason, training Python's NLTK on the IMDb movie review dataset to predict the sentiment of Donald Trump's tweets is an oddly common Hello World, even though the results are misleading and may feed confirmation bias.) Reddit comments are very idiosyncratic, and in this particular case, even more so than usual. As a result, I am skeptical of treating the output of such APIs as gospel, even ones trained on massive datasets. (However, training a model on a Reddit-only dataset might be interesting, and is an idea I have in the pipeline.) Last year, spaCy trained a model, sense2vec, on the Reddit dataset and got interesting results: https://explosion.ai/blog/sense2vec-with-spacy
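To make the apples-to-apples point concrete, here is a rough sketch that measures how much of a target text's vocabulary a model would never have seen in training. The sample sentences, the `tokenize` helper, and the `oov_rate` function are all made up for illustration. They are not from NLTK, IMDb, or any real dataset, and the out-of-vocabulary rate is only a crude proxy for domain mismatch, not a measure of model accuracy.

```python
import re


def tokenize(text):
    """Lowercase and split into word-like tokens, keeping # and @ prefixes."""
    return re.findall(r"[#@]?[a-z0-9']+", text.lower())


def oov_rate(train_texts, target_texts):
    """Fraction of target tokens that never appear in the training texts."""
    vocab = {tok for text in train_texts for tok in tokenize(text)}
    target_tokens = [tok for text in target_texts for tok in tokenize(text)]
    if not target_tokens:
        return 0.0
    unseen = sum(1 for tok in target_tokens if tok not in vocab)
    return unseen / len(target_tokens)


# Hypothetical toy data: formal review prose vs. tweet-style text.
reviews = [
    "The film is a moving and beautifully acted drama.",
    "The plot is dull and the acting is wooden.",
]
more_reviews = ["The acting is moving and the plot is a drama."]
tweets = [
    "lol this movie slaps tbh #mustwatch",
    "smh worst film evr @studio",
]

in_domain = oov_rate(reviews, more_reviews)
cross_domain = oov_rate(reviews, tweets)
print(f"in-domain OOV rate:    {in_domain:.2f}")
print(f"cross-domain OOV rate: {cross_domain:.2f}")
```

On this toy data, the held-out review reuses the training vocabulary entirely, while almost every token in the tweets is unseen. A classifier whose features are word counts has very little to work with in the second case, whatever label it ends up emitting.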
(I don't disagree with anything you wrote, just expanding.)