Yep, although in some cases understandably. One tweet listed as ‘Not good’ had the text ‘Killed it! 🔫’ and a location given as a comedy club. I’m assuming someone had had a good gig, but I’m not surprised a classifier algorithm got that wrong.
Some cases are just difficult, but the overall accuracy could probably be improved considerably if the sentiment analyzer were calibrated for the domain. AFINN (linked above) is calibrated from newspaper articles, which almost certainly have a different distribution of word/sentiment correlations than tweets do. It's not hard for me to imagine that "killed" is a better predictor of negative sentiment when classifying news articles.
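To make the failure concrete, here is a minimal sketch of AFINN-style scoring. It assumes the score is simply the sum of per-word weights. The tiny lexicon and its weights are hypothetical stand-ins, not values copied from the real AFINN list. Because each word is scored on its own, "Killed it!" can only pick up whatever weight "killed" carries in general usage:

```typescript
// Hypothetical mini-lexicon in the AFINN style (integer weights, -5..+5).
// These weights are illustrative assumptions, not the real AFINN values.
const lexicon = new Map<string, number>([
  ["killed", -3],
  ["bad", -3],
  ["good", 3],
  ["great", 3],
]);

// Lowercase, replace anything that isn't a letter, digit, apostrophe or
// whitespace (punctuation, emoji) with a space, then split into words.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9'\s]/g, " ")
    .split(/\s+/)
    .filter((t) => t.length > 0);
}

// Sum the weight of every token found in the lexicon; unknown tokens add 0.
function scoreUnigrams(text: string, lex: Map<string, number>): number {
  let total = 0;
  for (const token of tokenize(text)) {
    total += lex.get(token) ?? 0;
  }
  return total;
}

console.log(scoreUnigrams("Killed it! 🔫", lexicon)); // -3
```

A `Map` is used instead of a plain object so that tokens such as "constructor" can't accidentally match inherited object properties.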
Sentiment.js (which is what we used in devwax, and I'm guessing the same is used here) is just AFINN-based sentiment scoring (https://www.npmjs.org/package/sentiment), and you can customize it. In our case we added entries like {"barreling": 2}. A bigram/trigram-based approach would be better, so you could score phrases like "Going Off" and "Killed It" as units, but I'm not sure whether there is a JS library that does that.
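A rough sketch of the bigram/trigram idea, written standalone rather than against the sentiment package's API: lexicon keys may be single words or space-separated phrases of up to three words, and the scorer does a greedy longest match so a matched phrase consumes its words. All weights, including the phrase entries, are hypothetical assumptions chosen for illustration:

```typescript
// Hypothetical lexicon: keys are words or phrases of up to three words.
// All weights are illustrative assumptions.
const lexicon = new Map<string, number>([
  ["killed", -3],
  ["good", 3],
  ["barreling", 2],  // domain-specific addition, like the custom entry above
  ["killed it", 3],  // bigram that overrides the negative unigram
  ["going off", 2],
]);

// Lowercase, turn punctuation and emoji into spaces, split into words.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9'\s]/g, " ")
    .split(/\s+/)
    .filter((t) => t.length > 0);
}

// Greedy longest match: at each position try the longest n-gram first
// (up to maxN words), then shorter ones. A matched phrase consumes all of
// its tokens, so "killed it" is not also scored as "killed".
function scorePhrases(text: string, lex: Map<string, number>, maxN = 3): number {
  const tokens = tokenize(text);
  let total = 0;
  let i = 0;
  while (i < tokens.length) {
    let step = 1; // advance one token if nothing matches here
    for (let n = Math.min(maxN, tokens.length - i); n >= 1; n--) {
      const weight = lex.get(tokens.slice(i, i + n).join(" "));
      if (weight !== undefined) {
        total += weight;
        step = n;
        break;
      }
    }
    i += step;
  }
  return total;
}

console.log(scorePhrases("Killed it! 🔫", lexicon)); // 3
console.log(scorePhrases("Three people were killed", lexicon)); // -3
```

Longest match is one design choice among several; summing every matching n-gram instead would double-count "killed" inside "killed it", which is exactly the error the phrase entry is meant to fix.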