| Great point. 1. I wish everything were made from fairy dust. How awesome would that be? :) 2. "Hate" is definitely hard to quantify. It is quite difficult to map words to their intentions and get it right consistently, especially within the proper context. So difficult that people set up Kaggle competitions on exactly this. I actually got my "magical" training data from a competition that paid out $10k, which I explained in the article, but here it is again: https://www.kaggle.com/c/detecting-insults-in-social-comment... They did a great job building a baseline training data set to evaluate several different models on, all of which are briefly explained or at least shown in code in the article. What "hate" actually means here is the probability that a comment is considered insulting. The "hater score" is just the average of the insult probabilities of the most recent (or oldest, depending on your settings) comments. 3. I read and looked at several attempts to build something similar by various data scientists who were kind enough to share their findings, including a huge contributor to scikit-learn (https://github.com/amueller). 4. Taking out quoted text would be a great feature to add. I have about 5 or 6 new features I will probably add to see if the model works any better for them. Thanks for the suggestion (another person suggested the same thing). :) 5. This was just to see how well "sprinkled algorithms" and magical coding work in the wild world of actual comments. I love learning and improving my knowledge base with actual experience, so I figured why not build something and see what happens. :) |
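
To make the "hater score" averaging concrete, here is a minimal sketch in Python. It is not the code from the article: the function name `hater_score`, the parameters `n` and `newest_first`, and the assumption that the per-comment insult probabilities arrive as a list ordered newest first are all hypothetical, chosen only for illustration.

```python
from statistics import mean


def hater_score(insult_probs, n=10, newest_first=True):
    """Average the insult probabilities of up to n comments.

    insult_probs: per-comment probabilities in [0, 1], assumed to be
                  ordered newest first (an assumption of this sketch).
    n:            how many comments to include (hypothetical setting).
    newest_first: True averages the n most recent comments,
                  False averages the n oldest ones.
    """
    if not insult_probs:
        raise ValueError("need at least one comment probability")
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slice from the front for the newest comments, from the back for the oldest.
    selected = insult_probs[:n] if newest_first else insult_probs[-n:]
    return mean(selected)


# Example with made-up probabilities, newest comment first.
probs = [0.75, 0.25, 0.5, 1.0]
recent_score = hater_score(probs, n=2)                      # averages 0.75 and 0.25
oldest_score = hater_score(probs, n=2, newest_first=False)  # averages 0.5 and 1.0
```

In a sketch like this the probabilities themselves would come from whatever classifier was trained on the Kaggle data; the averaging step is independent of which model produced them.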