@krapp there were some challenges building a good model, but you can download the whole repo, pull out just the machine learning part, and see what I did. I have it all commented in an IPython Notebook. :)
The Repo: https://github.com/kevinmcalear/hater_news
It basically tokenizes words with scikit-learn's CountVectorizer and adds some extra features of my own: "bad words", the ratio of bad words to total words used, speaking in all CAPS, and a few others. I then feed those features into logistic regression to predict the likelihood that a specific comment is insulting, and average all of a user's comments into one score.
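A rough sketch of the pipeline described above, not the actual code from the repo. It assumes scikit-learn, SciPy, and NumPy are installed. The bad-word list, the three extra features, the toy training comments, and the helper names (`extra_features`, `build_matrix`, `user_scores`) are all made up for illustration; the real feature set and training data are in the notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical word list; the project's real list is not reproduced here.
BAD_WORDS = {"idiot", "stupid", "moron", "loser"}


def extra_features(comment):
    """Return [bad-word count, bad-word ratio, all-caps word ratio]."""
    words = comment.split()
    if not words:
        return [0.0, 0.0, 0.0]
    bad = sum(w.strip(".,!?").lower() in BAD_WORDS for w in words)
    # Ignore one-letter words such as "I" when counting shouting.
    caps = sum(w.isupper() and len(w) > 1 for w in words)
    return [float(bad), bad / len(words), caps / len(words)]


def build_matrix(vectorizer, comments, fit=False):
    """Stack CountVectorizer token counts with the hand-made features."""
    if fit:
        counts = vectorizer.fit_transform(comments)
    else:
        counts = vectorizer.transform(comments)
    extras = csr_matrix(np.array([extra_features(c) for c in comments]))
    return hstack([counts, extras]).tocsr()


# Toy training data (invented): 1 = insulting, 0 = not insulting.
train_comments = [
    "you are a stupid idiot",
    "what a MORON, go away",
    "only a loser would write this",
    "thanks for sharing, really helpful",
    "interesting approach, nice work",
    "I had the same problem last week",
]
train_labels = [1, 1, 1, 0, 0, 0]

vectorizer = CountVectorizer()
model = LogisticRegression()
model.fit(build_matrix(vectorizer, train_comments, fit=True), train_labels)


def user_scores(comments_by_user):
    """Average each user's per-comment insult probabilities into one score."""
    scores = {}
    for user, comments in comments_by_user.items():
        features = build_matrix(vectorizer, comments)
        # Column 1 of predict_proba is the probability of label 1 (insulting).
        probs = model.predict_proba(features)[:, 1]
        scores[user] = float(probs.mean())
    return scores


print(user_scores({
    "alice": ["Thanks, great write-up"],
    "bob": ["you are an IDIOT", "nice post"],
}))
```

Averaging probabilities rather than hard 0/1 predictions keeps the per-user score continuous, so one borderline comment nudges the score instead of flipping it.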
I used training data from a Kaggle competition and was able to score close to the level of the winners, but it will definitely improve as I keep working on it.