| You're spot on. This model is based on a dictionary of scored words! For the curious, you can see the dictionary here: https://github.com/sloria/TextBlob/blob/eb08c120d364e9086467... The package we used is a pretty popular one called TextBlob. It is nifty for working with unlabeled data like the HackerNews dataset we have. We really focused our definition of saltiness on a combination of (subjective + negative) comments. We reduced the impact of (objective + negative) comments, as we feel that criticism, while at times painful, isn't necessarily salty if it is presented objectively. We built this model fast (1 week) and have spent this week iterating on it by developing a fine-tuned BERT model that we are training over a much broader set of toxicity, demographic, and polarity features. The training set is much larger and higher quality, so we are expecting a large jump in precision upon deployment. I hope the app gave you some good chuckles as you went around, though. It's hard to explain how excited I felt when I saw pg_is_a_butt at the top of my pandas data frame the first time I processed the data. It's doing a little bit right. :) |
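For anyone who wants to play with the (subjective + negative) idea, here is a rough sketch of how such a score could be put together. It is not the formula the app actually uses. The names `saltiness`, `score_text`, and `objective_weight`, along with the 0.25 default, are made up for illustration. The only real API it relies on is TextBlob's `.sentiment` property, which returns a polarity in [-1, 1] and a subjectivity in [0, 1].

```python
def saltiness(polarity: float, subjectivity: float,
              objective_weight: float = 0.25) -> float:
    """Hypothetical salt score in [0, 1]; an assumption, not the app's formula.

    polarity:     -1 (negative) to 1 (positive), as TextBlob reports it
    subjectivity:  0 (objective) to 1 (subjective), as TextBlob reports it
    """
    # Only negative comments can be salty; positive ones score zero.
    negativity = max(0.0, -polarity)
    # Objective negativity keeps a reduced share (objective_weight);
    # the remaining share scales up with subjectivity.
    weight = objective_weight + (1.0 - objective_weight) * subjectivity
    return negativity * weight


def score_text(text: str) -> float:
    """Score raw text with TextBlob's dictionary-based sentiment."""
    # Imported lazily so saltiness() works without TextBlob installed.
    from textblob import TextBlob  # pip install textblob

    sentiment = TextBlob(text).sentiment
    return saltiness(sentiment.polarity, sentiment.subjectivity)
```

With this weighting, a fully negative and fully subjective comment scores 1.0, while an equally negative but fully objective one scores only 0.25, which is one way to keep plain criticism from ranking as salt. Averaging `score_text` over each user's comments in a pandas data frame would then give a per-user ranking.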