by zorkian
2160 days ago
You're definitely right -- this is an issue. I could very well believe that we tripped some FB spam measures. Our anti-spam process right now is very manual: it relies on humans to detect spam and act on it. We have a couple of very dedicated folks who end up looking every few hours, but it's not automated, and we don't have full timezone coverage. It's definitely something I'd like to see us improve, but we've been focused on other projects (like switching from mid-90s HTML to a responsive design, which is a slow rewrite of the entire site). That said, if you have any advice on reasonably scalable ways of doing this in-house that don't involve sending our user content to a third party, I'd love to hear any recommendations! Feel free to email me, mark@dreamwidth.org, if you would rather do that. And if not, don't worry about it, I appreciate the comment anyway :)
|
The nice thing about this is that it's pretty computationally light and straightforward to implement in any language. I have no clue what your stack is, but if your backend is Python, scikit-learn (sklearn) is a good library with a naive Bayes classifier (plus a lot of other, better options). I'd automatically flag any post with a high probability of being spam and remove it by default, with the option for the user to ask for manual review. The main thing you'd need for this or any fancier approach is a dataset of spam/non-spam posts. If you have an easy way of retrieving past posts that were labelled spam, that should let you build a fine dataset. If you don't want to train on your own user posts (although the only information kept here is word counts), you can look online for spam datasets and train your classifier on one of those.
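
To give a rough idea of how little code this takes, here's a minimal sketch of a word-count naive Bayes filter using only the Python standard library. Everything in it is an assumption for illustration: the class name `NaiveBayesSpamFilter`, the "spam"/"ham" labels, the six toy training posts, and the 0.9 threshold are all made up, and a real setup would train on your own labelled posts. If you do have sklearn available, its `MultinomialNB` classifier covers the same ground with less code.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesSpamFilter:
    """Multinomial naive Bayes with add-one (Laplace) smoothing.

    Only per-class word counts and post counts are stored, not the posts.
    Hypothetical helper for illustration; needs at least one training
    post per class before spam_probability() can be called.
    """

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}
        self.vocab = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def _log_score(self, tokens, label):
        # log P(label) + sum of log P(word | label), smoothed.
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        denom = sum(self.word_counts[label].values()) + len(self.vocab)
        for tok in tokens:
            if tok in self.vocab:  # ignore words never seen in training
                score += math.log((self.word_counts[label][tok] + 1) / denom)
        return score

    def spam_probability(self, text):
        tokens = tokenize(text)
        log_spam = self._log_score(tokens, "spam")
        log_ham = self._log_score(tokens, "ham")
        # Normalise in log space to avoid underflow on long posts.
        top = max(log_spam, log_ham)
        p_spam = math.exp(log_spam - top)
        p_ham = math.exp(log_ham - top)
        return p_spam / (p_spam + p_ham)


# Toy training data (made up); in practice, use past posts labelled by moderators.
clf = NaiveBayesSpamFilter()
for post in ["cheap pills buy now",
             "win money fast click here",
             "buy cheap watches now"]:
    clf.train(post, "spam")
for post in ["thanks for the thoughtful post",
             "my cat did something funny today",
             "here is my review of the book"]:
    clf.train(post, "ham")

SPAM_THRESHOLD = 0.9  # assumed cutoff; tune it against real data


def should_flag(text):
    """Return True if the post should be auto-flagged for removal/review."""
    return clf.spam_probability(text) >= SPAM_THRESHOLD
```

With something like this in place, `should_flag(post_text)` could run when a post is submitted, and anything flagged would go into the "removed by default, user can request manual review" queue. The threshold is the knob to be careful with: set it too low and you start removing legitimate posts.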