| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by zorkian 2160 days ago

You're definitely right -- this is an issue. I could very well believe that we tripped some FB spam measures.

We have a very manual anti-spam process right now that relies on humans to detect it and action it. We have a couple of very dedicated folks who end up looking every few hours, but it's not automated, and we don't have full timezone coverage.

It's definitely something I'd like to see us improve, but we've been focused on other projects (like switching from mid-90s HTML to a responsive design, which is a slow rewrite of the entire site). That said, if you have any advice on reasonably scalable ways of doing this in-house that don't involve sending our user content to a third party, I'd love to take any recommendations!

Feel free to email me, mark@dreamwidth.org, if you would rather do that. And if not, don't worry about it, I appreciate the comment anyway :)

1 comments

Mehdi2277 2160 days ago

The simplest spam filtering algorithm would be a naive bayes filter. It's essentially keep a count of words that appear in all posts, words that appear in spam posts, and words in non spam posts. Those counts + bayes rule will let you figure out the probability of spam given a word. It's called naive bayes because you assume each word in your post is independent of the others so probability the whole post is spam is just product of the probabilities.

The nice thing about this is it's pretty computationally light and straightforward to implement for any language. I have no clue as to your stack, but if you have python for your backend then sklearn is a good library that has a naive bayes classifier (plus a lot of other better options). Any post with a high probability of being spam, I'd automatically flag and by default just remove with the option for a user to ask for manual review. Main thing you'd need for this or any fancier approach is some dataset of spam/non spam posts. If you have an easy way of retrieving past posts that were labelled spam that should allow you to make a fine dataset. If you don't want to train on your own user posts (although only information kept is word counts here), you can look online for spam datasets and use one of those to train your classifier.

link

gus_massa 2159 days ago

I used SpamBayes a few years ago http://www.spambayes.org/ (Is the project dead now?) (It has a PSF licence https://en.wikipedia.org/wiki/Python_Software_Foundation_Lic... https://en.wikipedia.org/wiki/Comparison_of_free_and_open-so...)

The nice part is that SpamBayes gives you two numbers, the spam "probability" and the ham "probability". When one of them is very close to 1 (like > .99) and the other is very close to 0 (like <.01), there is a good chance that the message is really spam or ham. And this classify almost all the messages. But from time to time you get a message where the numbers are not so clear, or both are big or both are small, and this means the classifier is confused and you really must take a look at the message.

link

audiometry 2158 days ago

Wow when this came out (I think this was the ‘original’) it felt quite ground breaking. Perhaps early 2000s it was?

Then google started doing that or something similar at scale and effectively eliminated spam in my mailbox ever since. (With the curious recent exception of some highly similar bitcoins spams)

link