by a_p 4800 days ago
I'm surprised this post doesn't mention Markov chains. The author seems to think that finding and implementing a grammar quality checker will help stop spam. Aside from providing endless hours of entertainment (viz. DissociatedPress), Markov chains are abused by spammers to generate grammatically correct nonsense. You can easily add meaning to the "nonsense" by formatting certain words so that they carry a secondary message. Does anyone know of a way to stop this?
3 comments

You can estimate the likelihood that a particular sentence is spam by scoring it with an n-gram language model: sum the log probabilities of each n-gram within the sentence. A low total means the word sequence is unlikely under the model. The probabilities are obtained from a sufficiently general training set, such as the dataset behind Google's n-gram viewer[1]. Using a trigram language model (n = 3), you could estimate the likelihood as follows:

Sentence = "This sentence is semantically and syntactically valid."

log P(Sentence) = log(p(START,START,This)) + log(p(START,This,sentence)) + log(p(This,sentence,is)) + log(p(sentence,is,semantically)) + log(p(is,semantically,and)) + log(p(semantically,and,syntactically)) + log(p(and,syntactically,valid)) + log(p(syntactically,valid,.)) + log(p(valid,.,STOP)) + log(p(.,STOP,STOP))

where p(a,b,c) is read as the probability of seeing c right after a and b, and START and STOP are special symbols that mark the beginning and end of a sentence, so the model can tell how close a word is to either.
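
For what it's worth, here is a rough Python sketch of that trigram scoring. It is only an illustration under my own assumptions: the `TrigramModel` class, the `<s>` and `</s>` marker strings, add-one smoothing for unseen trigrams, and padding with a single STOP instead of two are all choices made for the example, and a real system would train on a far larger corpus.

```python
import math
from collections import Counter

# Assumed marker strings for the START and STOP symbols.
START, STOP = "<s>", "</s>"


def trigrams(tokens):
    """Return (w1, w2, w3) triples over a sentence padded with START, START ... STOP."""
    padded = [START, START] + list(tokens) + [STOP]
    return zip(padded, padded[1:], padded[2:])


class TrigramModel:
    """Hypothetical count-based trigram model with add-one smoothing."""

    def __init__(self, sentences):
        self.tri = Counter()   # counts of (w1, w2, w3)
        self.bi = Counter()    # counts of the (w1, w2) context
        self.vocab = {STOP}    # every symbol that can appear as w3
        for tokens in sentences:
            self.vocab.update(tokens)
            for w1, w2, w3 in trigrams(tokens):
                self.tri[(w1, w2, w3)] += 1
                self.bi[(w1, w2)] += 1

    def log_prob(self, tokens):
        """Sum of log p(w3 | w1, w2) over the padded sentence."""
        v = len(self.vocab)
        total = 0.0
        for w1, w2, w3 in trigrams(tokens):
            # Add-one smoothing keeps unseen trigrams from producing log(0).
            p = (self.tri[(w1, w2, w3)] + 1) / (self.bi[(w1, w2)] + v)
            total += math.log(p)
        return total


corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
model = TrigramModel(corpus)
seen_order = model.log_prob(["the", "cat", "sat"])
shuffled = model.log_prob(["sat", "the", "cat"])
print(seen_order > shuffled)
```

Longer sentences always get lower totals, so if you compare sentences of different lengths you would probably want to divide the score by the number of trigrams first.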

If your training set fails to sufficiently generalize, you could use Bayesian inference to estimate the likelihood that the sentence is spam. Under this framework, you'd be calculating the posterior probability of the sentence being spam given the observed sequence of n-grams, which combines (i) the prior probability that any sequence of words is spam and (ii) the likelihood of the observed sequence under the spam hypothesis, i.e. how compatible the observation is with that hypothesis.
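
A minimal sketch of that posterior calculation in Python, assuming you already have two log-likelihoods for the sentence: one from a language model trained on known spam and one from a model trained on legitimate text. That two-model setup, the function name `spam_posterior`, and the default prior of 0.5 are my assumptions for illustration, not something prescribed above.

```python
import math


def spam_posterior(loglik_spam, loglik_ham, prior_spam=0.5):
    """P(spam | sentence) via Bayes' rule, computed in log space.

    loglik_spam: log P(sentence | spam), e.g. from a spam-trained n-gram model
    loglik_ham:  log P(sentence | not spam), e.g. from a model of legitimate text
    prior_spam:  assumed prior probability that any given sentence is spam
    """
    log_joint_spam = loglik_spam + math.log(prior_spam)
    log_joint_ham = loglik_ham + math.log(1.0 - prior_spam)
    # Subtract the larger term before exponentiating so that very negative
    # log-likelihoods (long sentences) do not underflow to zero.
    m = max(log_joint_spam, log_joint_ham)
    num = math.exp(log_joint_spam - m)
    den = num + math.exp(log_joint_ham - m)
    return num / den


# The sentence is twice as likely under the spam model, with an even prior.
print(spam_posterior(math.log(0.2), math.log(0.1)))
```

Lowering `prior_spam` makes the filter more conservative: the spam model then has to explain the sentence much better than the legitimate-text model before the posterior crosses whatever threshold you pick.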

[1] http://storage.googleapis.com/books/ngrams/books/datasetsv2....

Your comment would be marked as spam using your logic. Was that intentional?
Note this is exactly how a smart spammer would generate text (sampling from a language model built on a publicly available data set like Google n-grams or Wikipedia). If you wanted to catch someone doing this, you're much better off using your own corpus to generate a language model, as a spammer would have to scrape all your data to reconstruct the same thing.

Then, run the model over your data and start playing whack-a-mole (and refining the model).

Having done some link spam (in decent volume) in the past, I'd say your best bet in stopping it is twofold. First, use more sophisticated captchas, such as the ones that have you identify a cartoon in an image or something else non-textual. Second, identify the system posting the comment or spam link and check whether it's a real user with a real browser or just an automated spambot like XRumer or ScrapeBox.

You can also turn off links in comment bodies and in the URL field of the comment form, so that scrapers don't consider you a worthy target in the first place. That won't help identify spam, though.

Finally, centralized spam identification systems like Akismet work really damn well because they watch their whole network of sites at once and can use what they see across it to identify the spammers themselves rather than the actual spam content.

No need to go that far. Just download OpenNLP and it will do the labeling quite well with almost no effort.