by law 4810 days ago

You can estimate the likelihood that a particular sentence is spam by scoring it with an n-gram language model: sum the log probabilities of each n-gram in the sentence. The probabilities come from a sufficiently general training set, such as the data behind Google's n-gram viewer[1]. A total that is low for the sentence's length suggests the sentence is unlike ordinary text. Using a trigram language model (n = 3), you could estimate the likelihood as follows:

Sentence = "This sentence is semantically and syntactically valid."

log P(Sentence) = log(p(START,START,This)) + log(p(START,This,sentence)) + log(p(This,sentence,is)) + log(p(sentence,is,semantically)) + log(p(is,semantically,and)) + log(p(semantically,and,syntactically)) + log(p(and,syntactically,valid)) + log(p(syntactically,valid,.)) + log(p(valid,.,STOP)) + log(p(.,STOP,STOP))

where START and STOP are special symbols that aid in determining the proximity of a word to the beginning and end of a sentence.
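A minimal sketch of the trigram sum above, in Python. Several things here are assumptions rather than part of the original description: each term is read as the conditional probability of the third word given the previous two, counts come from a small tokenized corpus you supply instead of the Google data, and add-one smoothing keeps unseen trigrams from producing log(0). The names `pad`, `train`, and `log_prob` are made up for illustration.

```python
import math
from collections import Counter

START, STOP = "<START>", "<STOP>"

def pad(tokens):
    # Two START and two STOP symbols, matching the ten terms in the example sum.
    return [START, START] + list(tokens) + [STOP, STOP]

def train(sentences):
    """Count trigrams and their two-word contexts in tokenized sentences."""
    trigrams, contexts, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        padded = pad(tokens)
        vocab.update(padded)
        for i in range(len(padded) - 2):
            trigrams[tuple(padded[i:i + 3])] += 1
            contexts[tuple(padded[i:i + 2])] += 1
    return trigrams, contexts, len(vocab)

def log_prob(tokens, model):
    """Sum of log p(w3 | w1, w2) over every trigram, with add-one smoothing."""
    trigrams, contexts, vocab_size = model
    padded = pad(tokens)
    total = 0.0
    for i in range(len(padded) - 2):
        w1, w2, w3 = padded[i:i + 3]
        p = (trigrams[(w1, w2, w3)] + 1) / (contexts[(w1, w2)] + vocab_size)
        total += math.log(p)
    return total

model = train([["this", "is", "fine"], ["this", "is", "good"]])
seen = log_prob(["this", "is", "fine"], model)
shuffled = log_prob(["fine", "is", "this"], model)
```

Longer sentences accumulate more negative terms, so dividing the total by the number of trigrams gives a score that is easier to compare across sentences of different lengths.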

If your training set fails to generalize sufficiently, you could use Bayesian inference to estimate the likelihood that the sentence is spam. Under this framework, you'd be calculating the posterior probability of the sentence being spam given the observed sequence of n-grams. The posterior combines (i) the prior probability that any sequence of words is spam and (ii) the likelihood of the observed sequence under the hypothesis that it is spam; the better the observed sequence fits that hypothesis, the more it shifts (i).
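One way to make the posterior concrete is a naive Bayes classifier over n-grams, sketched below in Python. This is a swapped-in technique, not something the comment specifies: it assumes the n-grams are independent given the class, that you have labeled spam and non-spam ("ham") examples, and that add-one smoothing and a prior of 0.5 are acceptable defaults. The function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n=3):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(docs, n=3):
    """Count n-grams across a list of tokenized documents."""
    counts = Counter()
    for tokens in docs:
        counts.update(ngrams(tokens, n))
    return counts

def posterior_spam(tokens, spam_counts, ham_counts, prior_spam=0.5, n=3):
    """P(spam | n-grams) under a naive Bayes independence assumption."""
    # Distinct n-grams seen in either class, plus one slot for unseen n-grams.
    vocab = len(set(spam_counts) | set(ham_counts)) + 1
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    log_spam = math.log(prior_spam)        # (i) the prior
    log_ham = math.log(1.0 - prior_spam)
    for gram in ngrams(tokens, n):         # (ii) the likelihood of each n-gram
        log_spam += math.log((spam_counts[gram] + 1) / (spam_total + vocab))
        log_ham += math.log((ham_counts[gram] + 1) / (ham_total + vocab))
    # Normalize in log space to avoid underflow on long inputs.
    top = max(log_spam, log_ham)
    spam_weight = math.exp(log_spam - top)
    ham_weight = math.exp(log_ham - top)
    return spam_weight / (spam_weight + ham_weight)

spam = count_ngrams([["buy", "cheap", "pills", "now"]], n=2)
ham = count_ngrams([["see", "you", "at", "lunch"]], n=2)
p_spammy = posterior_spam(["buy", "cheap", "pills"], spam, ham, n=2)
p_hammy = posterior_spam(["see", "you", "at"], spam, ham, n=2)
```

With no n-grams to observe (an input shorter than n), the function returns the prior unchanged, which is the behavior you would expect from the framework described above.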

[1] http://storage.googleapis.com/books/ngrams/books/datasetsv2....

2 comments

drakaal 4810 days ago

Your comment would be marked as spam using your logic. Was that intentional?

mattj 4810 days ago

Note that this is exactly how a smart spammer would generate text: by sampling from a language model built on a publicly available data set like Google n-grams or Wikipedia. If you wanted to catch someone doing this, you'd be much better off using your own corpus to build a language model, as a spammer would have to scrape all your data to reconstruct the same thing.

Then, run the model over your data and start playing whack-a-mole (and refining the model).