|
You can estimate how likely a sentence is to be spam by scoring it with an n-gram language model: sum the log probabilities of the n-grams it contains. The probabilities are obtained from a sufficiently general training set, such as the data behind Google's n-gram viewer[1]. A sentence with an unusually low score is improbable as ordinary text, which is a signal that it may be spam.

Using a trigram language model (n = 3), you could estimate the log probability as follows:

Sentence = "This sentence is semantically and syntactically valid."

log P(Sentence) = log(p(START,START,This))
                + log(p(START,This,sentence))
                + log(p(This,sentence,is))
                + log(p(sentence,is,semantically))
                + log(p(is,semantically,and))
                + log(p(semantically,and,syntactically))
                + log(p(and,syntactically,valid))
                + log(p(syntactically,valid,.))
                + log(p(valid,.,STOP))
                + log(p(.,STOP,STOP))

where p(a,b,c) is read as the probability of word c given that the two preceding words are a and b, and START and STOP are special padding symbols that let the model account for how close a word is to the beginning or end of the sentence.

If your training set fails to generalize sufficiently, you could use Bayesian inference to estimate the likelihood that the sentence is spam. Under this framework, you'd calculate the posterior probability that the sentence is spam given the observed sequence of n-grams. That posterior is proportional to the product of (i) the prior probability that any sequence of words is spam and (ii) the likelihood of the observed sequence given that it is spam.

[1] http://storage.googleapis.com/books/ngrams/books/datasetsv2.... |
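
The trigram scoring described above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the tiny corpus stands in for a real n-gram dataset, add-one smoothing is my own choice for handling unseen trigrams (the comment does not specify one), and the names `pad`, `train`, and `log_prob` are hypothetical. The padding mirrors the formula above, with two START and two STOP symbols.

```python
import math
from collections import Counter

START, STOP = "<s>", "</s>"

def pad(tokens):
    # Two START and two STOP symbols, mirroring the terms in the formula above.
    return [START, START] + list(tokens) + [STOP, STOP]

def train(corpus):
    """Count trigrams and their two-word contexts over tokenized sentences."""
    tri, ctx, vocab = Counter(), Counter(), {STOP}
    for tokens in corpus:
        padded = pad(tokens)
        vocab.update(tokens)
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            ctx[tuple(padded[i - 2:i])] += 1
    return tri, ctx, vocab

def log_prob(tokens, tri, ctx, vocab):
    """Sum of log p(c | a, b) over all trigrams of the padded sentence.

    Add-one smoothing is an assumption of this sketch; it keeps unseen
    trigrams from producing log(0).
    """
    padded = pad(tokens)
    v = len(vocab)
    total = 0.0
    for i in range(2, len(padded)):
        w = tuple(padded[i - 2:i + 1])
        total += math.log((tri[w] + 1) / (ctx[w[:2]] + v))
    return total

# Toy stand-in for a real training set.
corpus = [
    ["this", "is", "fine"],
    ["this", "is", "good"],
    ["that", "is", "fine"],
]
model = train(corpus)
print(log_prob(["this", "is", "fine"], *model))  # familiar word order
print(log_prob(["fine", "is", "this"], *model))  # scrambled word order
```

The scrambled sentence gets the lower (more negative) score. In practice you would compare scores against a threshold tuned on known spam and non-spam text, and normalize by sentence length so that long sentences are not penalized just for having more terms.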