by a_p
4800 days ago
I'm surprised this post doesn't mention Markov chains. The author seems to think that finding and implementing a grammar quality checker will help stop spam. Aside from providing endless hours of entertainment (viz. DissociatedPress), Markov chains are abused by spammers to generate grammatically correct nonsense. A spammer can easily give the "nonsense" meaning by applying formatting to certain words so that they spell out a secondary message. Does anyone know of a way to stop this?
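For anyone who hasn't seen one, here is a minimal sketch of the kind of word-level Markov chain generator being described. The function names (`build_chain`, `generate`) and the toy corpus are made up for illustration; generators used in the wild would likely use longer contexts and a much larger corpus.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words observed right after it."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length, rng):
    """Random-walk the chain from `start`, emitting up to `length` words."""
    out = [start]
    while len(out) < length:
        followers = chain.get(out[-1])
        if not followers:  # dead end: this word was never followed by anything
            break
        out.append(rng.choice(followers))
    return " ".join(out)

# Toy corpus (an assumption for illustration only).
corpus = "the cat sat on the mat and the dog sat on the cat"
chain = build_chain(corpus)
sample = generate(chain, "the", 8, random.Random(0))
print(sample)
```

Every adjacent word pair in the output was seen in the corpus, which is why the result tends to look locally grammatical while meaning nothing overall.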
Sentence = "This sentence is semantically and syntactically valid."
log P(Sentence) = log(p(START,START,This)) + log(p(START,This,sentence)) + log(p(This,sentence,is)) + log(p(sentence,is,semantically)) + log(p(is,semantically,and)) + log(p(semantically,and,syntactically)) + log(p(and,syntactically,valid)) + log(p(syntactically,valid,.)) + log(p(valid,.,STOP)) + log(p(.,STOP,STOP))
where START and STOP are special symbols that aid in determining the proximity of a word to the beginning and end of a sentence.
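The sum above can be sketched in a few lines of Python. This is only an illustration under some assumptions: I read p(a,b,c) as the probability of c given the two preceding words a and b, the probabilities are estimated from raw counts with add-one smoothing (my choice, not something specified above), and the names `trigrams`, `train` and `log_prob` are hypothetical.

```python
import math
from collections import Counter

START, STOP = "<s>", "</s>"

def trigrams(tokens):
    """Pad with two START and two STOP symbols, then slide a window of 3."""
    padded = [START, START] + tokens + [STOP, STOP]
    return list(zip(padded, padded[1:], padded[2:]))

def train(sentences):
    """Count trigrams and their two-word contexts over tokenized sentences."""
    tri, bi, vocab = Counter(), Counter(), {STOP}
    for tokens in sentences:
        vocab.update(tokens)
        for a, b, c in trigrams(tokens):
            tri[(a, b, c)] += 1
            bi[(a, b)] += 1
    return tri, bi, vocab

def log_prob(tokens, tri, bi, vocab):
    """Sum of log p(c | a, b) over the sentence, with add-one smoothing."""
    v = len(vocab)
    return sum(
        math.log((tri[(a, b, c)] + 1) / (bi[(a, b)] + v))
        for a, b, c in trigrams(tokens)
    )

# Tiny made-up training set, for illustration only.
tri, bi, vocab = train([["this", "is", "fine"], ["this", "is", "valid"]])
seen = log_prob(["this", "is", "fine"], tri, bi, vocab)
shuffled = log_prob(["fine", "is", "this"], tri, bi, vocab)
print(seen, shuffled)  # the word order seen in training scores higher
```

Smoothing matters here because a single unseen trigram would otherwise contribute log(0) and sink the whole sentence.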
If your training set fails to generalize sufficiently, you could use Bayesian inference to estimate the probability that the sentence is spam. Under this framework, you'd calculate the posterior probability that the sentence is spam given the observed sequence of n-grams, which combines (i) the prior probability that any sequence of words is spam and (ii) the likelihood of the observed sequence under the spam hypothesis, i.e. how compatible the observation is with that hypothesis. By Bayes' rule, the posterior is proportional to the product of the two.
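A rough sketch of that posterior calculation, assuming you already have two log-likelihoods for the sentence: one from a language model trained on spam and one from a model trained on legitimate text ("ham"). The two-model setup, the function name `posterior_spam` and the default prior are my assumptions, not something stated above.

```python
import math

def posterior_spam(log_lik_spam, log_lik_ham, prior_spam=0.5):
    """P(spam | sentence) via Bayes' rule, computed in log space.

    log_lik_spam: log P(sentence | spam), e.g. from a spam-trained n-gram model
    log_lik_ham:  log P(sentence | ham), from a model trained on normal text
    prior_spam:   P(spam) before seeing the sentence; must be in (0, 1)
    """
    log_joint_spam = log_lik_spam + math.log(prior_spam)
    log_joint_ham = log_lik_ham + math.log(1.0 - prior_spam)
    # Subtract the max before exponentiating so very negative
    # log-likelihoods do not underflow to zero.
    m = max(log_joint_spam, log_joint_ham)
    spam_term = math.exp(log_joint_spam - m)
    ham_term = math.exp(log_joint_ham - m)
    return spam_term / (spam_term + ham_term)

# Equal likelihoods leave the prior unchanged.
print(posterior_spam(-10.0, -10.0, prior_spam=0.2))
```

When both models find the sentence equally likely, the result is just the prior; the evidence only moves the estimate when one model explains the sentence better than the other.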
[1] http://storage.googleapis.com/books/ngrams/books/datasetsv2....