by toxik
2611 days ago
Modern-day spam is typically generative, and modelling the distribution of "natural e-mail messages" is sadly too naive today. Human beings also understand text through vision, not through bits -- so 1oca1host is just me corrupting the word localhost, but making that inference requires a visual understanding of words. That also gave rise to what is probably a more common spam variant today: the text-embedded-as-an-image type. I've long been of the impression that the only proper way to do text analysis is by vision, a more end-to-end solution as it were.
|
No, it doesn’t, and pg explains why in his essay. (I don’t know if the article states this too; since I’d already read the essay, I didn’t bother reading an article summarizing it. The essay is really excellent, though.)
Quote from the essay:
> I'm more hopeful about Bayesian filters, because they evolve with the spam. So as spammers start using "c0ck" instead of "cock" to evade simple-minded spam filters based on individual words, Bayesian filters automatically notice. Indeed, "c0ck" is far more damning evidence than "cock", and Bayesian filters know precisely how much more.
> [...]
> To beat Bayesian filters, it would not be enough for spammers to make their emails unique or to stop using individual naughty words. They'd have to make their mails indistinguishable from your ordinary mail. And this I think would severely constrain them. Spam is mostly sales pitches, so unless your regular mail is all sales pitches, spams will inevitably have a different character. And the spammers would also, of course, have to change (and keep changing) their whole infrastructure, because otherwise the headers would look as bad to the Bayesian filters as ever, no matter what they did to the message body. I don't know enough about the infrastructure that spammers use to know how hard it would be to make the headers look innocent, but my guess is that it would be even harder than making the message look innocent.
http://www.paulgraham.com/spam.html
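To make the quoted point concrete, here is a minimal sketch of per-token Bayesian scoring in Python. It is a simplified naive Bayes with add-one smoothing and equal priors, not the exact algorithm from the essay. The toy corpus, the "fr33"/"free" stand-in tokens, and the function names are all made up for illustration.

```python
import math
from collections import Counter

def train(messages):
    """Count how many spam / ham messages each token appears in."""
    counts = {"spam": Counter(), "ham": Counter()}
    totals = Counter()
    for text, label in messages:
        totals[label] += 1
        counts[label].update(set(text.lower().split()))
    return counts, totals

def token_spam_prob(token, counts, totals):
    """P(spam | token) with add-one smoothing, assuming equal priors."""
    p_tok_spam = (counts["spam"][token] + 1) / (totals["spam"] + 2)
    p_tok_ham = (counts["ham"][token] + 1) / (totals["ham"] + 2)
    return p_tok_spam / (p_tok_spam + p_tok_ham)

def spam_score(text, counts, totals):
    """Combine per-token probabilities in log-odds space; return P(spam)."""
    log_odds = 0.0
    for token in set(text.lower().split()):
        p = token_spam_prob(token, counts, totals)
        log_odds += math.log(p) - math.log(1 - p)
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical toy corpus: "fr33" only ever shows up in spam,
# while plain "free" also appears in ordinary mail.
corpus = [
    ("fr33 pills buy now", "spam"),
    ("fr33 offer click here", "spam"),
    ("free watches buy now", "spam"),
    ("are you free for lunch", "ham"),
    ("free time this weekend", "ham"),
    ("meeting notes attached", "ham"),
]
counts, totals = train(corpus)

p_obfuscated = token_spam_prob("fr33", counts, totals)
p_plain = token_spam_prob("free", counts, totals)
print(p_obfuscated > p_plain)  # the obfuscated spelling is the stronger signal
```

Because the obfuscated token never occurs in the ham messages, its learned spam probability ends up higher than that of the plain word. Nobody had to tell the filter that "fr33" is a corruption of "free"; it only needed to see it in labelled spam.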
And as for your point about text in images, I don’t know of any email client today that defaults to showing images from unknown senders.
I receive a lot of spam, and it is all very distinct in nature; I think the Bayesian approach is still the way to go for fighting it.