Hacker News
by rouvax 2610 days ago
The article has the merit of significantly reducing the size of the original essay, while IMO still retaining two strong messages: 1. The original (is it?) method used to prevent spam, and 2. The 'seed' factor, which is expected to make spammers work harder. At mid-page I was thinking "meh, spammers will just have to improve their writing then", but this may not be sufficient thanks to the user-specific seed.

[edit: I didn't realize the original article was from 2002. I agree the article is a bit obsolete at this point.]


Modern-day spam is typically generative, and modelling the distribution of "natural e-mail messages" is sadly too naive today. Human beings also understand text through vision, not through bits -- so 1oca1host is just me corrupting the word localhost, but making that inference requires a visual understanding of words. That also gave rise to what is probably a more common spam variant today: the text-embedded-as-an-image type. I've long been of the impression that the only proper way to do text analysis is by vision, a more end-to-end solution as it were.
> 1oca1host is just me corrupting the word localhost, but making that inference requires a visual understanding of words

No it doesn’t, and pg explains why in his essay. (I don’t know if the article states this too; since I’d already read the essay, I didn’t bother to read an article summarizing it. The essay is really excellent though.)

Quote from the essay:

> I'm more hopeful about Bayesian filters, because they evolve with the spam. So as spammers start using "c0ck" instead of "cock" to evade simple-minded spam filters based on individual words, Bayesian filters automatically notice. Indeed, "c0ck" is far more damning evidence than "cock", and Bayesian filters know precisely how much more.

> [...]

> To beat Bayesian filters, it would not be enough for spammers to make their emails unique or to stop using individual naughty words. They'd have to make their mails indistinguishable from your ordinary mail. And this I think would severely constrain them. Spam is mostly sales pitches, so unless your regular mail is all sales pitches, spams will inevitably have a different character. And the spammers would also, of course, have to change (and keep changing) their whole infrastructure, because otherwise the headers would look as bad to the Bayesian filters as ever, no matter what they did to the message body. I don't know enough about the infrastructure that spammers use to know how hard it would be to make the headers look innocent, but my guess is that it would be even harder than making the message look innocent.

http://www.paulgraham.com/spam.html
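
To make the "c0ck" point concrete, here is a rough sketch of per-token scoring along the lines of the formula in the essay. The toy corpora and all function names are made up for illustration, and I've left out the essay's rule of ignoring tokens seen only a handful of times, so treat it as a sketch under those assumptions rather than pg's actual code.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercased runs of letters, digits, $, ' and -.
    # "c0ck" and "cock" stay distinct tokens; no visual normalization is done.
    return re.findall(r"[a-z0-9$'-]+", text.lower())

def train(messages):
    # Count token occurrences across a corpus.
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts

def token_prob(token, spam, ham, n_spam, n_ham, unseen=0.4):
    # Probability that a message containing `token` is spam.
    b = spam[token]
    g = 2 * ham[token]  # good counts are doubled to bias against false positives
    if b + g == 0:
        return unseen  # never-seen tokens are treated as mildly innocent
    bad = min(1.0, b / n_spam)
    good = min(1.0, g / n_ham)
    return max(0.01, min(0.99, bad / (good + bad)))

def spam_score(text, spam, ham, n_spam, n_ham, top=15):
    # Combine the most "interesting" tokens (those farthest from 0.5).
    probs = [token_prob(t, spam, ham, n_spam, n_ham) for t in set(tokenize(text))]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    prod, inv = 1.0, 1.0
    for p in probs[:top]:
        prod *= p
        inv *= 1.0 - p
    return prod / (prod + inv)

# Hypothetical toy corpora, invented purely for this example.
SPAM_MSGS = ["buy c0ck pills now", "cheap c0ck pills", "cock pills sale now"]
HAM_MSGS = ["the cock crowed at dawn", "meeting at dawn tomorrow",
            "see the sale report tomorrow"]

SPAM = train(SPAM_MSGS)
HAM = train(HAM_MSGS)
N_SPAM, N_HAM = len(SPAM_MSGS), len(HAM_MSGS)

p_obfuscated = token_prob("c0ck", SPAM, HAM, N_SPAM, N_HAM)
p_plain = token_prob("cock", SPAM, HAM, N_SPAM, N_HAM)
```

In this toy data the obfuscated token only ever shows up in spam, while the plain word also appears in legitimate mail, so `p_obfuscated` comes out higher than `p_plain`. The filter never has to "see" that c0ck looks like cock; it just learns that the misspelled token is a strong signal in its own right.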

And as for your point about text in images, I don’t know of any email client today that defaults to showing images from unknown senders.

I receive a lot of spam, and it is all very distinct in nature. I think the Bayesian approach is still the way to go for fighting it.

You may very well be right, but then isn't it a pity that a 2019 article on a 2002 method didn't mention it?