by jeffbee 901 days ago
It is not and has never been a good classifier. If open AI fans want to contribute something of value to society, they would train a spam classifier on a large, manually-labeled corpus of mail, where the features include envelope data. That would get open source maybe 10% of the way to Gmail quality, or 100x better than SA.

FWIW I've been using SpamAssassin for over a decade personally (partly to avoid Google dependence), and it's been pretty darn good once I ran the Bayesian learning thing a few times many years ago. I get like 3-5 spams per week in my inbox. Do others really consider SA that bad?
FastMail uses SpamAssassin, and I get less than one spam email in my inbox a month, with essentially zero false positives (which is the tricky bit where Gmail seems to fail – I'd rather have the occasional spam in my email than false positives).

In short: you can probably do better than 3-5 spams per week with SA.

The big problem is the entire thing is a beast to configure with all the documentation of a Babylonian cuneiform stone tablet.

I hate to agree here, but configuring SpamAssassin is pretty rough. That being said, once it's done, it's pretty bulletproof and doesn't usually require messing with it all the time.
> I get like 3-5 spams per week in my inbox

I'm more curious about the opposite metric: how many non-spam emails a week aren't getting delivered to you? Because that seems to be the real flaw in SpamAssassin: the false positive rate.

And SpamAssassin users don't usually have much visibility into this, so when emails don't get to them they just blame the sender.

> how many non-spam emails a week aren't getting delivered to you?

My false-positive rate is very low, maybe a couple per month. However, I can predict with a high degree of accuracy when a piece of email is likely to land in the spam folder. Things like confirmation emails, registration emails, etc. are guaranteed to land in the spam folder. It's pretty hard for any system to accommodate those without allowing spam to get by.

That's fine by me, though, because I know when to check my spam folder.

Fastmail user here, so SpamAssassin I assume: virtually no false positives. My Gmail spam folder is generally 50/50 false and true positives. I really can't use that email for anything, as having to go to the spam folder every day defeats the purpose of an anti-spam filter.
On Gmail I get maybe 1 spam a month max in my inbox (and it blocks many per day)
That's pretty good. I have no doubt that Gmail spam protection is better than my self-hosted SA protection. For me, independence from a somewhat suspicious large company for something as important as e-mail is worth it.
I have a gmail honeypot where I fetchmail junk email straight to my junk folder and have a scheduled sa-learn cronjob. Ever since I started this I essentially stopped getting junk email in my selfhosted inbox.

I also have dovecot set to learn Ham every time I file an email from the inbox to a folder for good measure.

So...statistically insignificant difference from SA for most mail users.
3-5 spams per week vs. 1 per month is a big difference
I got (if I'm reading this right) 5500 emails (junk+delivered) to my personal mail account from 12/01/2023 to 12/31/2023. So that's a minimum, since I don't see the ones that get flushed out before I even see them.

1 spam a month would be .018% of emails, and 5 x 4 = 20 spams a month would be .364% of emails.

So I would have gotten about .346% more spam based on the number of emails. In reality, because I don't see all of the mails, it's less. Is a touch more than a third of a percent a 'big difference'? YMMV.

It's a 20x difference... My annoyance with spam is unrelated to the detection rate and is entirely related to the number of spam emails I see.
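For anyone who wants to check both readings of those numbers, here is a quick sketch. The 5500 total and the "5 per week x 4 weeks" approximation come straight from the comments above; the variable names are just mine. Note that the exact difference rounds to .345 percentage points, because the .346 figure subtracts the already-rounded percentages.

```python
# Figures taken from the comments above; nothing here is measured data of my own.
total_emails = 5500      # junk + delivered in one month
gmail_spam = 1           # roughly 1 spam per month reaching the inbox
sa_spam = 5 * 4          # 5 per week x 4 weeks (the commenter's approximation)


def pct(count, total):
    """Share of all mail, expressed as a percentage."""
    return 100.0 * count / total


gmail_pct = pct(gmail_spam, total_emails)
sa_pct = pct(sa_spam, total_emails)
difference_points = sa_pct - gmail_pct   # absolute gap, in percentage points
ratio = sa_spam / gmail_spam             # relative gap, what the inbox owner sees

print(f"Gmail: {gmail_pct:.3f}%  SA: {sa_pct:.3f}%")
print(f"Absolute difference: {difference_points:.3f} points")
print(f"Relative difference: {ratio:.0f}x")
```

Both statements are true at once: the absolute gap is about a third of a percentage point of all mail, and the relative gap in what lands in the inbox is 20x.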
I find SA works excellently, personally.
We'll get there eventually, but it will be a bit. Spam classification at scale is already a compute-bound, or at least compute-starved, operation. Spam classification systems already do what they can to avoid so much as invoking a virus scanner if they can avoid it, because at scale it's so expensive. LLM-based spam classification is another order of magnitude more expensive and would require hardware that current spam systems do not have.

But that's a problem that will resolve itself over time, in a variety of ways. And the spam systems can play the same tricks with only invoking it on a fraction of emails too, of course. It's just at current expense levels, that would be a very small fraction indeed. I'd hazard that trying to use modern AI on spam classification at scale could easily consume 10x-100x of all current AI hardware and still make less of a dent than you'd hope.

It doesn't need to be computationally costly because, as you seem to imply, there are tiers of cost tradeoffs. You can invoke a very cheap classifier at SMTP time, biased toward few false positives, that temporarily rejects anything highly likely to be spam. You can do this without even glancing at the body. Of course, having signals about peer reputation is the strong suit of Gmail or Microsoft, and the distributed, open community would need to solve the problem of promptly updating and distributing such reputation signals. And by "promptly" I mean within seconds of the leading edge of an attack.

Then there are increasing tiers of cost that you would only run after it becomes likely that the message is acceptable. As you say, you would only run an antivirus on a message on the verge of delivery, because decoding the attachment and running the AV (in an expensive sandbox) is so costly.
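A rough sketch of that tiering idea, in case it helps. Everything here is hypothetical: the `Message` fields, the two scoring functions, the thresholds, and the hard-coded "bad peer" address are placeholders I made up, not how Gmail, SpamAssassin, or any real MTA does it. The only point is the control flow: cheap checks run first, and the costlier tier is only invoked if the cheap one did not already reject.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Message:
    peer_ip: str
    envelope_from: str
    body: str = ""


def envelope_score(msg: Message) -> float:
    """Cheap tier: looks only at envelope data (placeholder reputation list)."""
    bad_peers = {"203.0.113.7"}  # hypothetical; a real system would query a reputation feed
    return 0.99 if msg.peer_ip in bad_peers else 0.10


def body_score(msg: Message) -> float:
    """Costlier tier: stands in for content analysis (placeholder rule)."""
    return 0.95 if "free money" in msg.body.lower() else 0.05


# (name, scorer, reject threshold), ordered from cheapest to most expensive.
# High thresholds bias each tier toward few false positives.
TIERS: List[Tuple[str, Callable[[Message], float], float]] = [
    ("envelope", envelope_score, 0.98),
    ("body", body_score, 0.90),
]


def classify(msg: Message, tiers=TIERS):
    """Return (verdict, tiers_run); stops at the first tier that rejects."""
    tiers_run = []
    for name, scorer, threshold in tiers:
        tiers_run.append(name)
        if scorer(msg) >= threshold:
            return f"reject:{name}", tiers_run
    return "accept", tiers_run


print(classify(Message("203.0.113.7", "a@example.com", "hello")))
print(classify(Message("198.51.100.1", "b@example.com", "FREE MONEY inside")))
print(classify(Message("198.51.100.1", "c@example.com", "meeting notes")))
```

In the first call the body tier never runs, which is the whole cost argument: the expensive work is reserved for messages that survived the cheap checks.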

I hope against hope that AI spam detection never becomes a thing. At least with today's methods, I can tell a person why their message was marked as spam. If AI detection becomes the norm, all I can do is shrug and say, "Sorry, it's the algorithm."
Gmail has used machine learning to classify spam since its creation.

https://workspace.google.com/blog/identity-and-security/an-o...

Gmail is successful because it naturally is the biggest honeypot. Most antispam API filters are like accumulators. When a trend is detected, the rest are protected. But overall, it's about scale.
SpamAssassin has a Bayes filter that you can train with ham and spam. This has basically been a thing since forever.
"A Plan for Spam" (2002) - https://paulgraham.com/spam.html
How do you think SpamAssassin/gmail/outlook created their spam rules?
Correct, SpamAssassin will not classify emails by default. But it does include a Bayesian classifier and tooling that can be used to train it with curated ham and spam emails. It does require extra steps to set this up and feed it selected maildirs (for example, exclude inbox, spam and trash when training ham).

https://spamassassin.apache.org/full/3.0.x/dist/doc/sa-learn... https://cwiki.apache.org/confluence/display/spamassassin/Bay...
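To make "train it with ham and spam" concrete, here is a toy naive Bayes classifier. This is a generic textbook sketch, not SpamAssassin's actual Bayes implementation (which uses its own tokenizer, storage, and probability combining); the class name, tokenizer regex, and sample messages are all made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Very naive tokenizer: lowercase runs of letters, digits, $ and '."""
    return re.findall(r"[a-z0-9$']+", text.lower())


class ToyBayes:
    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def learn(self, label, text):
        """Feed one curated message labeled 'spam' or 'ham'."""
        self.counts[label].update(tokenize(text))
        self.docs[label] += 1

    def spam_probability(self, text):
        """Assumes at least one spam and one ham message have been learned."""
        vocab_size = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        # Start from the prior log-odds of spam vs. ham.
        log_odds = math.log(self.docs["spam"]) - math.log(self.docs["ham"])
        for label, sign in (("spam", 1), ("ham", -1)):
            total = sum(self.counts[label].values())
            for tok in tokenize(text):
                # Laplace (add-one) smoothing so unseen tokens don't zero things out.
                likelihood = (self.counts[label][tok] + 1) / (total + vocab_size)
                log_odds += sign * math.log(likelihood)
        log_odds = max(-50.0, min(50.0, log_odds))  # keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-log_odds))


clf = ToyBayes()
clf.learn("spam", "win free money now")
clf.learn("spam", "free prize claim now")
clf.learn("ham", "meeting notes attached")
clf.learn("ham", "lunch tomorrow at noon")

print(clf.spam_probability("free money"))
print(clf.spam_probability("meeting tomorrow"))
```

The quality of the result depends almost entirely on what you feed it, which is why the curated-maildir step mentioned above matters more than the math.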

I find it far more likely that AI will be used (indeed, is used already) to generate spam, rather than filter it.
Really any of the open source language models might work well enough for the job. If you could manage to get a classifier that runs with TensorFlow to take advantage of a Coral TPU, it would certainly be a major step up with manageable performance.
Does SpamBayes still work?
The Python project that's client side? To the best of my knowledge it hasn't been updated in years. SpamAssassin's built-in Bayesian classifier works just fine.
I still used spambayes up until I basically abandoned my self-hosted e-mail setup (in favor of an @gmail).

It occurred to me recently that LLM-style tokenization + Bayesian classification would be a sweet upgrade for spambayes, which always struggled with ad-hoc tokenization rules.

(I don't think of it as "client side"; it was integrated with my system via procmail on what I'd call the "server side". You could use it in other ways, including as an Outlook plugin, way back in the day. Or it could connect to an IMAP mailbox and filter messages it found, etc. Really versatile tool for its time)