by jerf 901 days ago
We'll get there eventually, but it will take a while. Spam classification at scale is already a compute-bound, or at least compute-starved, operation. Spam systems already go out of their way to avoid so much as invoking a virus scanner, because at scale it's so expensive. LLM-based spam classification is another order of magnitude more expensive and would require hardware that current spam systems do not have.

But that's a problem that will resolve itself over time, in a variety of ways. And spam systems can play the same trick of invoking it on only a fraction of emails, of course. It's just that at current expense levels, that would be a very small fraction indeed. I'd hazard that trying to use modern AI on spam classification at scale could easily consume 10x-100x all the AI hardware that currently exists and still make less of a dent than you'd hope.

1 comment

It doesn't need to be computationally costly because, as you seem to imply, there are tiers of cost tradeoffs. At SMTP time you can invoke a very cheap classifier, biased toward few false positives, that temporarily rejects anything highly likely to be spam. You can do this without even glancing at the body. Of course, peer reputation signals are the strong suit of Gmail or Microsoft, and the distributed, open community would need to solve the problem of promptly updating and distributing such signals. And by "promptly" I mean within seconds of the leading edge of an attack.
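
To make the SMTP-time idea concrete, here is a rough sketch of a decision that looks only at the connecting peer and never at the body. Everything in it is an assumption for illustration: the `REPUTATION` table, the `DEFER_THRESHOLD` value, the function name, and the exact reply text are all made up, and a real system would get its scores from some shared, fast-updating feed.

```python
# Hypothetical reputation table: peer IP -> estimated probability that mail
# from that peer is spam. The addresses are documentation-range examples.
REPUTATION = {
    "203.0.113.7": 0.99,
    "198.51.100.4": 0.60,
}

# Deliberately high, so legitimate senders are rarely deferred
# (biased toward few false positives). The value is an assumption.
DEFER_THRESHOLD = 0.95


def smtp_time_reply(peer_ip: str, reputation: dict = REPUTATION) -> str:
    """Return an SMTP reply line using only the connecting peer's reputation."""
    # Unknown peers get the benefit of the doubt and move on to later tiers.
    spam_probability = reputation.get(peer_ip, 0.0)
    if spam_probability >= DEFER_THRESHOLD:
        # 4xx is a temporary failure: a legitimate sender will retry later.
        return "451 4.7.1 Please try again later"
    return "250 OK"
```

A temporary (4xx) reply rather than a permanent one keeps a mistake cheap: a legitimate server that gets deferred by accident simply retries, while most bulk senders don't bother.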

Then there are increasingly costly tiers that you would run only once it becomes likely that the message is acceptable. As you say, you would run an antivirus only on a message on the verge of delivery, because decoding the attachment and running the AV (in an expensive sandbox) is so costly.
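
The tiering can be sketched as a pipeline that runs checks from cheapest to most expensive and stops at the first rejection, so the costly tiers only ever see messages that survived the cheap ones. The tier names, the relative costs, and the toy checks below are all hypothetical placeholders, not how any real filter scores mail.

```python
from typing import Callable, NamedTuple, Optional


class Tier(NamedTuple):
    name: str
    cost: int  # relative cost units, invented for illustration
    check: Callable[[dict], Optional[str]]  # reject reason, or None to pass


def reputation_check(msg: dict) -> Optional[str]:
    # Toy stand-in for a peer reputation lookup.
    if msg.get("peer_spam_probability", 0.0) >= 0.95:
        return "bad peer reputation"
    return None


def content_check(msg: dict) -> Optional[str]:
    # Toy stand-in for a body classifier.
    if "buy now" in msg.get("body", "").lower():
        return "spammy body"
    return None


def antivirus_check(msg: dict) -> Optional[str]:
    # Toy stand-in for decoding the attachment and scanning it in a sandbox.
    if msg.get("attachment") == b"EICAR":
        return "infected attachment"
    return None


# Sorted so the cheapest tier always runs first.
TIERS = sorted(
    [
        Tier("antivirus", 1000, antivirus_check),
        Tier("content", 10, content_check),
        Tier("reputation", 1, reputation_check),
    ],
    key=lambda tier: tier.cost,
)


def classify(msg: dict) -> tuple:
    """Return (verdict, reason, cost_spent), stopping at the first rejection."""
    spent = 0
    for tier in TIERS:
        spent += tier.cost
        reason = tier.check(msg)
        if reason is not None:
            return ("reject", reason, spent)
    return ("deliver", None, spent)
```

With these made-up costs, a message from a bad peer is rejected after spending 1 unit, while only a message that passes everything pays the full 1011, which is the point: the sandboxed AV run is reserved for mail that is about to be delivered anyway.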