by pbowyer 4049 days ago
> This information is merely interesting for some academic researches on some very specific messages corpus.

No. It gives a very rough guide to "how much trouble is this spam filter going to be?". If you can say that X000 users have found only Y% of their email was misclassified, and I can compare against other spam filters, that's really useful.

So yes, too many variables for one to be accurate, but good enough to gauge average-performance across a tribe of users.

1 comments

Several years ago I compared rspamd with SA and Kaspersky antispam on a random stream of users' messages, and I got almost the same rates of false positives and false negatives for all three products. However, over the years spammers have become much smarter (image spam, valid DKIM, valid SPF and other clever tricks).
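To make the comparison concrete, here is a minimal sketch of how the false positive and false negative rates could be computed for one filter over a labelled message stream. The function name `error_rates` and the toy data are made up for illustration; they are not from rspamd, SA or Kaspersky, and the numbers are not measurements.

```python
def error_rates(labels, predictions):
    """Compute error rates for one filter.

    labels      -- true classes, each "spam" or "ham"
    predictions -- the filter's verdicts, same length and values
    """
    pairs = list(zip(labels, predictions))
    # False positive: legitimate mail (ham) flagged as spam.
    fp = sum(1 for truth, guess in pairs if truth == "ham" and guess == "spam")
    # False negative: spam that was let through as ham.
    fn = sum(1 for truth, guess in pairs if truth == "spam" and guess == "ham")
    ham = sum(1 for truth in labels if truth == "ham")
    spam = sum(1 for truth in labels if truth == "spam")
    return {
        "false_positive_rate": fp / ham if ham else 0.0,
        "false_negative_rate": fn / spam if spam else 0.0,
        "misclassified": (fp + fn) / len(pairs) if pairs else 0.0,
    }


# Toy data, invented purely to exercise the function.
labels = ["ham", "ham", "spam", "spam", "ham", "spam", "ham", "spam"]
predictions = ["ham", "spam", "spam", "ham", "ham", "spam", "ham", "spam"]
rates = error_rates(labels, predictions)
print(rates)
```

Running the same function over each product's verdicts on the same stream gives numbers that can be compared side by side.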

Regarding statistics, rspamd uses an OSBF-Bayes classifier with 5-gram input (so it is not naive Bayes). I used the following academic paper as a reference: http://osbf-lua.luaforge.net/papers/osbf-eddc.pdf. The same algorithm is also used by the crm114 spam classifier. However, the Bayes classifier is a very small part of rspamd (unlike dspam, for example), and it could be almost useless if you have, let's say, 50 million user accounts. Rspamd is targeted at systems of this scale.
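As a rough illustration of what the 5-token input means, here is a sketch of orthogonal sparse bigram (OSB) feature generation as it is usually described for OSB-style classifiers: within a sliding window of 5 tokens, each token is paired with each of the next 4 tokens, and the number of skipped tokens is kept as part of the feature. This is an assumption-laden sketch, not rspamd's actual code; the function name `osb_features` is hypothetical, and hashing, weighting and the Bayesian scoring itself are left out.

```python
def osb_features(tokens, window=5):
    """Generate OSB-style features from a token list.

    Each feature is (first_token, skipped, second_token), where
    `skipped` is how many tokens lie between the two. With window=5
    a token is paired with up to 4 following tokens.
    """
    features = []
    for i, first in enumerate(tokens):
        for distance in range(1, window):
            j = i + distance
            if j >= len(tokens):
                break  # ran off the end of the message
            features.append((first, distance - 1, tokens[j]))
    return features


# Invented example message, already split into tokens.
tokens = ["buy", "cheap", "pills", "online", "now", "today"]
features = osb_features(tokens)
for feature in features[:4]:
    print(feature)
```

Because word pairs with gaps are used instead of single words, the features carry some word-order context, which is the sense in which the input is not the plain bag of words typical of a naive Bayes filter.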