by cebka 4049 days ago
This information is merely interesting for some academic researches on some very specific messages corpus. But in the real world, I cannot efficiently evaluate the accuracy because it depends on zillions of parameters. Moreover, since rspamd uses not only statistics but also a number of other sources, such as DNS lists, SPF, DKIM, hash databases and so on, it is literally impossible to give a definitive figure for its precision.
3 comments

> This information is merely interesting for some academic researches on some very specific messages corpus.

No. It gives a very rough guide to "how much trouble is this spam filter going to be?". If you can say that X000 users have found only Y% of their email was misclassified, and I can compare against other spam filters, that's really useful.

So yes, there are too many variables for any one figure to be accurate, but it is good enough to gauge average performance across a tribe of users.

Several years ago I compared rspamd with SA and Kaspersky antispam on a random stream of users' messages, and I got almost the same rate of false positives and false negatives for all three products. However, over the years spammers have been getting much smarter (image spam, valid DKIM, valid SPF and other clever tricks).

Regarding statistics, rspamd uses an OSBF-Bayes classifier with 5-gram input (so it is not naive Bayes). I've used the following academic paper as a reference: http://osbf-lua.luaforge.net/papers/osbf-eddc.pdf. This algorithm is also used in the crm114 spam classifier. However, the Bayes classifier is a very small part of rspamd (unlike dspam, for example), and it could be almost useless if you have, let's say, 50 million user accounts. Rspamd is targeted at systems of this scale.
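
To make the "5-gram input" point concrete, here is a minimal sketch of orthogonal sparse bigram (OSB) feature generation as it is generally described: a window of 5 tokens slides over the message, and the newest token is paired with each of the up to 4 tokens before it, keeping the gap between them. The function name `osb_features`, the `<skip>` marker and the whitespace tokenisation are my own assumptions for illustration. This is not rspamd's actual code.

```python
def osb_features(tokens, window=5):
    """Generate orthogonal sparse bigram features from a list of tokens.

    Illustrative sketch only: the "<skip>" marker and the feature string
    format are assumptions, not what rspamd or crm114 emit.
    """
    features = []
    for i in range(1, len(tokens)):
        # Pair the current token with each of up to window-1 earlier tokens.
        for dist in range(1, window):
            j = i - dist
            if j < 0:
                break
            # Keep the distance between the two tokens as skip markers.
            skips = ["<skip>"] * (dist - 1)
            features.append(" ".join([tokens[j], *skips, tokens[i]]))
    return features


if __name__ == "__main__":
    for feature in osb_features("buy cheap pills online now".split()):
        print(feature)
```

Each feature would then be fed to the Bayesian classifier in place of single words, which is why the result is no longer a plain word-level naive Bayes.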

If I am to invest the time and energy to switch over from my current anti-spam solution to this, then I want to have some level of assurance that it will be more effective than my current scheme.

I agree that spam is a moving target and that is why anti-spam systems need constant updating. My current system (over the last 30 days) rejected 87% (around 45k emails) and accepted 13%. Of that 13% (6600), around 300 were classified as spam by the Bayesian classifier in Thunderbird. Around 80 were manually classified as spam and added to Thunderbird's rules. The Thunderbird classifier probably classified 2 ham messages as spam. I don't know of any ham->spam errors in the initial filtering phase.
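
For anyone who wants to compare against these numbers, here is a rough sketch of the rates they imply. It assumes that every rejected message really was spam (I don't know of any errors there) and that the 300 + 80 messages are all the spam that got past the first stage. The function and key names are made up for the example.

```python
def filter_rates(rejected, accepted, client_spam, manual_spam, client_false_pos):
    """Derive rough rates from the 30-day counts quoted above.

    Assumes all rejected mail was spam and that client_spam + manual_spam
    is all the spam that passed the first-stage filter.
    """
    total = rejected + accepted
    spam_past_filter = client_spam + manual_spam
    total_spam = rejected + spam_past_filter
    ham = accepted - spam_past_filter
    return {
        "total": total,
        "rejected_share": rejected / total,
        "first_stage_catch_rate": rejected / total_spam,
        "spam_share_of_accepted": spam_past_filter / accepted,
        "missed_by_both_stages": manual_spam / accepted,
        "client_false_positive_rate": client_false_pos / ham,
    }


if __name__ == "__main__":
    rates = filter_rates(rejected=45_000, accepted=6_600,
                         client_spam=300, manual_spam=80, client_false_pos=2)
    for name, value in rates.items():
        print(f"{name}: {value:.4f}" if isinstance(value, float) else f"{name}: {value}")
```

The counts are approximate ("around 45k", "around 300"), so the output is only good for a ballpark comparison.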

Should rspamd be expected to do better, about the same, or worse?

From what you are saying, I can conclude that you give a very high score to the statistical classifier (or rely solely on statistics). This is not an option for a system with millions of users (their accept/reject rate is close to 70/30 percent, as we cannot rely on Bayes at all). Therefore, I have never evaluated Bayes as a single classifier. Nevertheless, I'm using OSBF-Bayes, which has been proven to be a good classifier, as the statistical algorithm for rspamd.
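
To illustrate what "scoring" means here, below is a minimal sketch of combining several signals into one verdict, where the statistical result is weighted so that it cannot trigger a reject on its own. The symbol names, weights and threshold are invented for the example and are not rspamd's defaults or its configuration format.

```python
# Hypothetical symbol weights; none of these values come from rspamd.
WEIGHTS = {
    "dnsbl_listed": 4.0,
    "spf_fail": 2.0,
    "dkim_invalid": 1.5,
    "fuzzy_hash_match": 5.0,
    "bayes_spam": 3.0,  # deliberately below the reject threshold
}
REJECT_THRESHOLD = 8.0


def score(symbols):
    """Sum the weights of every symbol that fired for a message."""
    return sum(WEIGHTS[name] for name in symbols)


def verdict(symbols):
    """Reject only when the combined score reaches the threshold."""
    return "reject" if score(symbols) >= REJECT_THRESHOLD else "accept"


if __name__ == "__main__":
    print(verdict({"bayes_spam"}))
    print(verdict({"bayes_spam", "fuzzy_hash_match"}))
```

With weights like these, a Bayes hit alone never rejects a message; it only tips the balance when other sources agree.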

For similar systems (i.e. small, but doing good manual classification when all else fails), I suspect that if more people used razor (or, again, something similar) we'd achieve better results (razor allows this data to be shared automatically).

> This information is merely interesting for some academic researches

It most definitely is not. It's the most important factor when choosing a spam filter.

False positives are extremely harmful (they can result in loss of communication, which is what you want to avoid the most). A significant number of false positives is what makes the difference between a useful filter and a useless one.

Nobody wants to tell their users "check your spam mailbox (the one with dozens of spam messages) for ham every once in a while".

As I see it, unless you can guarantee that you give zero false positives (which, knowing how certain users compose their mail, is arguably impossible) you still have to do it.

Also, I suppose that the false positive/negative rate can only be given for a well-defined corpus. I'm not sure there is one that is a good representation of current and future spam trends, so in the end giving those numbers could be very misleading.