Discussion
There are many problems associated with detecting spam for the final recipient of an email. It is important to understand these problems in order to understand what Bayesian self-learning is and how it fits into Kerio's solution for spam protection.
Terminology
- Spam is email the recipient considers to be unsolicited junk email.
- Ham is email the recipient considers to be not spam.
- False positives are ham messages that were incorrectly marked as spam.
- False negatives are spam messages that were incorrectly marked as ham.
The Problem of SpamAssassin
The main problem of SpamAssassin is that it uses static rule sets to determine if a message is spam. This is a problem because a fixed set of rules cannot accurately define spam for everybody. The result of this is that SpamAssassin can capture most spam, but it will always have some false positives and false negatives.
The other problem with SpamAssassin's static rules is that the content of spam changes over time, so spam mutates. Unless the rules in SpamAssassin also change over time, more and more spam will get through. This means that constant rule updates are necessary to maximize spam blocking capabilities.
Bayesian Filtering
Bayesian filtering is the answer to the SpamAssassin problem. The Bayes database can be trained by the recipient so that it knows which messages look like spam and which look like ham. The Bayesian filter does this by breaking a message up into many small pieces called tokens, then determining which tokens occur mostly in spam messages and which occur mostly in ham messages.
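The token idea can be illustrated with a minimal naive Bayes sketch. This is not SpamAssassin's actual implementation; the tokenizer, the `TinyBayes` class, and the add-one smoothing are assumptions chosen only to show how token counts from learned spam and ham can be turned into a spam probability.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split a message into a set of lowercase word tokens (simplified)."""
    return set(re.findall(r"[a-z0-9$']+", text.lower()))


class TinyBayes:
    """Hypothetical toy Bayes database: counts messages containing each token."""

    def __init__(self):
        self.spam_tokens = Counter()
        self.ham_tokens = Counter()
        self.n_spam = 0
        self.n_ham = 0

    def learn(self, text, is_spam):
        """Record the tokens of one message as spam or ham."""
        tokens = tokenize(text)
        if is_spam:
            self.spam_tokens.update(tokens)
            self.n_spam += 1
        else:
            self.ham_tokens.update(tokens)
            self.n_ham += 1

    def spam_probability(self, text):
        """Combine per-token evidence into a probability between 0 and 1."""
        # Start from the prior odds, then add each token's log-likelihood ratio.
        log_odds = math.log((self.n_spam + 1) / (self.n_ham + 1))
        for tok in tokenize(text):
            # Add-one smoothing so unseen tokens never give a zero probability.
            p_spam = (self.spam_tokens[tok] + 1) / (self.n_spam + 2)
            p_ham = (self.ham_tokens[tok] + 1) / (self.n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


db = TinyBayes()
db.learn("cheap pills buy now", is_spam=True)
db.learn("buy cheap watches now", is_spam=True)
db.learn("meeting agenda for monday", is_spam=False)
db.learn("lunch on monday?", is_spam=False)
```

With this toy training set, `db.spam_probability("cheap pills")` comes out well above 0.5 and `db.spam_probability("monday meeting")` well below it, because "cheap" and "pills" were seen only in spam and "monday" and "meeting" only in ham.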
The Problem of Bayesian Filtering
The problem with Bayesian filtering is that it must learn from a large number of emails before it can function effectively. Bayesian filtering does not even begin to work until it has learned at least 200 spam messages and 200 ham messages. End users would need to work hard to train the Bayes database enough to effectively fight mutating spam.
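The minimum-training requirement amounts to a simple gate on the learned message counts. The sketch below assumes a hypothetical helper named `bayes_is_active`; only the 200/200 figures come from the text above.

```python
# Minimum learned messages before the Bayes filter is used (from the text above).
MIN_SPAM_LEARNED = 200
MIN_HAM_LEARNED = 200


def bayes_is_active(n_spam_learned, n_ham_learned):
    """Return True once enough spam AND enough ham have been learned."""
    return n_spam_learned >= MIN_SPAM_LEARNED and n_ham_learned >= MIN_HAM_LEARNED
```

Note that both counts must reach the minimum: a database that has learned 500 spam messages but only 199 ham messages is still inactive.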
Bayesian Self-Learning
Bayesian self-learning is the answer to the Bayesian filtering problem. Instead of relying on end users, SpamAssassin trains the Bayes database automatically.
The scoring system in SpamAssassin is used for this purpose. The higher the SpamAssassin score, the more certain we are that the message is spam. The lower the score, the more certain we are that it is ham. The following criteria define whether or not SpamAssassin will train the Bayes database about a message:
- If the total SpamAssassin score is above 12, and both the header score and the body score are above 3, then train the Bayes database that the message is spam.
- If the total SpamAssassin score is below 0.1, then train the Bayes database that the message is ham.
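The two criteria above can be sketched as a small decision function. The function name `autolearn_decision` and its return values are illustrative assumptions, not part of SpamAssassin or Kerio; the thresholds are the ones listed above.

```python
from typing import Optional

SPAM_TOTAL_THRESHOLD = 12.0  # total score must be above this to learn as spam
SPAM_PART_THRESHOLD = 3.0    # header and body scores must each be above this
HAM_TOTAL_THRESHOLD = 0.1    # total score must be below this to learn as ham


def autolearn_decision(total: float, header_score: float, body_score: float) -> Optional[str]:
    """Return "spam", "ham", or None (do not train) for one scored message."""
    if (
        total > SPAM_TOTAL_THRESHOLD
        and header_score > SPAM_PART_THRESHOLD
        and body_score > SPAM_PART_THRESHOLD
    ):
        return "spam"
    if total < HAM_TOTAL_THRESHOLD:
        return "ham"
    # Scores in between are too uncertain to learn from.
    return None
```

For example, a message with a total score of 15 but a header score of only 1 is not learned as spam, because the header score does not exceed 3. Messages in the wide middle range are skipped, so the database is trained only on messages the static rules are most confident about.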
Since SpamAssassin scans all incoming emails, using it for Bayesian self-learning is a very effective way to keep the Bayes database up to date.