SpamAssassin collective education

As a Bayesian spam filter, SpamAssassin becomes much more effective when it is trained: we have to tell it which kinds of mail our users consider spam and which they do not, and report its mistakes back to it.

The setup is described below.

Overview

Seeding the database

To begin with, we have to seed the database with thousands of messages of both spam and ham, taken from:

  • spam: http://spamarchive.org
  • ham: mail in our users' inboxes that has the 'reply' Maildir flag set. The assumption is that mail someone replied to was probably not spam.

It works best if the spam and ham databases are roughly of equal size.
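The ham selection above can be sketched as follows. This is a minimal illustration, not the script actually used here: it assumes the standard Maildir naming convention, where flags follow ":2," at the end of the filename and "R" marks a replied message. The function names and the example filename are hypothetical.

```python
from pathlib import Path


def has_replied_flag(filename: str) -> bool:
    """Return True if a Maildir filename carries the 'R' (replied) flag.

    Maildir filenames end in ':2,<flags>', e.g. '1200000000.M1P2.host:2,RS'.
    """
    _, sep, flags = filename.rpartition(":2,")
    # No ':2,' marker means no flags at all (typically mail still in new/).
    return bool(sep) and "R" in flags


def replied_messages(maildir: Path) -> list[Path]:
    """List the messages in <maildir>/cur that have been replied to."""
    cur = maildir / "cur"
    return sorted(
        p for p in cur.iterdir() if p.is_file() and has_replied_flag(p.name)
    )
```

The resulting paths can then be passed as arguments to sa-learn --ham.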

Daily training

A typical practice, which keeps the database small and makes the system more efficient, is to train only on mistakes:

  • false-negatives: spam that was not detected;
  • false-positives: legitimate email erroneously detected as spam.

Hopefully, though, we won't get enough false-positives for this to be effective. That's why we train SpamAssassin on its mistakes and also on the email that has been replied to.

Of course, mail users have to be told how they can contribute; an example in French: AutoAideMail#Faire sa peau au spam.

False-negatives life-cycle

IMAP users

They simply have to move the false-negatives to their Spam IMAP folder.

A daily cronjob runs the sa-education-false-negatives script, which feeds 'sa-learn --spam' with the mail located in the Spam folders that is more than 3 days but less than 4 days old.
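The age window can be sketched like this. It is only an illustration under stated assumptions: the age of a message is taken from the file's modification time (the real script may use another criterion), and the function names are hypothetical. The one-day-wide window presumably lets a daily job pick up each message exactly once, while leaving users three days to move a message back out of the folder.

```python
import time
from pathlib import Path

DAY = 24 * 60 * 60  # seconds


def in_age_window(mtime, now, min_days=3, max_days=4):
    """True if a message is more than min_days but less than max_days old."""
    age = now - mtime
    return min_days * DAY < age < max_days * DAY


def messages_to_learn(spam_folder: Path, now=None):
    """Collect the messages of one Maildir folder that fall in the age window."""
    now = time.time() if now is None else now
    selected = []
    for sub in ("cur", "new"):
        directory = spam_folder / sub
        if directory.is_dir():
            selected.extend(
                p for p in directory.iterdir()
                if p.is_file() and in_age_window(p.stat().st_mtime, now)
            )
    return sorted(selected)
```

The returned paths would then be given as arguments to sa-learn --spam.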

It doesn't matter if these messages are false-negatives or spam already detected, since SpamAssassin keeps track of the messages it has already learnt and ignores them accordingly.

Note: old mail is automatically deleted from this folder by courier-imap, thanks to the setting IMAP_EMPTYTRASH=Trash:30,Spam:30 in /etc/courier/imapd.

Webmail users

SquirrelMail's Spam Buttons plugin places a "Spam" button on the message list page as well as on the message view page. This button moves the selected messages to the Spam IMAP folder.

POP users

They simply have to bounce (not forward) the false-negatives to a spamtrap.

Spamtrap

A spamtrap is a mailbox that is periodically used to feed sa-learn --spam.

A typical practice is to create a Unix user, named however you like, and have all of its mail piped directly to sa-learn with procmail, as explained on this page and on this one.

But we dislike creating Unix users, so instead:

  • we create a virtual mailbox 'spam@boum.org';
  • we tell AMaVis not to apply any virus or spam filtering to the mail sent to this address: $spam_lovers{lc('spam@boum.org')} = 1; and $virus_lovers{lc('spam@boum.org')} = 1; in /etc/amavis/amavisd.conf (or /etc/amavis/conf.d/50-user, for Debian etch);
  • a daily cronjob runs the sa-education-spamtrap script, which:
    • feeds sa-learn --spam with the mail received by the spamtrap, ignoring the Resent-* headers by using the sa-education-spamtrap user prefs custom SpamAssassin config file;
    • deletes this mail.
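The learn-then-delete step of the spamtrap job might look like the sketch below. It is not the actual sa-education-spamtrap script: the function name is hypothetical, the prefs file path is supplied by the caller, and it assumes that sa-learn accepts --prefspath=FILE and exits with a non-zero status on failure. A message is deleted only once it has been learnt, so a failed run loses nothing.

```python
import subprocess
from pathlib import Path


def learn_and_delete(messages, prefs_path, run=subprocess.run):
    """Feed each message to 'sa-learn --spam'; delete it only on success.

    'run' can be swapped for a stub when testing, so that sa-learn itself
    is not required.
    """
    learnt = []
    for msg in messages:
        msg = Path(msg)
        cmd = ["sa-learn", "--spam", f"--prefspath={prefs_path}", str(msg)]
        result = run(cmd)
        if result.returncode == 0:
            msg.unlink()  # safe to remove: the message has been learnt
            learnt.append(msg)
    return learnt
```

The same shape would fit a nonspamtrap job, with --ham in place of --spam.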

It's important to allow only our authenticated users to send mail to this address. This can be achieved with Postfix's smtpd_recipient_restrictions, and more specifically with the check_recipient_access restriction, which looks the recipient up in an access table.

E.g. in /etc/postfix/main.cf:

smtpd_recipient_restrictions =
   permit_sasl_authenticated,
   permit_mynetworks,
   check_recipient_access hash:/etc/postfix/recipient_access,
   reject_non_fqdn_hostname,
   reject_unauth_destination,
   < ... additional RBL checks and policy daemons calls ... >

And in /etc/postfix/recipient_access:

spam@boum.org   REJECT
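Why this ordering protects the spamtrap can be shown with a toy model of the first three restrictions. This is only an illustration of Postfix's "first decisive result wins" evaluation, not Postfix code; the function name and the dictionary standing in for the access table are hypothetical.

```python
def recipient_decision(recipient, sasl_authenticated, in_mynetworks, access_table):
    """Toy model: restrictions are tried in order, the first decisive one wins."""
    if sasl_authenticated:
        return "PERMIT"  # permit_sasl_authenticated
    if in_mynetworks:
        return "PERMIT"  # permit_mynetworks
    action = access_table.get(recipient.lower())
    if action is not None:
        return action  # check_recipient_access found a matching entry
    return "DUNNO"  # undecided: the later restrictions get to decide


# Stand-in for /etc/postfix/recipient_access
ACCESS = {"spam@boum.org": "REJECT"}

# An authenticated user may reach the spamtrap, an outside sender may not.
print(recipient_decision("spam@boum.org", True, False, ACCESS))
print(recipient_decision("spam@boum.org", False, False, ACCESS))
```

Note that clients in mynetworks are also let through, and that a hash: table has to be rebuilt with postmap /etc/postfix/recipient_access after each edit.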

Ham life-cycle

A weekly cronjob runs the sa-education-ham script, which feeds sa-learn --ham with the messages that have been replied to in the last 7 days.

False-positives life-cycle

NB: it is important to set the SpamAssassin report_safe setting to 0; otherwise the false-positives are stored as attachments to the spam report email, which makes them hard to feed into sa-learn --ham.

IMAP users

They simply have to move the false-positives to their NonSpam IMAP folder.

A daily cronjob runs the sa-education-false-positives script, which feeds 'sa-learn --ham' with the mail less than 1 day old located in the NonSpam folders.

Note: old mail is automatically cleaned from this folder by courier-imap, thanks to the setting IMAP_EMPTYTRASH in /etc/courier/imapd.

Webmail users

SquirrelMail's Spam Buttons plugin places a "Not Spam" button on the message list page as well as on the message view page. This button moves the selected messages to the NonSpam IMAP folder.

POP users

They simply have to bounce (not forward) the false-positives to a nonspamtrap.

NonSpamtrap

A nonspamtrap is a mailbox that is periodically used to feed sa-learn --ham. It works exactly like the spamtrap described above, with a sa-education-nonspamtrap script doing the feeding.

Bonus hacks

  • spam and nonspam folders-1.5.1.diff: tells SquirrelMail that the Spam and NonSpam folders are special ones (it will then at least display them in a special way and prevent them from being deleted)

Sources

Information and sentences borrowed from: