CRM114 Awesomeness

I hate spam!

Which probably puts me in the same camp as 99.99999% of the world. The other 1 in 10 million are, of course, the spammers, who seem to take the space invaders approach to sending e-mail: we'll keep sending you more until you die.

A few years ago I used to receive only perhaps 1 every 100 seconds, which was pretty annoying, but Spamassassin was quite able to filter out 99% of those and let through about 1-2 each day, which I could deal with. My spam levels increased to maybe 1 every 20 seconds, and late in 2005 I implemented a second layer of spam filtering on my laptop using DSpam. This worked quite effectively, but DSpam is really not the tool for the job - it's much more appropriate as a company-wide antispam solution, and potentially as a replacement for Spamassassin. It drove me nuts on my laptop because its resource usage slowed down the interactive response.

When I got my new laptop at the beginning of the year I decided against continuing with my rather baroque mail setup and chose to leave the spam filtering on the server. What I didn't realise is that my spam rate had increased again to around 1 every 8 seconds, and it has been slowly driving me to distraction ever since. It seems to have cranked up another notch recently, to perhaps 1 every 3 seconds now, so that the 1% making its way through Spamassassin was getting to a very annoying several hundred each day. The longer I took to resolve it, the more time I would be wasting dealing with it every day.
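
As a rough check on those numbers, the arithmetic can be sketched in a few lines of shell. The function name `leak_per_day` is made up for illustration, and the 1% leak rate is just the approximate figure quoted above, not a measurement.

```shell
#!/bin/bash
# Estimate how many spam messages per day slip past a filter.
# $1 = seconds between incoming spam messages
# $2 = percentage of spam the filter lets through (integer)
leak_per_day() {
  local interval=$1 leak_percent=$2
  # 86400 seconds in a day; integer arithmetic, so the result is rounded down
  echo $(( 86400 * leak_percent / interval / 100 ))
}

leak_per_day 8 1   # one spam every 8 seconds, 1% leaking: prints 108
leak_per_day 3 1   # one spam every 3 seconds, 1% leaking: prints 288
```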

What I chose to apply on this occasion was CRM114, which I had some vague idea might be able to help. I was fairly impressed by the relatively simple install, but what completely blew me away was the speed with which it was able to learn to be useful. Starting from scratch, it seems to be correctly classifying over 90% of my incoming mail after about 12 hours of training, on a total of only 75 'Unsure' messages. Even after only an hour it was getting over 50% (I'll describe my actual CRM114 installation process in a comment below). So far there have been no false positives.

Now that CRM114 is installed I will be able to look into some of its other mail classifying features too, and I'm really looking forward to that.

Did you train from scratch

Did you train from scratch with just your own mail (on mistakes it made), or did you use the pre-trained databases as well?

Training from scratch

I've posted my 'How I Did It' now, but yes: it was purely from incoming e-mail and the databases were just initialised to blank.

My CRM114 Installation

CRM114 is installed using the Debian package for Etch. I loosely followed the instructions in /usr/share/doc/crm114/CRM114_Mailfilter_HOWTO.txt.gz after the installation, as follows:

Creating the Initial Files


mkdir crm
cd crm
cssutil -b -r spam.css
cssutil -b -r nonspam.css
cp /usr/share/doc/crm114/examples/mailfilter.mfp .
cp /usr/share/doc/crm114/examples/rewrites.mfp .
cp /usr/share/doc/crm114/examples/priolist.mfp .

I edited mailfilter.mfp and changed the secret password. I edited rewrites.mfp to set my personal e-mail addresses, mail servers, mail names, etc, and I edited priolist.mfp mostly to just comment everything out, for the time being. Note that there's no whitelist/blacklist there because I'm going to use mailreaver, which is newer than mailfilter.

Running mailreaver from Procmail

I went into my IMAP client and created a "CRM" folder and several folders under that called "Spam", "NewSpam", "TrainingHam", "Unsure" and "Trained". Once they were created, I went onto the server and found the directories that were being used. In my case they were "$MAILDIR/.CRM.Spam/cur" and variations, so I added the following stanza into my .procmailrc:


# Run CRM114 mailreaver
:0fw: .msgid.lock
| /usr/bin/crm -u /home/andrew /usr/share/crm114/mailreaver.crm

:0:
* ^X-CRM114-Status: SPAM.*
$MAILDIR/.CRM.Spam/new

A script in cron

First, I added another rule so that anything with a subject starting 'UNS: ' gets moved to the 'Unsure' mailbox. Then I was ready to start moving stuff by hand from 'Unsure' either to 'NewSpam' (for spam) or to 'TrainingHam' (for things that aren't spam). I deliberately chose distinctive names so that the folders are easy to tell apart.
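
The extra rule itself isn't reproduced here, but the decision the two rules make between them can be sketched in shell. This is only an illustration of the logic, not the procmail recipe: the function name `crm_folder` and the "INBOX" fallback are assumptions, while the header and subject tests are the ones described above.

```shell
#!/bin/bash
# Decide which folder a filtered message belongs in, reading it on stdin.
# Only the header block (everything before the first blank line) is examined.
crm_folder() {
  awk '
    /^$/ { exit }                                  # end of headers, stop reading
    /^X-CRM114-Status: SPAM/ { folder = "Spam" }   # same test as the procmail rule
    /^Subject: UNS: / { if (folder == "") folder = "Unsure" }
    END { print (folder == "" ? "INBOX" : folder) }
  '
}

# Usage: crm_folder < some-message-file
```

A spam verdict wins over the 'UNS: ' subject tag, and a 'Subject:' line in the message body is ignored because reading stops at the first blank line.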

Finally I wrote the following very short shell script, which I will run from cron every ten minutes for a while, and probably reduce to every few hours later:


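#!/bin/bash
# Train CRM114 on hand-sorted mail, then file each message away.
# Paths are relative, so this expects to run from the home directory (as cron does).

# Everything in NewSpam is learned as spam, then moved to Spam.
for SPAM in Maildir/.CRM.NewSpam/cur/*; do
  [ ! -f "${SPAM}" ] && break   # unmatched glob means the folder is empty
  /usr/bin/crm -u /home/andrew --spam /usr/share/crm114/mailreaver.crm <"${SPAM}" >/dev/null
  mv "${SPAM}" Maildir/.CRM.Spam/cur
done

# Everything in TrainingHam is learned as good mail, then moved to Trained.
for HAM in Maildir/.CRM.TrainingHam/cur/*; do
  [ ! -f "${HAM}" ] && break    # unmatched glob means the folder is empty
  /usr/bin/crm -u /home/andrew --good /usr/share/crm114/mailreaver.crm <"${HAM}" >/dev/null
  mv "${HAM}" Maildir/.CRM.Trained/cur
done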
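
While deciding how often cron should run the script, it can help to see how many messages are waiting in each training folder. The helper below is a small sketch for that; the name `pending` is hypothetical, and the folder paths are the ones used in the script above.

```shell
#!/bin/bash
# Count the regular files waiting in a Maildir folder's cur/ directory.
# $1 = path to the folder, e.g. Maildir/.CRM.NewSpam
pending() {
  local count=0 msg
  for msg in "$1"/cur/*; do
    # An unmatched glob stays as a literal pattern, which is not a file
    [ -f "$msg" ] && count=$((count + 1))
  done
  echo "$count"
}

# Usage (from the home directory):
#   pending Maildir/.CRM.NewSpam
#   pending Maildir/.CRM.TrainingHam
```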

Training

The first ten or so messages all came through as 'Unsure', as you would expect, and I moved them into the NewSpam and TrainingHam folders as appropriate. Pretty soon after that the first mail started getting identified one way or the other, and it just got better and better.

mutt integration

Hi,

I am training crm114 with a few shortcuts from within mutt (sorry, very long lines):


macro index \es "|grep -a -v \"X-CRM114-\" |crm -u /home/jrschulz/.crm114 /home/jrschulz/.crm114/mailfilter.crm --force --learnspam |grep -a \"X-CRM114-\" \n=spam\n" "crm114 learn as spam"

macro pager \es "|grep -a -v \"X-CRM114-\" |crm -u /home/jrschulz/.crm114 /home/jrschulz/.crm114/mailfilter.crm --force --learnspam |grep -a \"X-CRM114-\" \n=spam\n" "crm114 learn as spam"

macro index \eh "|grep -a -v \"X-CRM114-\" |crm -u /home/jrschulz/.crm114 /home/jrschulz/.crm114/mailfilter.crm --force --learnnonspam |grep -a \"X-CRM114-\" \n" "crm114 learn as ham"

macro pager \eh "|grep -a -v \"X-CRM114-\" |crm -u /home/jrschulz/.crm114 /home/jrschulz/.crm114/mailfilter.crm --force --learnnonspam |grep -a \"X-CRM114-\" \n" "crm114 learn as ham"

That way you can just use crm114's header to sort mails into the proper directories and press two keys to tell crm114 when it failed (Esc-h for "this is ham", Esc-s for "this is spam").

Additionally, you can tell mutt how to recognize the spam score by using

spam "^X-CRM114-Status: SPAM . pR: ([-0-9]+\.[0-9])" "%1"

Then you can use the pattern ~H to colorize spam in the index. You can even sort your mailbox containing spam (or "Unsure") by the spam score assigned by crm114.
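
Outside mutt, the same score can be pulled out of a message with a little shell. This is just a sketch: `crm_score` is a made-up name, and the sample header in the usage line is modelled on the regular expression above rather than copied from real crm114 output.

```shell
#!/bin/bash
# Print the pR score from the X-CRM114-Status header of a message on stdin.
# Prints nothing if the header or the score is missing.
crm_score() {
  # Stop at the first blank line so only headers are searched, then print
  # the number that follows "pR:" on the status line.
  sed -n -e '/^$/q' \
         -e 's/^X-CRM114-Status:.*pR: *\(-\{0,1\}[0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Usage with an illustrative header:
printf 'X-CRM114-Status: SPAM ( pR: -12.3456 )\n\nbody\n' | crm_score
```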

dovecot-antispam goodness

Hi Andrew,

your post inspired me to go off and give CRM114 a try. I've also found what looks like the best way to integrate CRM114 training with IMAP so far, and that's the dovecot-antispam plugin. It's functionally equivalent to what your cron script does, but it works in real time.

FWIW, here's what I had to do to set it up:

  1. Install CRM114 to ~/crm as per the instructions. The only significant change I made to the default Debian etch configuration was to remove the ADV/UNS tagging of subjects, since I don't need this.
  2. Create the "crm-spam" and "crm-unsure" folders, for holding SPAM and "unsure" email respectively.
  3. Set up my ~/.mailfilter to filter mail through CRM114:
    xfilter "/usr/bin/crm -u $HOME/crm $HOME/crm/mailreaver.crm"
    if (/^X-CRM114-Status: SPAM/:h)
        to Maildir/.crm-spam/
    if (/^X-CRM114-Status: UNSURE/:h)
        to Maildir/.crm-unsure/
    # [...] At this point we know CRM114 thinks the message is Ham, so we file it as normal
  4. Download and install the dovecot-antispam plugin, and configure it as follows:

    plugin {
    # semicolon-separated list of Trash folders
    antispam_trash = trash;Trash;Deleted Items

    # semicolon-separated list of spam folders
    antispam_spam = crm-spam

    # semicolon-separated list of unsure folders
    antispam_unsure = crm-unsure

    # crm114-exec plugin

    # mailreaver binary
    antispam_crm_binary = /usr/bin/crm

    # semicolon-separated list of extra arguments to crm
    antispam_crm_args = -u;%h/crm;%h/crm/mailreaver.crm

    # NOTE: you need to set the signature for this backend
    antispam_signature = X-CRM114-CacheID
    }

This left me with a setup where:

  • Ham goes to my INBOX, or whichever subfolder I have set in ~/.mailfilter (e.g. for mailing lists).
  • SPAM goes to the crm-spam folder.
  • UNSURE email goes to the crm-unsure folder.
  • To train messages that turn up in the crm-unsure folder, I just move them into the appropriate folder (crm-spam if SPAM or somewhere else if Ham).
  • To retrain Ham misclassified as SPAM, I move it from the crm-spam folder to wherever it belongs.
  • To retrain SPAM misclassified as Ham, I move it from whichever folder it turned up in to the crm-spam folder.
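
The rules in that list can be written down as a small shell function. This is a sketch of the behaviour as I've described it, not the plugin's actual code: the name `train_action` is made up, the "spam" and "good" results simply mirror the flags in Andrew's cron script, and treating moves to the trash folders as "no training" is my assumption about what antispam_trash is for.

```shell
#!/bin/bash
# Map a message move (source folder, destination folder) to a training action.
# Prints "spam", "good", or "none".
train_action() {
  local from=$1 to=$2
  case "$to" in
    # Assumption: deleting a message is not a training signal
    trash|Trash|"Deleted Items") echo none; return ;;
  esac
  if [ "$from" = "$to" ] || [ "$to" = "crm-unsure" ]; then
    echo none    # nothing learned from these moves
  elif [ "$to" = "crm-spam" ]; then
    echo spam    # moved into the spam folder
  elif [ "$from" = "crm-spam" ] || [ "$from" = "crm-unsure" ]; then
    echo good    # rescued from spam or unsure
  else
    echo none    # ordinary move between ham folders
  fi
}

# Usage: train_action crm-unsure INBOX
```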

How much easier can it get? :-)

Now to see how well CRM114 does after it's had some serious training.

-mato
