MozillaZine

Experimental Mozilla Thunderbird Build with Improved Junk Mail Classification Available

Monday February 16th, 2004

A new test build of Mozilla Thunderbird with an improved junk mail detection system is now available. The binary, for Windows, uses an enhanced version of the Bayesian spam classification algorithm and allows users to fine-tune the sensitivity of the filter with a hidden preference. Some testers have reported that this new code catches twice as much spam as the old version. Downloaders are advised to start off with a new training.dat file and to allow for some retraining before judging the effectiveness of the new system. Read Scott MacGregor's post to the Thunderbird Builds forum for more information about this experimental junk mail build, including download links and tips.

#1 suite

by sime

Monday February 16th, 2004 10:52 PM

Is this going to touch the suite? Or is the suite now slowing down?

#2 will it keep the existing training?

by smkatz

Tuesday February 17th, 2004 10:15 AM

I worked hard to train the Mozilla Mail filter, going back and marking old junk mail. (It is statistical, and because we are talking about a small number of spammers and lists, past spam will represent future spam, so this is useful.) I find that I must mark every single spam message as junk, and any improperly marked messages as not junk, in order for it to work. After that, it works well, so long as I *always* mark spam and *always* inform it when it makes an extremely rare error. The errors I see generally come from the fact that I have to mark similar messages from the same sender about twice, and maybe even a third time, before it adjusts correctly (in my case, to learn that Wesley Clark was not a spammer).

The only area of improvement would be that it should keep track of recently marked spam messages by their Message-ID header so that duplicates I've already marked can be treated the way I asked (as "Junk" or as "Not Junk"). It should compare recently used subjects and senders in a similar manner to speed up training. It's fine if it wants me to review those separately and say "correct, these (collectively) are spam", but having people mark duplicates of messages they've already marked makes novice users confused about the filter's effectiveness. Because it is statistical, it only needs to keep recent ones, so there is no need for another cache to clear.
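The Message-ID idea above could be sketched roughly as follows. This is only an illustration of the suggestion, not how Mozilla Mail or Thunderbird actually works: the `RecentMarks` class, the `classify` function, and the default capacity of 1000 are all hypothetical names and values chosen for the example, and the Bayesian filter's own verdict is simply passed in as a boolean.

```python
from collections import OrderedDict


class RecentMarks:
    """Bounded memory of the user's most recent junk / not-junk decisions.

    Hypothetical helper for illustration; not part of any Mozilla code.
    """

    def __init__(self, capacity=1000):
        self.capacity = capacity
        # Maps Message-ID -> True (junk) or False (not junk), oldest first.
        self._marks = OrderedDict()

    def remember(self, message_id, is_junk):
        # Re-inserting moves the entry to the "most recent" end.
        self._marks.pop(message_id, None)
        self._marks[message_id] = is_junk
        # Drop the oldest entries so the cache never needs manual clearing.
        while len(self._marks) > self.capacity:
            self._marks.popitem(last=False)

    def lookup(self, message_id):
        # Returns True/False for a remembered message, or None if unknown.
        return self._marks.get(message_id)


def classify(message_id, bayes_verdict, recent):
    """Prefer the user's earlier decision for a duplicate message;
    otherwise fall back to the statistical filter's verdict."""
    previous = recent.lookup(message_id)
    return bayes_verdict if previous is None else previous
```

With this sketch, a duplicate of a message already marked as junk is treated as junk even if the statistical filter has not caught up yet, and old entries simply fall off the end as new ones are marked.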

But I am very satisfied with the performance of the spam filter, which has now adapted correctly, misses very little spam, and produces no false positives. Will this "new" code really use the old training.dat file to the same effectiveness?