MozillaZine

Proposals for Incorporating Machine Learning in Mozilla Firefox

Friday June 18th, 2004

Blake Ross writes: "I will be doing research this summer at Stanford with Professor Andrew Ng about how we can incorporate machine learning into Firefox. We're looking for ideas that will make Firefox 2.0 blow every other browser out of the water. People who come up with the best 3-5 ideas win Gmail accounts, and if we implement your idea you'll be acknowledged in both our paper and in Firefox credits. Your idea will also be appreciated by the millions of people who use Firefox :-). We'll also entertain Thunderbird proposals."


#55 Thunderbird automatic filtering

by leafdigital

Monday June 21st, 2004 5:24 AM


My mail would benefit from being categorised, but I'm too lazy. The software should do it for me. (This is not quite the same as somebody else's suggestion for using manually configured categorisation that can take advantage of the Bayes engine.)

There are a number of ways you could create an automated filtering system. One very simple way would be to use the 'to' address: if you receive a number of messages to a specific 'to' address (e.g. <webmaster@example.com>) that isn't your main account address, it could automatically create a new folder and filter those messages for you. You'd want this to work in a 'sensible' way, so that it creates a folder only when you would actually be likely to want one, and doesn't bother if you only get 1 message per month to that address or if somebody sends a bunch of spam there.
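To make the 'sensible' part concrete, here is a rough sketch in Python of how the 'to' address rule might look. The function name, the `min_count` and `max_spam_ratio` thresholds and the (address, is_spam) input format are all made up for illustration; none of this is how Thunderbird works.

```python
from collections import Counter

def suggest_folders(messages, main_address, min_count=5, max_spam_ratio=0.5):
    """Return 'to' addresses that look worth an automatic folder.

    messages: iterable of (to_address, is_spam) pairs for some recent
    period (hypothetical input format).
    """
    totals = Counter()
    spam = Counter()
    main = main_address.strip().lower()
    for to_addr, is_spam in messages:
        addr = to_addr.strip().lower()
        if addr == main:
            continue  # mail to the main address stays in the inbox
        totals[addr] += 1
        if is_spam:
            spam[addr] += 1

    suggestions = []
    for addr, count in totals.items():
        if count < min_count:
            continue  # too little mail to bother with a folder
        if spam[addr] / count > max_spam_ratio:
            continue  # mostly spam, so not a folder you'd want
        suggestions.append(addr)
    return sorted(suggestions)
```

With the default thresholds, an address that got six real messages would be suggested, while one that got a single message, or eight spams, would not. A learning system would presumably tune those thresholds from user feedback rather than hard-coding them.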

(For the UI I would implement this as 'categorisation' - the AI system 'categorises' messages according to these rules, makes the categories itself, and may create automatic filters, but you can change the filters by assigning messages from a category to go into a different folder etc.)

Just using 'to' addresses would actually be a really good start for some users; then extending it to 'from' addresses might also help (e.g. you get a lot of mail from your boyfriend, why not filter that elsewhere). You might then start looking at other factors such as text classification (the existing Bayes system), but I'm uncertain as to how best to implement that. Vector quantisation techniques might help. Perhaps this could only work in conjunction with user feedback.

This kind of system does rather reveal the flaws in the 'folder' system of mail storage in the first place; clearly a better kind of mail database ought to be used, one which *doesn't* tie a message's category to its physical storage location (i.e. it just stores mail by date received in some easily backupable manner) but applies one or more categories to each piece of mail, as set by the user or by AI systems. I can see that being a UI research topic, though it's not really AI. (Basically, you can still see mail as folders if you like, but a message could be in more than one folder.) I'm sure it has been implemented before.
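As a toy illustration of the storage idea, here is a sketch in Python where mail is kept in one list in arrival order and 'folders' are just views over category labels. The `MailStore` class and its method names are invented for this example.

```python
from collections import defaultdict

class MailStore:
    """Toy store: one message list, categories kept separately."""

    def __init__(self):
        self._messages = []                  # (received, subject), in arrival order
        self._categories = defaultdict(set)  # category name -> set of message ids

    def add(self, received, subject, categories=()):
        msg_id = len(self._messages)
        self._messages.append((received, subject))
        for category in categories:
            self._categories[category].add(msg_id)
        return msg_id

    def tag(self, msg_id, category):
        # The user or an automatic classifier can add a label later.
        self._categories[category].add(msg_id)

    def folder(self, category):
        # A 'folder' is only a view: subjects of tagged mail, in arrival order.
        ids = self._categories.get(category, set())
        return [self._messages[i][1] for i in sorted(ids)]
```

The same message can show up in several of these views without being copied or moved, which is the point of separating categories from storage.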

Incidentally, I know this isn't the place for it but bloody hell Thunderbird really needs to store the 'filter' setting per-folder... in my inbox I only view unread, everywhere else I want view all... it's cool that you can create your own filters and stuff, great, very nice, but sort of missing on that basic functionality.

Another obvious flaw in the current program is the spam detector (which is good, but could be improved); the solid Bayesian text-only-ignore-the-headers approach that appears to be used means it misses obvious opportunities for learning, like 'mail sent to <real.address@example.com> or <another.real.address@example.com> is more likely to be nonspam', once the user has marked some mail to those addresses [and conversely, mail sent to <chat@example.com> or <sdgsgsdgsdg@example.com> is more likely to be spam]. Isn't there some way to take account of at least these basic headers? It doesn't appear to do so currently.
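One simple way to take the 'to' header into account is to feed it to the classifier as an extra token alongside the body words. The Python sketch below is a simplified naive-Bayes-style log-odds scorer written for illustration; the `TinyBayes` class and the `to:` token prefix are assumptions of mine, and it is not a description of Thunderbird's actual filter.

```python
import math
from collections import Counter

def features(to_addr, body):
    # Treat the 'to' address as one more token next to the body words.
    return ["to:" + to_addr.strip().lower()] + body.lower().split()

class TinyBayes:
    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, label, to_addr, body):
        # Count each token once per message it appears in.
        self.docs[label] += 1
        self.counts[label].update(set(features(to_addr, body)))

    def spam_score(self, to_addr, body):
        # Log-odds of spam over the tokens present; above 0 leans spam.
        score = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for tok in set(features(to_addr, body)):
            p_spam = (self.counts["spam"][tok] + 1) / (self.docs["spam"] + 2)
            p_ham = (self.counts["ham"][tok] + 1) / (self.docs["ham"] + 2)
            score += math.log(p_spam / p_ham)
        return score
```

After training on a few marked messages, two mails with identical bodies get different scores depending on which address they were sent to, which is exactly the signal a body-only filter throws away.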

I don't think Fire* really needs AI to any great extent except for one issue which somebody else mentioned - automatic bookmark categorising. Yes please. Could be tricky, though... The history search is a great idea too.

--sam