MozillaZine

Proposals for Incorporating Machine Learning in Mozilla Firefox

Friday June 18th, 2004

Blake Ross writes: "I will be doing research this summer at Stanford with Professor Andrew Ng about how we can incorporate machine learning into Firefox. We're looking for ideas that will make Firefox 2.0 blow every other browser out of the water. People who come up with the best 3-5 ideas win Gmail accounts, and if we implement your idea you'll be acknowledged in both our paper and in Firefox credits. Your idea will also be appreciated by the millions of people who use Firefox :-). We'll also entertain Thunderbird proposals."


#70 Re: Nuke Anything + Bayesian Filter

by phaasz <phaasz@hotmail.com>

Tuesday June 22nd, 2004 7:51 AM


Hmm, sounds cool. I think some thought would need to go into identifying the "unit" for the engine to use, though. For email it's easy; the unit is a single message. It seems that su is suggesting a scheme where you could select arbitrary text (eg "hate speech"), which points to a switch to tokenising arbitrary numbers of consecutive words (which would be resource intensive)... or are you suggesting some other tokenising scheme?
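To make that cost concrete, here is a rough sketch (Python, purely illustrative - none of this is Firefox code, and the function names are mine) of single-word tokens versus every run of up to N consecutive words. The token count grows roughly N-fold with the window size, which is where the resource cost comes from:

    import re

    def words(text):
        # split into lowercase word tokens
        return re.findall(r"[a-z0-9']+", text.lower())

    def ngrams(text, max_n=3):
        # every run of 1..max_n consecutive words; for a page of w
        # words this yields roughly w * max_n tokens, so covering
        # arbitrary selections ("hate speech", ...) gets expensive
        ws = words(text)
        for n in range(1, max_n + 1):
            for i in range(len(ws) - n + 1):
                yield " ".join(ws[i:i + n])

    print(list(ngrams("some arbitrary selected text", 2)))
    # ['some', 'arbitrary', 'selected', 'text',
    #  'some arbitrary', 'arbitrary selected', 'selected text']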

How about something that blocks entire URLs based on Bayesian filtering (whether the resource is an image, a web page, etc)? But I would not just look at the URL string itself (as most ad-blocking software currently does), since Bayesian filtering is most effective when more information is available - eg for spam, taking into account the entire email content, including headers. Which means we need more than just a URL in order to be smarter...
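As a baseline, tokenising just the URL string is easy - a sketch of one possible scheme (the splitting rule here is my own assumption, not what any existing ad blocker does):

    import re

    def url_tokens(url):
        # split the URL on punctuation so "ads", "banner", "468x60"
        # etc. each become a token the filter can learn on
        return [t for t in re.split(r"[/.:?&=_-]+", url.lower()) if t]

    print(url_tokens("http://ads.example.com/banner_468x60.gif"))
    # ['http', 'ads', 'example', 'com', 'banner', '468x60', 'gif']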

We could also parse parts of (or all of) the linking page (sketched below):

- the HTML element in which the URL appears (a, img, etc)
- the attributes of that element (eg height, width, alt text)
- the style of that element
- other contextual aspects?
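Something along these lines - the tag and attribute selection is just an assumption, and a real implementation would hook Gecko's DOM rather than re-parse the HTML; Python's stdlib parser stands in purely for illustration:

    from html.parser import HTMLParser

    class LinkContext(HTMLParser):
        # collect (url, feature-token) pairs from <a> and <img> tags
        def __init__(self):
            super().__init__()
            self.features = []

        def handle_starttag(self, tag, attrs):
            if tag not in ("a", "img"):
                return
            attrs = dict(attrs)
            url = attrs.get("href") or attrs.get("src")
            if not url:
                return
            self.features.append((url, "tag:" + tag))
            for name in ("width", "height", "alt", "style", "class"):
                if name in attrs:
                    self.features.append((url, name + ":" + attrs[name]))

    p = LinkContext()
    p.feed('<img src="http://ads.example/b.gif" width="468" height="60">')
    print(p.features)
    # [('http://ads.example/b.gif', 'tag:img'),
    #  ('http://ads.example/b.gif', 'width:468'),
    #  ('http://ads.example/b.gif', 'height:60')]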

And for the resource itself (see the header-peeking sketch below):

- HTTP header fields (after which we could possibly drop the connection?)
- the entire page (for web pages)
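For the header idea, something like this - sketch only, since a browser would do this inside its own network layer rather than via a separate request:

    import urllib.request

    # peek at the response headers before committing to the body, so
    # eg a huge Content-Length can be rejected before the download
    req = urllib.request.Request("http://example.com/", method="HEAD")
    with urllib.request.urlopen(req) as resp:
        header_tokens = ["%s:%s" % (k.lower(), v) for k, v in resp.getheaders()]
    print(header_tokens)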

This scheme could then be trained for (one token database per category, sketched below):

- blocking offensive content
- blocking ads
- blocking certain content types (eg swf)
- blocking large content (by reading the HTTP headers)
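That presumably means a separate token database per blocking category, each trained when the user marks a resource as good or bad - a hypothetical interface (names and categories are mine):

    from collections import Counter

    class CategoryFilter:
        def __init__(self):
            self.good = Counter()   # token -> count in allowed resources
            self.bad = Counter()    # token -> count in blocked resources

        def train(self, tokens, block):
            (self.bad if block else self.good).update(tokens)

    filters = {name: CategoryFilter()
               for name in ("offensive", "ads", "content-type", "size")}
    filters["ads"].train(["tag:img", "width:468", "height:60"], block=True)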

To do all of these things, of course, the engine would need to be semantically aware (eg treat a number in the Content-Length HTTP header field differently from the same number in a URL), but this shouldn't be too hard using a scheme similar to the one suggested in Paul Graham's second attempt: <http://www.paulgraham.com/better.html>.
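Ie the same literal value becomes a different token depending on where it appeared, the way Graham's better.html marks header tokens like "Subject*free" - here combined with his usual probability formula (example probabilities are made up):

    def mark(field, value):
        # context-marked token, in the spirit of better.html:
        # "468" in a URL and "468" in Content-Length stay distinct
        return "%s*%s" % (field, value)

    tokens = [mark("url", "468"), mark("content-length", "468")]

    def combine(probs):
        # Graham's combined probability over the individual token
        # probabilities: p1..pn / (p1..pn + (1-p1)..(1-pn))
        p = q = 1.0
        for x in probs:
            p *= x
            q *= 1.0 - x
        return p / (p + q)

    print(combine([0.9, 0.8, 0.2]))   # -> ~0.9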

...enough rambling!