MozillaZine

Proposals for Incorporating Machine Learning in Mozilla Firefox

Friday June 18th, 2004

Blake Ross writes: "I will be doing research this summer at Stanford with Professor Andrew Ng about how we can incorporate machine learning into Firefox. We're looking for ideas that will make Firefox 2.0 blow every other browser out of the water. People who come up with the best 3-5 ideas win Gmail accounts, and if we implement your idea you'll be acknowledged in both our paper and in Firefox credits. Your idea will also be appreciated by the millions of people who use Firefox :-). We'll also entertain Thunderbird proposals."


#34 Re: Learned Screen Scraping?

by alan8373 <alan8373@deronyan.com>

Saturday June 19th, 2004 8:20 PM


It's me again: I had some more thoughts on my idea above. This could really be applied to almost ANY site, not just news sites. We could teach Firefox to 'scrape' any site and pull out just the guts of what we actually want. We could also set up a central repository to store other people's settings for scraped sites, so Firefox could consult it whenever it loads a page. Imagine visiting a site you've never seen before and getting only GUTS: no banners, no junk. The whole community could collectively maintain these settings for every site on the Net. Way cool!

Also, what if we fed this into an RSS-type engine? Imagine getting RSS feeds that carry the actual body of a news story and nothing else. You'd never have to visit the site again to get the whole story after seeing only an RSS teaser or headline. Imagine how the advertisers would be pooping in their pants if Firefox and some sister website or service provided the Net's only completely ad-free browsing experience.

Sorry for rambling, but I'm thinking more and more that there might be good stuff here. What if this scraping ability were extended into a completely automated feature, so the browser used Bayesian-style filtering at the HTML element level to decide which portions of a page were 'spam' (ads and such) and which were the real guts of the page? Think SpamAssassin, but run once for each HTML block on a page. I apologize for the long-windedness of this, but I just had to get these ideas out.
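The "SpamAssassin per HTML block" suggestion maps fairly directly onto a word-level naive Bayes classifier run over each block of a page. The sketch below is a minimal illustration of that idea only, not anything from Firefox or the Stanford project: the BlockExtractor, the NaiveBayesBlockFilter class, the 'content'/'clutter' labels, and the tiny training samples are all hypothetical, and a real system would need richer features (tag names, link density, position on the page) and real training data.

# Minimal sketch of per-block Bayesian filtering: extract block-level text
# from HTML, then score each block as 'content' (keep) or 'clutter' (strip).
# All class names, labels, and sample data here are made up for illustration.

import math
import re
from collections import Counter
from html.parser import HTMLParser


class BlockExtractor(HTMLParser):
    """Collects the text of block-level elements (p, div, td, li) as candidate blocks."""
    BLOCK_TAGS = {"p", "div", "td", "li"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS and self._depth > 0:
            self._depth -= 1
            text = " ".join(self._buf).strip()
            if text:
                self.blocks.append(text)
            self._buf = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buf.append(data.strip())


def tokens(text):
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesBlockFilter:
    """Word-level naive Bayes over HTML blocks, in the spirit of Bayesian spam filters."""

    def __init__(self):
        self.word_counts = {"content": Counter(), "clutter": Counter()}
        self.block_counts = {"content": 0, "clutter": 0}

    def train(self, block_text, label):
        self.block_counts[label] += 1
        self.word_counts[label].update(tokens(block_text))

    def score(self, block_text):
        """Return log P(content|block) - log P(clutter|block); positive means keep."""
        log_odds = math.log((self.block_counts["content"] + 1) /
                            (self.block_counts["clutter"] + 1))
        for word in tokens(block_text):
            p_word = {}
            for label in ("content", "clutter"):
                vocab = len(self.word_counts[label]) + 1  # Laplace smoothing
                p_word[label] = (self.word_counts[label][word] + 1) / (
                    sum(self.word_counts[label].values()) + vocab)
            log_odds += math.log(p_word["content"] / p_word["clutter"])
        return log_odds


if __name__ == "__main__":
    # Hypothetical usage: train on a few hand-labelled blocks, then filter a page.
    nb = NaiveBayesBlockFilter()
    nb.train("click here for a free prize sponsored links advertisement", "clutter")
    nb.train("home news downloads contact us site map", "clutter")
    nb.train("the committee released its report on browser security today", "content")
    nb.train("developers discussed the roadmap for the next release", "content")

    page = ("<div>Sponsored links: click here for a free prize</div>"
            "<p>The committee released a long report on browser security.</p>")
    extractor = BlockExtractor()
    extractor.feed(page)
    for block in extractor.blocks:
        verdict = "keep" if nb.score(block) > 0 else "strip"
        print(f"[{verdict}] {block}")

In this toy run the "Sponsored links" block scores negative and gets stripped while the report sentence is kept; the shared repository the comment describes would amount to pooling such trained settings (or hand-written per-site rules) so a page could be cleaned on first visit.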