MozillaZine

Proposals for Incorporating Machine Learning in Mozilla Firefox

Friday June 18th, 2004

Blake Ross writes: "I will be doing research this summer at Stanford with Professor Andrew Ng about how we can incorporate machine learning into Firefox. We're looking for ideas that will make Firefox 2.0 blow every other browser out of the water. People who come up with the best 3-5 ideas win Gmail accounts, and if we implement your idea you'll be acknowledged in both our paper and in Firefox credits. Your idea will also be appreciated by the millions of people who use Firefox :-). We'll also entertain Thunderbird proposals."


#32 Learned Screen Scraping?

by alan8373 <alan8373@deronyan.com>

Saturday June 19th, 2004 8:00 PM


I have an idea that's best described as Learned Screen Scraping. Here's the gist of it: go to any news site and you'll invariably find a dozen or more portions of the page that are basically useless (banner ads, links, login info, and other junk) surrounding the actual contents of the story you want to read. What if you could teach Firefox to scrape the page by highlighting the actual body of the news story and clicking a 'scrape' or 'learn to scrape' button on the toolbar? The next time you read a news story from that site, Firefox could, based on what you taught it, either hide whatever was not highlighted, maybe through some CSS magic, or do an in-memory screen scrape and then apply some other CSS stylesheet to display just the body of the story.

This would work for most sites because content management systems typically generate the same HTML structure for every page on the site, with only the content changing. So when you highlight a story with the mouse and tell Firefox to 'learn', it could learn that everything highlighted actually means something like: go to the html ... body ... table ... tr ... td element and scrape everything in there. For advanced users, maybe we could even do it with the actual page's HTML source.

I would LOVE a feature like this, so I could read sites like eweek.com and linuxtoday.com without the ads and the other extraneous junk on the news pages. I'm sure the advertisers would not be too happy about it, but I'm also sure other people would like this feature. I know I would!
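
To make the "learn the html ... body ... table ... tr ... td path" step a bit more concrete, here is a rough sketch in Python using only the standard library. It is not Firefox code and not a proposed implementation. The names PathCollector, learn_path and scrape are made up for illustration, and it assumes that a purely positional element path (tag name plus sibling index) is stable from one story page to the next on the same site, which real sites may well break.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they are skipped when building paths.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Hypothetical helper: records the text found under every element path,
    where a path looks like 'html[1]/body[1]/table[1]/tr[1]/td[2]'."""

    def __init__(self):
        super().__init__()
        self.stack = []        # path steps of the currently open elements
        self.counters = [{}]   # per-depth count of child tags seen so far
        self.text_by_path = {}

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        counts = self.counters[-1]
        counts[tag] = counts.get(tag, 0) + 1
        self.stack.append(f"{tag}[{counts[tag]}]")
        self.counters.append({})

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        names = [step.split("[")[0] for step in self.stack]
        if tag not in names:
            return  # stray closing tag; ignore it
        # Pop until the matching element is closed (tolerates unclosed tags).
        while self.stack:
            top = self.stack.pop()
            self.counters.pop()
            if top.split("[")[0] == tag:
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text:
            return
        # Credit the text to the current element and to all of its ancestors.
        for depth in range(1, len(self.stack) + 1):
            path = "/".join(self.stack[:depth])
            self.text_by_path.setdefault(path, []).append(text)


def learn_path(html, highlighted):
    """Return the deepest element path whose text contains the highlighted text."""
    collector = PathCollector()
    collector.feed(html)
    needle = " ".join(highlighted.split())
    candidates = [path for path, parts in collector.text_by_path.items()
                  if needle in " ".join(parts)]
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.count("/"))


def scrape(html, path):
    """Return the text found at a previously learned path on another page."""
    collector = PathCollector()
    collector.feed(html)
    return " ".join(collector.text_by_path.get(path, []))
```

Usage would be along the lines of calling learn_path(first_page_html, highlighted_text) once when the user clicks 'learn', storing the returned path per site, and then calling scrape(next_page_html, stored_path) on later visits. A real browser feature would work on the live DOM rather than re-parsing the source, and would need a fallback for when the learned path no longer matches anything.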