Common project themes
Processing large amounts of web data
- Detect near duplicates in 180M web pages
- Store the connectivity graph of the web (~180M nodes, ~290M edges)
- Correlation analysis of AltaVista query logs (1B requests, 280M user sessions over 42 days)
- Comparing search engine sizes (perform tens of thousands of searches)
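As a small illustration of the near-duplicate theme above, here is a minimal sketch in Python. It assumes one common approach (word shingles compared by Jaccard similarity); the slides do not say which method the projects used, and the names `shingles`, `jaccard`, and `near_duplicates`, the shingle length `k`, and the threshold are all hypothetical choices made for this example.

```python
from itertools import combinations


def shingles(text, k=4):
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short documents become a single shingle (or nothing, if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(pages, threshold=0.7, k=4):
    """Return sorted pairs of page names whose shingle sets are similar.

    pages maps a page name to its text. The threshold is an assumed
    value for illustration, not one taken from the slides.
    """
    sets = {name: shingles(text, k) for name, text in pages.items()}
    pairs = []
    for x, y in combinations(sorted(sets), 2):
        if jaccard(sets[x], sets[y]) >= threshold:
            pairs.append((x, y))
    return pairs


pages = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank today",
    "b": "the quick brown fox jumps over the lazy dog near the river bank tonight",
    "c": "completely unrelated text about search engine index sizes",
}
print(near_duplicates(pages))
```

Comparing every pair is quadratic in the number of pages, so this form only works for small collections. At the scale of 180M pages, each shingle set would need to be reduced to a compact sketch so that candidate pairs can be found without all-pairs comparison.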
The Web is a distributed database we should exploit
- Lots and lots of neat stuff out there!
- We want to automate “web computations” such as:
  - filling in web forms,
  - extracting data from web pages,
  - rewriting the contents of web pages, etc.
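To make "extracting data from web pages" concrete, here is a minimal sketch that uses only Python's standard-library `html.parser`. The class `LinkAndFormExtractor`, the function `extract`, and the sample page are hypothetical names and data invented for this example; they are not the tooling the slides refer to.

```python
from html.parser import HTMLParser


class LinkAndFormExtractor(HTMLParser):
    """Collect hyperlink targets and named form inputs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []        # href values of <a> tags
        self.form_fields = []  # name attributes of <input> tags

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "input" and attrs.get("name"):
            self.form_fields.append(attrs["name"])


def extract(html):
    """Return (links, form_fields) found in the given HTML string."""
    parser = LinkAndFormExtractor()
    parser.feed(html)
    parser.close()
    return parser.links, parser.form_fields


SAMPLE = (
    '<html><body><a href="http://example.com/a">A</a>'
    '<form action="/search"><input name="q"><input type="submit"></form>'
    "</body></html>"
)
links, fields = extract(SAMPLE)
print(links, fields)
```

The field names found this way are what a form-filling step would need in order to build a query. Submitting the form and rewriting page contents are left out of this sketch.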