Block Structure of the Stanford-Berkeley Web Graph

Convergence Rate of Quadratic Extrapolation

Node PageRank vs. Node Convergence Rate

Linkage Density of Sites on the Web

Efficiently multiplying Ax in PageRank

Personalized Search

2002 - 2007

In web search, query language is generally sparse and ambiguous -- the average length of a query is less than 2.5 words. But like in human language, the context around the user issuing a query (such as his location, long-term interests, or near-term task) can often give insight into his information need.

In the original PageRank paper that described Google's search engine, Page and Brin describe an algorithm for personalized PageRank, wherein a customized PageRank vector is computed for each search user based on such context. This would allow search results to be tailored to the person as well as the query, so that, for example, a baseball fan in San Francisco who searches for "giants" will get different results than a football fan in New York who performs the same query.

The computation of personalized PageRank as described involved computing the principal eigenvector of a different web matrix for each user. Since there are ~1 billion search users and the order of the web matrix is ~10 billion, this computation would have taken several million machines with existing methods. Our research in personalized search focused at first on developing efficient eigensolvers to make it possible to compute personalized PageRank in practice, and later in building a large-scale search engine, called Kaltix, based on these algorithms.

This is joint work with Taher Haveliwala, Glen Jeh, Rajeev Motwani, Chris Manning, and Gene Golub.

Books

  • Numerical Algorithms for Personalized Search in Large-Scale Self-Organizing Information Networks. Princeton: Princeton University Press. 2010.

Selected Publications

  • Extrapolation Methods for Accelerating PageRank Computations. Proceedings of the Twelfth International World Wide Web Conference, May, 2003. (with Taher Haveliwala, Chris Manning, and Gene Golub).
  • Exploiting the Block Structure of the Web for Computing PageRank. Stanford University Technical Report. June, 2003. (with Taher Haveliwala, Chris Manning, and Gene Golub).
  • The Second Eigenvalue of the Google Matrix. Stanford University Technical Report. June, 2003. (with Taher Haveliwala).
  • Adaptive Methods for the Computation of PageRank. Linear Algebra and its Applications, Special Issue on the Numerical Solution of Markov Chains. November, 2003. (with Taher Haveliwala and Gene Golub).
  • An Analytical Comparison of Approaches to Personalizing PageRank. Stanford University Technical Report. June, 2003. (with Taher Haveliwala and Glen Jeh).
Companies

  • Kaltix. Founder and CEO. Personalized search engine acquired by Google in September 2003.
  • Google. Head of Personalization. 2003-2007.

Data

  • Stanford Web Matrix. 281903 pages, ~2.3 million links. 11.7MB zipped, 64.2MB unzipped. From a September 2002 crawl. Matlab format. Note:loadStanfordMatrix.m loads the matrix such that the rows represent the inlinks of a page, and the columns represent the outlinks.
  • Stanford-Berkeley Web Matrix. 683446 pages, ~7.6 million links. 32.5MB zipped, 125.8MB unzipped. From a December2002 crawl. Matlab format. Note:loadSBMatrix.m loads the matrix such that the rows represent the inlinks of a page, and the columns represent the outlinks.

Talks

  • Extrapolation Techniques for Computing PageRank. At WWW 2003.

Code

  • Basic PageRank Algorithm. in Matlab.