Sunday, October 02, 2005

PageRank, Markov Processes and Swarm Intelligence

Since no endeavor in web search would be complete without considering Google PageRank, I've been studying the PageRank algorithm today. It overlaps nicely with the Linear Algebra class I am taking, where we have been assigned a paper about an application of linear algebra. As many web developers and people interested in internet search probably know, PageRank attempts to measure the global importance of pages on the Web. It does this by looking at the way links are distributed, counting each link as a vote. If there is a link from Page A to Page B, the author of Page A presumably thinks there is something important about Page B, and the link records that judgment. PageRank determines the importance of Page B by looking at all the pages that link to Page B. The amount of importance each page transfers along its links is proportional to the importance of the page itself, which gives the algorithm its recursive nature.
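To make the vote-passing idea concrete, here's a minimal sketch in Python on a hypothetical four-page web. The link structure is made up purely for illustration, and I'm leaving out the damping factor the real algorithm uses:

```python
# Iteratively pass "votes" around a tiny hypothetical four-page web.
links = {
    "A": ["B", "C"],   # Page A links to B and C
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}  # start with equal importance

for _ in range(50):  # iterate until the ranks settle
    new_rank = {p: 0.0 for p in pages}
    for page, outlinks in links.items():
        share = rank[page] / len(outlinks)  # each link carries an equal share
        for target in outlinks:
            new_rank[target] += share       # a vote proportional to importance
    rank = new_rank

print(rank)  # Page C collects the most votes, so it comes out most important
```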

The algorithm can be reformulated in several ways. If we look at the web as a graph, we can build a normalized adjacency matrix. An adjacency matrix is one where element (i, j) represents whether node i is connected to node j. It does not need to be symmetric: a link from node i to node j does not imply a link from node j to node i. In the adjacency matrix, element (i, j) is 1 if there is a link from Page i to Page j, and 0 otherwise. However, we are looking for a normalized adjacency matrix, which here means each row has been scaled so its entries sum to 1. This gives us a Markov matrix, which will allow for some other cool things later on. Using the normalized adjacency matrix, we can reformulate the algorithm as a matrix equation: in essence we are now solving for the principal eigenvector of the transpose of this matrix (equivalently, its principal left eigenvector).
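Here's what that looks like in code, on the same hypothetical four-page graph as above. This is a sketch assuming NumPy; the PageRank vector falls out as the eigenvector of the transpose with eigenvalue 1:

```python
import numpy as np

# Adjacency matrix for the hypothetical graph: row i lists page i's outlinks.
A = np.array([
    [0, 1, 1, 0],   # A -> B, A -> C
    [0, 0, 1, 0],   # B -> C
    [1, 0, 0, 0],   # C -> A
    [0, 0, 1, 0],   # D -> C
], dtype=float)

M = A / A.sum(axis=1, keepdims=True)  # scale each row so it sums to 1

# The PageRank vector is the eigenvector of M^T with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(M.T)
principal = eigvecs[:, np.argmax(eigvals.real)].real
pagerank = principal / principal.sum()  # rescale so the entries sum to 1
print(pagerank)                         # roughly [0.4, 0.2, 0.4, 0.0]
```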

This normalized adjacency matrix is a Markov matrix. This means we can interpret element (i, j) as the probability that a surfer will go to Page j given that the surfer is already at Page i. If we square the matrix, we find the probabilities that the surfer will get from one page to another in two steps. If we continue this process, eventually the powers of the matrix settle into a steady state, whose rows give the long-run probability that a surfer randomly clicking links ends up on a particular page. This is basically what PageRank measures.
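A quick sketch of that squaring process on the same illustrative matrix; after a few squarings every row of the power approaches the same steady-state vector:

```python
import numpy as np

# The row-normalized Markov matrix from the sketch above.
M = np.array([
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])

P = M.copy()
for _ in range(6):
    P = P @ P   # squaring doubles the step count: 2, 4, 8, ... 64 steps
print(P)        # every row approaches the same steady-state distribution
```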

Looking at things this way gives another way of calculating the PageRank, and some things I've read suggest this is actually the way Google does the calculation. PageRank can be calculated by simulating random surfers starting at random pages and following links randomly, then counting how often a surfer arrives at a particular page. If this swarm of surfers doesn't move purely randomly, but instead follows probabilities based on how often previous surfers have taken certain links, the system should still converge to the correct PageRank values, but it could do so much more quickly.
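Here's roughly what the purely random version of that simulation might look like; the number of surfers and the walk length are arbitrary illustrative choices:

```python
import random

# Simulate random surfers on the hypothetical four-page graph from above.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
visits = {p: 0 for p in links}

for _ in range(10000):                 # many independent surfers
    page = random.choice(list(links))  # each starts on a random page
    for _ in range(100):               # and follows links at random
        page = random.choice(links[page])
        visits[page] += 1

total = sum(visits.values())
print({p: v / total for p, v in visits.items()})  # approximates PageRank
```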

You'll notice I used the word swarm above. This way of looking at the algorithm gives it a striking similarity to many swarm intelligence approaches. Consider instead that we have ants crawling the web. As the ants traverse edges, they deposit pheromones on them, which then influence other ants to follow these same edges, dropping more pheromones on the way. This process continues, and the ants reach a sort of steady state. Looking at things this way, it's not too far-fetched to call PageRank a swarm intelligence algorithm. Solving for the principal eigenvector of the adjacency matrix could be approximated by depositing pheromones and modeling their diffusion and decay.
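Just to sketch the idea (this is speculative, and the deposit and decay constants below are arbitrary):

```python
import random

# Each edge of the hypothetical graph carries a pheromone level. Ants pick
# outgoing edges in proportion to pheromone, deposit a little more as they
# cross, and all pheromone decays a bit after every round.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pheromone = {(p, q): 1.0 for p in links for q in links[p]}

def pick_edge(page):
    """Choose an outlink with probability proportional to its pheromone."""
    edges = [(page, q) for q in links[page]]
    weights = [pheromone[e] for e in edges]
    return random.choices(edges, weights=weights)[0]

for _ in range(1000):                   # rounds of ant traffic
    page = random.choice(list(links))   # each ant starts somewhere random
    for _ in range(20):
        edge = pick_edge(page)
        pheromone[edge] += 0.1          # deposit along the way
        page = edge[1]
    for e in pheromone:                 # evaporation / decay
        pheromone[e] *= 0.999

print(pheromone)  # heavily travelled edges accumulate the most pheromone
```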

Now's the part where my advisor would ask the question "Okay, but what emerges?" This is an important aspect of swarm intelligence -- there has to be some sort of emergent behavior. The quick answer is "We computed PageRank!" However, we don't quite have any intelligence yet, because the ants are just wandering around the web aimlessly. Ants in the real world have a goal, which is to find food and bring it back to the colony. To really start having swarm intelligence, we need to give our web surfing ants a goal. The food they are looking for could perhaps be made analogous to the information the web surfer is seeking. Now the ants have a goal, and they can start laying pheromones intelligently to help other ants with a similar goal find the same thing.

Applying this to the real world could be problematic, however, because we have many surfers that are looking for almost as many different topics. To build off of pheromones from other surfers, someone would have to be looking for information on the same topic. A pheromone could be assigned for each keyword, but keeping track of all of this information would quickly become intractable.

To do: read up on Topical PageRank.
