Tuesday, October 25, 2005

What do search engines solve?

The obvious answer to this question is "Search engines solve the problem of finding information." This makes sense to most people, but from a technical standpoint it is difficult to evaluate. What information is the user trying to find? Arguably any page on the web contains information, so if the goal were merely "finding information," a search engine would not be necessary: most users can find some web page without one. In fact, the search engine itself is a web page, so by reaching the search engine the user would already have accomplished the goal.

Now, most people will recognize quickly that the average web surfer doesn't want to find just any information; the surfer is looking for one particular piece or type of information. Let us assume that the surfer is looking for the answer to some question. This won't always be the case, since some people browse the web just hoping to find something interesting, but for the purposes of designing a search engine it's reasonable to assume that the user is looking for something specific.

So, search engines help the user answer a particular question by finding a page on the web that answers it. Another logical thing to ask is: which search engine is the best? We could rank a search engine simply by whether the user's search was successful, that is, whether they ever found the answer to their question. This only gives a yes or no answer for a single search, but we could look at the percentage of successful searches over a long period of time to decide which search engine is the best.

Whether the user eventually finds what they are looking for is not all we should consider in ranking a search engine. Users could probably find the information eventually without ever using a search engine, so by this criterion alone a search engine offers no improvement. Instead we want to measure how well a search engine helps a user answer the question. A natural measure is the time it takes to find the answer. Time isn't terribly reliable, though, because some users read more slowly or get distracted. A better approach is to count the number of hyperlinks followed or pages visited before finding the answer. We could call this the path length.

Measuring the average path length to find the answer to a question seems to give a good metric for evaluating a search engine. It provides a quantitative handle on "how good are the search results?", which is otherwise a very vague question. So, returning to our original question, "what do search engines solve," it looks like we can say that search engines attempt to minimize the path length to the answer to a question.
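
To make this concrete, here is a minimal sketch (in Python, using a made-up session-log format purely for illustration) of how the two measures discussed above, success rate and average path length, could be computed from logged search sessions:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Session:
        pages_visited: int   # hyperlinks followed during the search
        found_answer: bool   # did the user end up answering their question?

    def success_rate(sessions: List[Session]) -> float:
        """Fraction of sessions in which the user found an answer."""
        return sum(s.found_answer for s in sessions) / len(sessions)

    def average_path_length(sessions: List[Session]) -> float:
        """Mean number of pages visited, counted over successful sessions only."""
        successful = [s.pages_visited for s in sessions if s.found_answer]
        return sum(successful) / len(successful)

    # Example: three hypothetical logged sessions for one search engine.
    log = [Session(4, True), Session(9, True), Session(12, False)]
    print(success_rate(log))         # 0.666...
    print(average_path_length(log))  # 6.5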

This is useful to me, because it gives me a target for my project, and a way to objectively compare my results against other search engines. The solution is now testable, and also allows me to more precisely answer questions like "what emerges" when using a swarm intelligence approach.

Notice that the problem of minimizing path length to an answer has solutions other than search engines. This framing puts the focus on the user experience rather than on the search engine itself, which is where the focus should be. It also means that I am no longer limited to building a search engine to solve this problem; I can apply swarm intelligence in some other way to optimize the user experience.

It seems like I should design an experiment to test existing search engines and gather some baseline data against which to compare the project I eventually produce.

Sunday, October 02, 2005

PageRank, Markov Processes and Swarm Intelligence

Since no endeavor in web search would be complete without considering Google PageRank, I've been studying the PageRank algorithm today. It overlaps nicely with the Linear Algebra class I am taking, where we have been assigned a paper about an application of linear algebra. As many web developers and people interested in internet search probably know, PageRank attempts to measure the global importance of pages on the Web. It does this by looking at how links are distributed, counting each link as a vote: if there is a link from Page A to Page B, the author of Page A apparently thinks there is something important about Page B, and the link reflects this. PageRank determines the importance of Page B by looking at all the pages that link to Page B. The amount of importance a page transfers along each of its links is proportional to that page's own importance, divided among its outgoing links. This gives the algorithm a recursive nature.
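
In symbols, and as I understand the original formulation (leaving out the damping factor that the published algorithm adds to model random jumps and guarantee a unique solution), the recursion looks roughly like

    PR(B) = \sum_{A \in In(B)} PR(A) / L(A)

where In(B) is the set of pages linking to Page B and L(A) is the number of outgoing links on Page A.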

The algorithm can be reformulated in several ways. If we look at the web as a graph, we can build a normalized adjacency matrix. An adjacency matrix is one where each element (i, j) represents whether node i is connected to node j. It does not need to be symmetric: node i being connected to node j does not imply that node j is connected to node i. In our adjacency matrix, the element (i, j) is 1 if there is a link from Page i to Page j, and 0 otherwise. We then normalize the matrix, which here means scaling each row so its entries sum to 1, that is, dividing each row by the page's number of outgoing links. This gives us a Markov matrix, which will allow for some other cool things later on. Using the normalized adjacency matrix, we can reformulate the algorithm as a matrix equation: in essence we are now solving for the principal eigenvector of this matrix (more precisely, its leading left eigenvector, since the surfer's probability distribution multiplies the matrix from the left).
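
Here is a small sketch of that construction on a made-up three-page graph (using numpy, and ignoring the damping factor), just to pin down the linear algebra:

    import numpy as np

    # Toy graph: A[i, j] = 1 if Page i links to Page j.
    A = np.array([[0, 1, 1],
                  [1, 0, 1],
                  [1, 0, 0]], dtype=float)

    # Normalize each row by the page's out-degree to get the Markov matrix.
    M = A / A.sum(axis=1, keepdims=True)

    # PageRank is the stationary distribution: the leading left eigenvector of M,
    # i.e. the leading eigenvector of M transposed, rescaled to sum to 1.
    vals, vecs = np.linalg.eig(M.T)
    lead = vecs[:, np.argmax(vals.real)].real
    pagerank = lead / lead.sum()
    print(pagerank)   # roughly [0.444, 0.222, 0.333] for this graph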

This normalized adjacency matrix is a Markov matrix, which means we can interpret each element (i, j) as the probability that a surfer will go to Page j given that the surfer is currently at Page i. If we square the matrix, we get the probabilities that the surfer will get from one page to another in exactly two steps. If we continue this process, the powers of the matrix eventually settle into a steady state, representing the long-run probability that a surfer randomly clicking links will be on a particular page. This is essentially what PageRank measures.
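
Using the same toy graph as above, a quick check of this convergence (assuming the graph is connected and aperiodic, which it is here):

    import numpy as np

    A = np.array([[0, 1, 1],
                  [1, 0, 1],
                  [1, 0, 0]], dtype=float)
    M = A / A.sum(axis=1, keepdims=True)

    # Row 0 of M^k is the distribution over pages after k random clicks from Page 0.
    # As k grows, every row approaches the same steady-state distribution.
    for k in (1, 2, 8, 32):
        print(k, np.linalg.matrix_power(M, k)[0])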

Looking at things this way gives another way of calculating PageRank, and some things I've read suggest this is actually how Google does the calculation. PageRank can be computed by simulating random surfers starting at random pages and following links randomly, and then looking at how often a surfer arrives at a particular page. If this swarm of surfers doesn't move purely randomly, but instead follows probabilities based on how often previous surfers have followed certain links, the system should still converge to the correct PageRank values, but it could happen much more quickly.
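
A sketch of that simulation on the same toy graph (simplified to one long random walk rather than many separate surfers, and with no reinforcement, which should converge to the same distribution):

    import random
    from collections import Counter

    links = {0: [1, 2], 1: [0, 2], 2: [0]}   # same toy graph as above
    visits = Counter()
    page = 0
    for _ in range(100_000):
        page = random.choice(links[page])
        visits[page] += 1

    total = sum(visits.values())
    for p in sorted(links):
        # Visit frequencies should approach the eigenvector values computed earlier.
        print(p, visits[p] / total)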

You'll notice I used the word swarm above. This way of looking at the algorithm gives it a striking similarity to many swarm intelligence approaches. Consider instead that we have ants crawling the web. As the ants traverse edges, they deposit pheromones on them, which then influence other ants to follow the same edges, dropping more pheromones along the way. This process continues, and the ants reach a sort of steady state. Looking at it this way, it's not too far-fetched to call PageRank a swarm intelligence algorithm. Solving for the eigenvector of the adjacency matrix could be approximated by adding pheromones and modeling diffusion and decay.
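
Here is a rough sketch of that ant picture on the same toy graph. This is my own toy formulation rather than a published algorithm, and whether this particular update rule converges to exactly the PageRank values is something I would want to verify rather than assume:

    import random

    links = {0: [1, 2], 1: [0, 2], 2: [0]}
    pheromone = {(i, j): 1.0 for i in links for j in links[i]}
    DECAY = 0.999    # evaporation per step (made-up constant)
    DEPOSIT = 1.0    # pheromone added to each edge an ant crosses

    page = 0
    for _ in range(100_000):
        outgoing = links[page]
        weights = [pheromone[(page, j)] for j in outgoing]
        nxt = random.choices(outgoing, weights=weights)[0]   # follow the pheromone
        pheromone[(page, nxt)] += DEPOSIT
        for edge in pheromone:                               # evaporation / decay
            pheromone[edge] *= DECAY
        page = nxt

    # Total pheromone arriving at each page, as a crude stand-in for its rank.
    incoming = {p: sum(v for (i, j), v in pheromone.items() if j == p) for p in links}
    total = sum(incoming.values())
    print({p: round(incoming[p] / total, 3) for p in sorted(links)})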

Now's the part where my advisor would ask "Okay, but what emerges?" This is an important aspect of swarm intelligence: there has to be some sort of emergent behavior. The quick answer is "We computed PageRank!" However, we don't quite have any intelligence yet, because the ants are just wandering around the web aimlessly. Ants in the real world have a goal, which is to find food and bring it back to the colony. To really have swarm intelligence, we need to give our web-surfing ants a goal. The food they are looking for could perhaps be made analogous to the information the web surfer is seeking. Then the ants have a goal, and they can start laying pheromones intelligently to help other ants with a similar goal find the same thing.

Applying this to the real world could be problematic, however, because we have many surfers that are looking for almost as many different topics. To build off of pheromones from other surfers, someone would have to be looking for information on the same topic. A pheromone could be assigned for each keyword, but keeping track of all of this information would quickly become intractable.

To do: read up on Topical PageRank.