Friday, February 17, 2006

Overview of Last Quarter

I noticed that I've neglected to update this blog for the last three months or so. That means I've gone an entire quarter of the school year and not said anything. It's been a busy quarter, but I figured I should update everyone on what I've accomplished.

I formalized some of the ideas behind the search algorithm I hope to implement. Then I built a web crawler and crawled about 12,000 web pages; the crawler has found roughly 300,000 more pages it hasn't managed to visit yet, and it's not terribly robust at the moment. I haven't had a chance to analyze the crawled data yet, so I can't comment on it much. The plan is to use this data in simulations so the graph topology actually represents a real subset of the Web.
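To make the crawling idea concrete, here is a rough sketch of the sort of breadth-first crawler I mean. This is not my actual crawler, just an illustration using only the Python standard library; a real crawler also needs to worry about robots.txt, politeness delays, duplicate URLs, and proper HTML parsing.

```python
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

# Very loose pattern for absolute links; a real crawler would use an HTML parser.
LINK_RE = re.compile(r'href="(http[^"#]+)"', re.IGNORECASE)

def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl: visit pages, record their out-links, keep a frontier."""
    frontier = deque(seed_urls)   # discovered but not yet visited
    graph = {}                    # url -> list of out-links (the crawled subgraph)
    while frontier and len(graph) < max_pages:
        url = frontier.popleft()
        if url in graph:
            continue              # already visited
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except Exception:
            graph[url] = []       # unreachable page: keep the node, no edges
            continue
        links = [urljoin(url, link) for link in LINK_RE.findall(html)]
        graph[url] = links
        frontier.extend(links)    # the "found but not yet visited" backlog
    return graph, list(frontier)
```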

I started running some simulations. Basically I generate a graph with certain interesting properties, then I have a swarm of simulated agents start randomly following links in search of something. While I can't say that this will be a good approximation of actual human browsing habits, it will hopefully give some insight as to whether this approach has any hope of working at all. I've had some interesting results so far, but again I haven't analyzed them enough to report anything truly useful. It does look like the amount of work I expect each node to be able to do might be a bit high, so some serious optimization of my algorithms is going to be needed.
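For anyone curious what these simulations look like, here is a stripped-down version of the basic loop: generate a random directed graph, mark a few nodes as holding the "information," and measure how many links a swarm of purely random surfers has to follow before finding it. The real simulations use more interesting topologies (eventually the crawled subset of the Web) and smarter agents; this is only meant as a baseline sketch.

```python
import random

def random_graph(num_nodes, out_degree=7):
    """Toy topology: every node links to a fixed number of random nodes."""
    return {node: random.sample(range(num_nodes), out_degree)
            for node in range(num_nodes)}

def average_steps_to_find(graph, targets, num_agents=200, max_steps=10000):
    """Let each agent wander randomly until it hits a target node."""
    steps_taken = []
    for _ in range(num_agents):
        node = random.choice(list(graph))
        for step in range(1, max_steps + 1):
            node = random.choice(graph[node])   # follow a random out-link
            if node in targets:
                steps_taken.append(step)
                break
    return sum(steps_taken) / len(steps_taken) if steps_taken else None

graph = random_graph(1000)
print(average_steps_to_find(graph, targets={0, 1, 2}))
```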

As far as what's coming up, I'm going to take a break to write up what I've done for my thesis advisor. Hopefully this will also include analyzing some of the data I've generated so far. I have a poster presentation the second week of March. I think my focus until then is going to be mostly on simulation, and then I will start doing serious work towards getting a prototype done. Hopefully from now on I'll have more time to keep this blog up to date as well.

Tuesday, October 25, 2005

What do search engines solve?

The obvious answer to this question is "Search engines solve the problem of finding information." This makes sense to most people, but from a technical standpoint it is difficult to evaluate. What information is the user trying to find? Arguably any page on the web contains information, so by that definition a search engine wouldn't be necessary at all, since most users can find some web page without using one. In fact, the search engine itself is a web page, so by finding the search engine the user has already accomplished the goal.

Now, most people will recognize quickly that your average web surfer doesn't want to find just any information; the surfer is looking for one particular piece or type of information. Let us assume that the surfer is looking for the answer to some question. This won't always be the case, as some people simply browse the web hoping to find something interesting, but for the purposes of designing a search engine it's reasonable to assume that the user is looking for something.

So, search engines help the user answer a particular question by finding a page on the web that answers it. Another logical question is: which search engine is the best? We could rank a search engine simply by whether the user was successful in their search: did they ever find the answer to the question they were asking? This only gives a yes or no answer, but we could look at the percentage of successful searches over a long period of time to decide which search engine is the best.

Whether the user eventually finds what they are looking for is not all we should consider in ranking a search engine. Users could probably eventually find the information without ever using a search engine, so by that criterion alone a search engine doesn't offer any improvement. Instead we want to measure how well a search engine helps a user answer the question. A logical measure of this is the time it takes to find the answer. Time isn't terribly reliable though, because some users read more slowly, or they get distracted. A better way would be to count the number of hyperlinks followed or pages visited to find the answer. We can refer to this as the path length.

Measuring the average path length to find the answer to a question seems to give a good metric for evaluating a search engine. It turns the very vague question "how good are the search results?" into something we can answer quantitatively. So, to return to our original question, "what do search engines solve," it looks like we can say that search engines attempt to minimize the path length to answer a question.
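As a concrete sketch of how this metric could be computed from logged browsing sessions (the session format here is made up for illustration, not something I have implemented):

```python
def average_path_length(sessions):
    """sessions: list of (pages_visited, found_answer) pairs for one search engine.
    The score is the mean number of pages visited over the successful sessions."""
    successful = [len(pages) for pages, found in sessions if found]
    if not successful:
        return float("inf")   # never found anything: worst possible score
    return sum(successful) / len(successful)

# Example: three searches, two of which eventually found an answer.
sessions = [(["results", "page_a"], True),
            (["results", "page_b", "page_c"], True),
            (["results", "page_d"], False)]
print(average_path_length(sessions))   # (2 + 3) / 2 = 2.5
```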

This is useful to me, because it gives me a target for my project, and a way to objectively compare my results against other search engines. The solution is now testable, and also allows me to more precisely answer questions like "what emerges" when using a swarm intelligence approach.

Notice that the problem of minimizing path length to an answer has other solutions besides search engines. This framing puts the focus on the user experience rather than on the search engine itself, which is definitely where the focus should be. It also means that I no longer have to be limited to a search engine to solve this problem; I can apply swarm intelligence in other ways to optimize the user experience.

It seems like I should design an experiment to test search engines and gather some data to compare the project I eventually produce against.

Sunday, October 02, 2005

PageRank, Markov Processes and Swarm Intelligence

Since no endeavor in web search would be complete without considering Google PageRank, I've been studying the PageRank algorithm today. It overlaps nicely with the Linear Algebra class I am taking, where we have been assigned a paper about an application of linear algebra. As many web developers and people interested in internet search probably know, PageRank attempts to measure the global importance of pages on the Web. It does this by looking at the way links are distributed, counting each link as a vote. If there is a link from Page A to Page B, the author of Page A apparently thinks there is something important about Page B, so the link reflects this. PageRank determines the importance of Page B by looking at all the pages that link to Page B. The amount of importance each page transfers along its links is proportional to the importance of the page itself, divided evenly among its outgoing links. This gives the algorithm a recursive nature.

The algorithm can be reformulated in several ways. If we look at the web as a graph, we can build a normalized adjacency matrix. An adjacency matrix is one where each element (i, j) represents whether node i is connected to node j. It does not need to be symmetric: node i being connected to node j does not imply that node j is connected to node i. In the adjacency matrix, the element (i, j) is 1 if there is a link from Page i to Page j, and 0 otherwise. We then normalize it by scaling each row so that its entries sum to 1, i.e., each nonzero entry becomes 1 divided by the number of outgoing links on that page. This gives us a Markov matrix, which will allow for some other cool things later on. Using the normalized adjacency matrix, we can reformulate the algorithm as a matrix equation. In essence we are now solving for the principal eigenvector of this matrix (more precisely, its principal left eigenvector, since the PageRank vector acts as a row vector).
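Here's a small sketch of that formulation in code, using the power method to approximate the eigenvector instead of solving for it directly. The toy graph and iteration count are just placeholders; the damping factor d is the standard PageRank trick (a small probability of jumping to a random page) that keeps the iteration well behaved.

```python
import numpy as np

def pagerank(graph, d=0.85, iterations=100):
    """Power iteration on the row-stochastic link matrix described above."""
    pages = sorted(graph)
    index = {page: k for k, page in enumerate(pages)}
    n = len(pages)
    M = np.zeros((n, n))
    for page, links in graph.items():
        if links:
            for target in links:
                M[index[page], index[target]] = 1.0 / len(links)
        else:
            M[index[page], :] = 1.0 / n          # dangling page: jump anywhere
    r = np.full(n, 1.0 / n)                      # start from a uniform guess
    for _ in range(iterations):
        r = (1 - d) / n + d * (r @ M)            # r is a row vector: r <- rM
    return dict(zip(pages, r))

print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```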

This normalized adjacency matrix is a Markov matrix. This means we can interpret each element (i, j) as the probability that a surfer will go to Page j given that the surfer is already at Page i. If we square the matrix, we can find the probabilities that the surfer will get from one page to another in two steps. If we continue this process, eventually the matrix settles into a steady state, which gives the long-run probability that a surfer randomly clicking links will be at any particular page. (To guarantee that this steady state exists and is unique, PageRank also adds a damping factor: a small chance that the surfer jumps to a random page instead of following a link.) This is basically what PageRank measures.
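To state the steady-state claim a little more precisely, with M the row-normalized matrix from above (damping already folded in so that the limit exists):

```latex
% Powers of M converge, and every row of the limit is the same distribution \pi,
% which is exactly the PageRank vector.
\lim_{n \to \infty} \big( M^{n} \big)_{ij} = \pi_j \quad \text{for every starting page } i,
\qquad \text{where } \pi = \pi M \ \text{and} \ \textstyle\sum_j \pi_j = 1 .
```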

Looking at things this way gives another way of calculating the PageRank, and some things I've read suggest this is actually the way Google does the calculation. PageRank can be calculated by simulating random surfers starting at random pages and following links randomly, and then looking at how often a surfer arrives at a particular page. If this swarm of surfers doesn't move purely randomly, but instead follows probabilities based on how often previous surfers have followed certain links, the system should still converge to the correct PageRank values, but it could happen much more quickly.
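Here's what the purely random version of that calculation could look like as code. Each surfer teleports to a random page with probability 1 - d (the same damping factor as before), and the visit counts approximate the PageRank values; the biased, reinforcement-following variant would replace the uniform random choice over links with a weighted one.

```python
import random
from collections import Counter

def monte_carlo_pagerank(graph, num_surfers=500, steps=1000, d=0.85):
    """Estimate PageRank by counting how often random surfers land on each page."""
    pages = list(graph)
    visits = Counter()
    for _ in range(num_surfers):
        page = random.choice(pages)
        for _ in range(steps):
            if graph[page] and random.random() < d:
                page = random.choice(graph[page])   # follow a random out-link
            else:
                page = random.choice(pages)         # teleport / dangling page
            visits[page] += 1
    total = sum(visits.values())
    return {p: visits[p] / total for p in pages}

print(monte_carlo_pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```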

You'll notice I used the word swarm up above. This way of looking at the algorithm gives it a striking similarity to many swarm intelligence approaches. Consider instead that we have ants crawling the web. As the ants traverse edges, they deposit pheromones on the edges, which then influence other ants to follow these same edges, dropping more pheromones on the way. This process continues, and the ants reach a sort of steady state. Looking at things this way, it's not too far-fetched to call PageRank a swarm intelligence algorithm. Solving for the eigenvector of the adjacency matrix could be approximated by adding pheromones and modeling diffusion and decay.
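A toy version of the ant picture might look like the following. Nothing here comes from an actual PageRank implementation; it's just the deposit-and-decay loop described above (diffusion is left out to keep it short), and whether the resulting pheromone levels really approximate the eigenvector is exactly the kind of thing that would need to be checked in simulation.

```python
import random
from collections import defaultdict

def ant_pheromones(graph, rounds=200, ants_per_round=50, decay=0.99, deposit=1.0):
    """Ants walk the link graph; edges they use gain pheromone, all edges decay."""
    pheromone = defaultdict(lambda: 1.0)          # (from, to) -> pheromone level
    for _ in range(rounds):
        for _ in range(ants_per_round):
            node = random.choice(list(graph))
            if not graph[node]:
                continue                          # dead end, drop this ant
            weights = [pheromone[(node, nxt)] for nxt in graph[node]]
            nxt = random.choices(graph[node], weights=weights)[0]
            pheromone[(node, nxt)] += deposit     # reinforce the chosen edge
        for edge in list(pheromone):
            pheromone[edge] *= decay              # evaporation
    return dict(pheromone)

print(ant_pheromones({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```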

Now's the part where my advisor would ask the question "Okay, but what emerges?" This is an important aspect of swarm intelligence -- there has to be some sort of emergent behavior. The quick answer is "We computed PageRank!" However, we don't quite have any intelligence yet, because the ants are just wandering around the web aimlessly. Ants in the real world have a goal, which is to find food and bring it back to the colony. To really start having swarm intelligence, we need to give our web surfing ants a goal. The food they are looking for could perhaps be made analogous to the information the web surfer is seeking. Now the ants have a goal, and they can start laying pheromones intelligently to help other ants with a similar goal find the same thing.

Applying this to the real world could be problematic, however, because we have many surfers that are looking for almost as many different topics. To build off of pheromones from other surfers, someone would have to be looking for information on the same topic. A pheromone could be assigned for each keyword, but keeping track of all of this information would quickly become intractable.

To do: read up on Topical PageRank.

Tuesday, September 13, 2005

Framing the Problem

First of all, I'd like to apologize for being so delinquent in my updates. My summer turned out to be pretty busy towards the end (that's my excuse anyway), and it definitely takes a while to get back into the swing of things when the school year starts up. The school year is looking good so far. My classes are interesting, and my schedule allows me enough time to fit in most of what I need to. I'm starting some more intense work on my thesis, which should mean more frequent posts here. I've been meeting weekly with my advisor, and he's doing a really good job giving me some direction for my project.

The current thing my advisor wants me to do is come up with a clear working statement of the problem. I agree this is a good idea, since it will give me a way to focus my ideas. I wrote up a simple one earlier today, which started out with "can web search be done successfully in a decentralized manner?" Now, what do I mean by decentralized? Google could argue that their search is decentralized, since they have their army of robots and hundreds of computers that crawl the web and index most of what they come across. This is decentralized in a sense, but it's not what I mean. Google seems centralized to the user because everything is wrapped up behind www.google.com and only Google controls what search results it will return. I would like to see if the problem of searching can be successfully distributed out to the individual web users, so that each computer running a web browser participates in building the index.

This leads me to the question of how to frame the problem. I'm looking at an extremely decentralized architecture for doing web search. I can see two different ways to look at this problem: either as a swarm intelligence problem or as a peer-to-peer problem. As a peer-to-peer problem, it becomes a matter of building a peer-to-peer network for searching the web, which naturally directs one towards the issues facing peer-to-peer networks, such as trust, accountability, and security. As a swarm problem, we would be looking at lots of agents out searching the web (i.e. people using web browsers to look at information) that then start cooperating to make everyone more efficient. Swarm intelligence studies how many seemingly independent agents, each relatively simple on its own, combine to produce a very distinct group behavior. In the context of web search, the contrast is between people browsing the web with just a few addresses to start from and then following links everywhere, versus those same people starting to communicate and making everyone more efficient by sharing previous experience.

So which aspect would be most useful? I think the swarm aspect brings out the parts of the problem that are more interesting to me. I also think swarm intelligence and peer-to-peer are very closely related, although it seems like relatively few people are looking at peer-to-peer networking from a swarm intelligence perspective. Peer-to-peer networks involve lots of computers or nodes on a network that all participate in accomplishing some particular goal. For example, let's look at Gnutella. Gnutella is a decentralized way of sharing files. If it were just one computer, there would be a very small selection of content available. On the other hand, by combining these nodes into a network, one searches the whole network rather than the individual computers, and as a network there is a much wider range of information available. Thus we have the behavior of individual nodes and the behavior of the network as a whole as two very distinct things. This is exactly what swarm intelligence is about.

So peer-to-peer seems like it can be cast as a swarm intelligence problem, which then opens it up to all the insights and approaches of swarm intelligence. Swarm intelligence can also gain something from what the peer-to-peer people have been developing. All the swarm problem solving approaches I know of assume everyone in the swarm is behaving well. In a peer-to-peer network, this is almost never the case. Using swarm intelligence in peer-to-peer networks will force swarm intelligence to develop means of dealing with misbehaving members of the swarm.

Both aspects seem valid, and they have quite a bit in common. I think for the most part I will consider this as a swarm intelligence problem, and deal with the peer-to-peer issues from a swarm intelligence standpoint, at least as far as it makes sense to do that.

Monday, July 18, 2005

The SearchActive Idea

While I'm probably going to talk mostly about peer-to-peer networking on this blog, the key idea behind SearchActive is basically a swarm intelligence solution to searching the Web. In a nutshell, swarm intelligence involves studying the behavior of social creatures and then applying these behaviors to solving other problems. Most swarm intelligence approaches are based on social insects, like ants, bees, and termites. There has also been some work in studying the flocking behavior of birds. Taking ants for example: they basically do two things. They look for food, and when they find it they carry the food back to their nest. While they're carrying food back, they leave a trail of pheromones, which other ants tend to follow. In essence the ants solve the shortest path problem. Anyone who's played a lot of real time strategy games has probably experienced the difficulty computers seem to have solving this same problem.

The analogy for searching the Web is closer to people wandering through town looking for something. Let's say our good friend Bob is looking for an ATM, and he's new to town. He starts out walking along, and he sees someone walking by. He asks that person if they know where the nearest ATM is, preferably one from his home bank so he can avoid paying ATM fees. This person doesn't know where that particular ATM is, but he does know where an ATM is, so he points Bob in the right direction and off he goes. For Bob this isn't ideal, but it's better than nothing, so he goes with it. Along the way, he sees another person walking down the street, so he asks this person the same question. This person tells Bob where an ATM is, and even one that's from the right bank, but the trouble is it's a couple blocks further. Bob then has to decide which is the better result based on whatever factors he thinks are important, such as the distance, how trustworthy either of these people are, and how much it's worth to him to not have to pay an ATM fee.

Anyway, later on, Bob decides he needs to find a drug store. Bob remembers passing one on his way to the ATM earlier that day, so he starts out heading the same direction. Along the way he asks some of the people he passes if there's a better one. He comes across one man who happens to be the owner of a drug store that's known for having a very poor selection and ridiculously high prices. This man obviously wants to improve his business, so he explains to Bob that this is by far the best drug store in town, and that it's just a block to the east. He was actually lying, as his store is a good three miles to the east. Anyway, Bob, not knowing any better, decides to try his luck. After going about a block the drug store is nowhere to be found, so he asks someone about it. This person responds that the store is actually a long way away and it's not really worth going to anyway, and recommends a couple of better drug stores to Bob. One of these is the one Bob was heading for in the first place, so he decides to go back to that one.

SearchActive will use a similar method to search the Web. First let's ignore the peer-to-peer aspects and just look at how it would work for one computer in isolation. SearchActive will be a program that runs on your computer and monitors all the web pages you visit. It assumes you can find at least one page without a search engine, which isn't that unreasonable, because everyone at the very least needs to be able to find a search engine. When you go to a page, SearchActive catches it, decides what the keywords on the page are, and stores it in its index. Then, in the background while you're busy reading your web page, SearchActive is following the links on that page and adding the linked pages to its index. When you click on a link, your click takes priority over anything else it's doing, so it goes and grabs the page and gives it to you, and then proceeds to index that page as well. If you're lucky, the link you clicked is one that SearchActive already indexed, so it can give you a local copy instead, which makes things seem a lot quicker for you because it hides the download time.
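Here is a very rough sketch of that single-node behavior. This is not SearchActive itself; the keyword extraction, the page cache, and the "in the background" part are all drastically simplified (the prefetching here runs inline and only one level deep).

```python
import re
from collections import defaultdict
from urllib.request import urlopen

LINK_RE = re.compile(r'href="(http[^"#]+)"', re.IGNORECASE)
WORD_RE = re.compile(r"[a-z]{3,}")   # "keywords" here are just lowercase words

class LocalIndex:
    def __init__(self):
        self.index = defaultdict(set)             # keyword -> set of urls

    def add_page(self, url, html):
        for word in set(WORD_RE.findall(html.lower())):
            self.index[word].add(url)

    def visit(self, url, prefetch=True):
        """Called whenever the user loads a page: index it, then its links."""
        html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        self.add_page(url, html)
        if prefetch:                              # the "background" indexing
            for link in LINK_RE.findall(html):
                try:
                    self.visit(link, prefetch=False)
                except Exception:
                    pass                          # skip pages that won't load
        return html

    def search(self, keywords):
        """Pages that contain every keyword; ranking comes later."""
        hits = [self.index.get(k.lower(), set()) for k in keywords]
        return set.intersection(*hits) if hits else set()
```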

When you want to make a search, you give some keywords to SearchActive and it checks its index for pages it has seen that might be a good match. The process continues and before long SearchActive is able to give you pretty good search results.
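With the sketch above, a search is just a lookup against whatever the index has seen so far (the URL and keywords here are only placeholders):

```python
index = LocalIndex()
index.visit("http://example.com/")          # indexes the page and its links
print(index.search(["example", "domain"]))  # pages containing both keywords
```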

Now, there needs to be some way to rank these pages, since most web pages don't have particularly good content for any given search. I'll talk about some ideas I have for this in a later post. This is really where the hard part of searching the web lies.

Now let's take your own computer running SearchActive and connect it to a few other computers running SearchActive (hereafter called nodes), which are in turn connected to a few other nodes, and so on. Now when you go to search, SearchActive not only looks at its own index, but also asks its neighbors if they know of any good pages for the keywords you gave. Each of the neighbors can either answer or pass the question on to their own neighbors. In essence, when each node responds it is voting for a particular page. Your node has to decide how much each vote counts and then combine them all to figure out what the best page is. In essence you're deciding how much you trust certain nodes.
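A sketch of the vote-combining step might look like this. The trust values, the neighbor protocol, and the default weight for unknown nodes are all placeholders; the real question is what these numbers should be and how they get updated.

```python
from collections import defaultdict

def combine_votes(neighbour_results, trust):
    """neighbour_results: {neighbour_id: [urls]}; trust: {neighbour_id: float}."""
    scores = defaultdict(float)
    for neighbour, urls in neighbour_results.items():
        weight = trust.get(neighbour, 0.1)     # unknown neighbours count a little
        for url in urls:
            scores[url] += weight              # each answer is a weighted vote
    return sorted(scores, key=scores.get, reverse=True)

results = {"node_a": ["http://x.example", "http://y.example"],
           "node_b": ["http://y.example"]}
print(combine_votes(results, trust={"node_a": 0.9, "node_b": 0.4}))
```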

As certain nodes gain your trust, you'll be more likely to ask them first in the future. You also have the option of breaking some connections and connecting to new nodes, so ideally your node would favor connections with highly trusted nodes. Since trust is based on the quality of search results, the network should organize itself so that nodes run by users with similar interests tend to get closer together. In the end you'd have different regions of the network specializing in specific topics on the Web.
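And a correspondingly rough sketch of the trust feedback and rewiring, with made-up update rules and thresholds just to show the shape of the idea:

```python
def update_trust(trust, voters_for_chosen_result, gain=0.05, decay=0.01):
    """Reward neighbours whose vote matched what the user actually kept."""
    for neighbour in trust:
        if neighbour in voters_for_chosen_result:
            trust[neighbour] = min(1.0, trust[neighbour] + gain)
        else:
            trust[neighbour] = max(0.0, trust[neighbour] - decay)
    return trust

def maybe_rewire(trust, candidate_peers, threshold=0.2):
    """Drop the least useful neighbour and try someone new."""
    worst = min(trust, key=trust.get)
    if trust[worst] < threshold and candidate_peers:
        del trust[worst]
        trust[candidate_peers.pop()] = 0.5     # give the new peer a neutral start
    return trust
```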

That's the idea behind it, but of course the devil is in the details. Future posts will likely discuss some ideas for dealing with some of these details.