Monday, July 18, 2005

The SearchActive Idea

While I'm probably going to talk mostly about peer-to-peer networking on this blog, the key idea behind SearchActive is basically a swarm intelligence solution to searching the Web. In a nutshell, swarm intelligence involves studying the behavior of social creatures and then applying these behaviors to solving other problems. Most swarm intelligence approaches are based on social insects, like ants, bees, and termites. There has also been some work in studying the flocking behavior of birds. Taking ants for example, they basically do two things: they look for food, and when they find it they carry it back to their nest. While they're carrying food back, they leave a trail of pheromones, which other ants tend to follow. In essence, the ants solve the shortest path problem. Anyone who's played a lot of real time strategy games has probably experienced the difficulty computers seem to have solving this same problem.
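
To make that concrete, here's a toy Python simulation of the ant behavior. It's only a sketch under some simplifying assumptions: just two fixed paths between nest and food, made-up constants, and a deposit rule where ants on the shorter path lay down more pheromone per step (since they complete more round trips).

    import random

    # Two paths lead from the nest to the food. Ants pick a path with
    # probability proportional to its pheromone level, shorter trips
    # deposit more pheromone, and pheromone slowly evaporates.
    PATH_LENGTHS = {"short": 2.0, "long": 5.0}  # arbitrary lengths
    pheromone = {"short": 1.0, "long": 1.0}     # start out unbiased
    EVAPORATION = 0.05

    def choose_path():
        total = sum(pheromone.values())
        r = random.uniform(0, total)
        for path, level in pheromone.items():
            r -= level
            if r <= 0:
                return path
        return "long"  # floating point fallback

    for _ in range(1000):
        path = choose_path()
        # Shorter paths get reinforced faster.
        pheromone[path] += 1.0 / PATH_LENGTHS[path]
        # Evaporation makes stale trails fade away.
        for p in pheromone:
            pheromone[p] *= 1.0 - EVAPORATION

    print(pheromone)  # the short path ends up with most of the pheromone

The positive feedback is the whole trick: a small early advantage for the short path snowballs, and evaporation keeps the colony from locking onto bad trails forever.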

The analogy for searching the Web is closer to people wandering through town looking for something. Let's say our good friend Bob is looking for an ATM, and he's new to town. He starts out walking along, and he sees someone walking by. He asks that person if they know where the nearest ATM is, preferably one from his home bank so he can avoid paying ATM fees. This person doesn't know where that particular ATM is, but he does know where an ATM is, so he points Bob in the right direction and off he goes. For Bob this isn't ideal, but it's better than nothing, so he goes with it. Along the way, he sees another person walking down the street, so he asks this person the same question. This person tells Bob where an ATM is, and even one that's from the right bank, but the trouble is it's a couple blocks further. Bob then has to decide which is the better result based on whatever factors he thinks are important, such as the distance, how trustworthy each of these people is, and how much it's worth to him to not have to pay an ATM fee.

Anyway, later on, Bob decides he needs to find a drug store. Bob remembers passing one on his way to the ATM earlier that day, so he starts out heading the same direction. Along the way he asks some of the people he passes if there's a better one. He comes across one man who happens to be the owner of a drug store that's known for having a very poor selection and ridiculously high prices. This man obviously wants to improve his business, so he explains to Bob that his is by far the best drug store in town, and that it's just a block to the east. He's actually lying, as his store is a good three miles to the east. Anyway, Bob, not knowing any better, decides to try his luck. After going about a block the drug store is nowhere to be found, so he asks someone about it. This person responds that the store is actually a long way away and not really worth going to anyway, and recommends a couple of better drug stores to Bob. One of these is the one Bob was heading for in the first place, so he decides to go back to that one.

SearchActive will use a similar method to search the Web. First let's ignore the peer-to-peer aspects and just look at how it would work for one computer in isolation. SearchActive will be a program that runs on your computer and monitors all the web pages you visit. It assumes you can find at least one page without a search engine, which isn't that unreasonable because everyone at the very least needs to be able to find a search engine. When you go there, SearchActive catches the page, decides what the keywords on the page are, and stores it in its index. Then, in the background while you're busy reading your web page, SearchActive is following the links on that page and adding the linked pages to its index. When you click on a link, your click takes priority over anything else it's doing, so it goes and grabs the page, gives it to you, and then proceeds to index that page. If you're lucky, the link you clicked is one that SearchActive has already indexed, so it can give you a local copy instead, which makes things seem a lot quicker for you because it hides the download time.
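
Here's a rough Python sketch of what that local piece might look like. The stopword list and the keyword extractor are stand-ins I made up purely for illustration; the real thing would need to be much smarter about both.

    import re
    from collections import defaultdict
    from urllib.request import urlopen

    index = defaultdict(set)  # keyword -> set of URLs containing it

    STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

    def extract_keywords(text):
        # Crude stand-in for real keyword extraction: lowercase words,
        # minus stopwords and anything too short to be interesting.
        words = re.findall(r"[a-z]+", text.lower())
        return {w for w in words if w not in STOPWORDS and len(w) > 2}

    def index_page(url):
        # Fetch a page and file it under each of its keywords.
        html = urlopen(url).read().decode("utf-8", errors="replace")
        for keyword in extract_keywords(html):
            index[keyword].add(url)
        return html

    def prefetch_links(html):
        # The background step: follow the page's links and index those too.
        for link in re.findall(r'href="(http[^"]+)"', html):
            index_page(link)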

When you want to run a search, you give some keywords to SearchActive and it checks its index for pages it has seen that might be a good match. The process continues, and before long SearchActive is able to give you pretty good search results.
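
Continuing the sketch above, the search itself could start out as simple as counting keyword overlaps. This scoring is deliberately naive; ranking is the real problem, as the next paragraph says.

    def search(query, limit=10):
        # Score each indexed page by how many query keywords it contains.
        scores = defaultdict(int)
        for keyword in extract_keywords(query):
            for url in index.get(keyword, ()):
                scores[url] += 1
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:limit]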

Now, there needs to be some way to rank these pages, since most web pages don't have particularly good content. I'll talk about some ideas I have for this in a later post. This is really the hard part of searching the Web.

Now let's take your own computer running SearchActive and connect it to a few other computers running SearchActive (hereafter called nodes), which are in turn connected to a few other nodes, and so on. Now when you go to search, SearchActive not only looks at its own index, but also asks its neighbors if they know of any good pages for the keywords you gave. Each of the neighbors can either answer, or ask their own neighbors too. In essence, when each node responds it is voting for a particular page. Your node then decides how much each vote counts and combines them all to figure out what the best page is. In other words, you're deciding how much you trust certain nodes.
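
A minimal sketch of that vote combining, assuming (purely for illustration) that each response is just a list of recommended URLs and trust is a number between 0 and 1:

    from collections import defaultdict

    def combine_votes(responses, trust):
        # responses: {node_id: [recommended URLs]}
        # trust:     {node_id: weight between 0 and 1}
        scores = defaultdict(float)
        for node, urls in responses.items():
            weight = trust.get(node, 0.1)  # small default for unknown nodes
            for url in urls:
                scores[url] += weight
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)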

As certain nodes gain your trust, you'll be more likely to ask them first in the future. You also have the option of breaking some connections and connecting to new nodes, so ideally your node would favor connections with highly trusted nodes. Since the trust should be based on the quality of search results, the network should organize itself so that nodes run by users with similar interests tend to get closer together. You'd end up with different regions of the network specializing in specific topics on the Web.
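
One hypothetical way the trust could evolve: nudge a neighbor's score up when you actually click a result it recommended, down when you don't, and occasionally swap out the least trusted neighbor for a fresh connection. The update rule and the constants here are just one plausible choice, not a settled design.

    def update_trust(trust, node, clicked, rate=0.1):
        # Move trust toward 1 if the node's result was used, toward 0 if not.
        target = 1.0 if clicked else 0.0
        trust[node] += rate * (target - trust[node])

    def maybe_rewire(trust, neighbors, candidates, threshold=0.2):
        # Drop the least trusted neighbor in favor of a new connection.
        worst = min(neighbors, key=lambda n: trust.get(n, 0.0))
        if trust.get(worst, 0.0) < threshold and candidates:
            neighbors.remove(worst)
            neighbors.append(candidates.pop())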

That's the idea behind it, but of course the devil is in the details. Future posts will likely discuss some ideas for dealing with some of these details.

Saturday, July 16, 2005

About this Blog

In the same vein as many other developers who have started keeping publicly available blogs to discuss the progress they are making, I've created this blog as a place for me to discuss the work I'm doing on peer-to-peer web search.

The idea for this project came from a paper written by Paul De Bra and Reinier Post on the fish-search, which was a client-side Web search engine. It had a lot of good ideas, but in my opinion they were about 10 years ahead of their time. The Web is quite a bit different today than it was then, which means there are different constraints and goals in attempting a client-side search.

I plan to use this blog to detail the progress I'm making on my thesis and discuss some ideas I've had, both for my own benefit, and for anyone else who happens to be interested in this. It will probably have a lot of ramblings about various other peer-to-peer systems, different challenges I'm trying to overcome, interesting things I've read, and maybe announcements about other related projects that I find.

My thesis is primarily going to discuss how a distributed peer-to-peer web search engine could be accomplished, how it would deal with issues that peer-to-peer networks must deal with, and what would be the reason for doing it in the first place. I also hope to actually demonstrate this is possible with a program I've decided to call SearchActive.

SearchActive will be a practical implementation of the ideas I discuss in my thesis. The thesis will probably be relatively abstract, while SearchActive will let me get down into the guts of putting this thing together. It will also give me something to test the actual behavior of the system with. I am hoping to have at least a proof of concept ready by around January that I will let some of my friends use to test things out. Hopefully I will be able to conduct a larger scale test and include the results in my thesis. Once all this is done, I'm planning to release the project as open source, and hopefully a few other developers will join me. In that case, I will invite other people to post on this blog too so we can have several viewpoints on the development of SearchActive. Of course that's looking a ways ahead, so don't hold me to any of that.

So to sum it all up, this blog will be a place to post information about the progress on SearchActive, as well as discuss some of the more theoretical foundations of the project.

About Me

My name is Eric Holk. I am a Computer Science and Mathematics student at Rose-Hulman Institute of Technology. During my senior year, which officially starts September 1st, I'm going to be doing a thesis on peer-to-peer web search. The title I'm currently working with is "Decentralized Peer-to-Peer Web Search." Even though the school year doesn't start for another month and a half, I've already been doing some background reading and plenty of thinking about this project. I'll post more details about what I hope to accomplish with my thesis later, but this entry is supposed to be primarily about me.

I've spent most of my life growing up in the Seattle area. Rose-Hulman is in Indiana, so I spend most of the year out there these days. For the summer I have an internship at Sandia National Laboratories. The internship is going very well, and I'm having a lot of great experiences. I'm becoming relatively fond of Albuquerque as well.

Some of my hobbies include mountain biking, backpacking, camping, and snowboarding. Honestly, I don't get the chance to do these things very often anymore, but I keep hoping that I'll get back into them eventually. I'm also a pretty decent bass player, and I play some guitar as well. As far as my religious beliefs go, I'm a Christian, and an active member of the Church of Christ.

One of my extracurricular activities at school is the RHIT Robotics Team. Our goal is to build an autonomous helicopter that can complete the mission required by the International Aerial Robotics Competition. I'm currently the project manager of the team, but for the summer I've passed many of my responsibilities on to Jacob Krall. Besides managing things and doing administrative stuff, I work on the flight controller, the ground station/helicopter networking code, and the simulator.

My main interests in Computer Science seem to be related to artificial intelligence. I took a course winter quarter in Swarm Intelligence, which I absolutely loved. That course is actually where I got the inspiration for this project. My other interests include graphics, programming languages, and probably a few other things. As far as how I actually use my computer, I'm a Gentoo guy. Of course I use Windows too, but I'm using it less and less as time goes on. Probably the biggest draw of Linux for me is how much of the guts are exposed, which lets me tinker around a lot more with it. With Linux I can make what are, in my opinion, some pretty creative solutions to various challenges, while under Windows I seem to be mostly restricted to just using the computer.

Another interest is peer-to-peer networking, which shouldn't be surprising since I'm doing my senior thesis on it. Of the peer-to-peer projects I've seen, I think the most interesting from a social and technical standpoint is the Freenet Project, although the last time I checked, the lag was so high that it took a lot of patience to actually use it. One of the really fascinating things about peer-to-peer networking to me is how it intertwines technical and social issues. For example, Freenet is a technical way to fight censorship, which leads to several moral and ethical issues to wrestle with. You have to decide whether you want to accept responsibility for creating a system that could aid rebels in overthrowing a government, or a system whose very use in some places is illegal and could cause people to be put in prison, or even put to death. Luckily my specific addition to the world of peer-to-peer networking should be able to avoid most of these issues.

Anyway, I think that's enough about me. Thanks for reading, and I hope to provide a lot of information on this blog that should be useful to someone somewhere.