OpenSource search and competitive advantage
A recent post by Scoble describes how Microsoft developers are using Google heavily to research programming references.
When I visit Google there's a huge plasma screen that shows every Google search done in real-time (it only shows that a search was done, not what the search was about). Every time I look at that screen, Redmond, WA is doing more Google searches than most other large cities in the world - more, in fact, than the entire continent of Africa.
This reminded me of a story about a free patent search service. One large corporation (probably IBM) put a free patent search database on the web. Engineers from another large corporation (probably Xerox) were told not to use it, because if the competition knows what you are searching for they might beat you to a patent application.
Aggregating what people are searching for (and so what they are working on or thinking about), either generally or in a specific location, must give you a competitive advantage - a market researcher's dream come true, perhaps. All that, and it has a nice paranoia value for the competition too.
So what choices do you have if you want to build a web spider and privately index a subset of the Internet you want to research? One free software option is Apache Nutch, a web spider extension to Lucene Java. Some engines built on Nutch are listed on the Nutch Wikipedia page - these include Krugle for code search and Greener for searching for green products.
I looked at Nutch some time back. The documentation is a little light, so have a look at Wikipedia for a primer on web crawlers. Once installed, you seed Nutch with a set of URLs from some source (e.g. a directory like dmoz.org) and it creates a collection of segments from the crawled sites it can reach from the seeded URLs. It then indexes the segments. It can handle a variety of document types, including Microsoft PowerPoint. It has a Google-like web interface for searching your crawled sites (see Greener for an example).
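As a rough sketch, a one-shot crawl looks something like this (assuming a Nutch 0.8-era install; the exact flags and layout vary between versions):

```shell
# Seed directory: plain text files, one URL per line
mkdir urls
echo "http://dmoz.org/" > urls/seed.txt

# Crawl to link depth 3, fetching at most 50 pages per round;
# segments and the Lucene index end up under crawl/
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
```

The `-depth` and `-topN` settings keep a test crawl small; a serious crawl would raise both and tune the fetcher politeness settings in `nutch-site.xml`.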
I had figured you would get something like an rsync-style copy of each site's directory tree, but you have to go through Nutch itself to get at the documents directly.
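For a quick look at what was actually fetched you can dump a segment to plain files from the command line (again a hedged sketch - the subcommand name changed across versions, `segread` in 0.7 and `readseg` later, and the segment directory name below is just an example timestamp):

```shell
# Dump the parsed text and metadata of one crawled segment
# into human-readable files under segdump/
bin/nutch readseg -dump crawl/segments/20060401123456 segdump
less segdump/dump
```

For anything programmatic you would use the Java API the same tool is built on, rather than scraping the dump.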
The guy who created Lucene and Nutch, Doug Cutting, now works for Yahoo! on his OpenSource projects.
One startup using Nutch is Krugle. There is some analysis of their business model - Krugle on GigaOM. While Krugle looks like a useful service, it doesn't have the simplicity of the Google interface. As my web design friends might say, Krugle instead went all "tab-tastic", with each search getting its own tab. Google might use a keyword as part of their search to achieve the same result.
Just to see what code Krugle had released, I searched for "Krugle" using krugle.com and found only one link to Krugle, from the AWStats package. I did the same search on Google Code Search, which returned pages of results. It turned up lots of code by one of the Krugle co-founders, Ken Krugler. When I put "Krugler" into Krugle I got Ken's source too. Interestingly, this time Krugle also found a podcast of Ken Krugler on search-driven development. The podcast is not code, so you can forgive Google for missing it, but it does show how Krugle includes more general programming resources. Horses for courses, I suppose.
Scoble was pitching in his post above that Microsoft should buy Krugle, as they are better than the existing Microsoft offerings in the code search area. Krugle argue that search is fundamental to what modern developers do. Engineers are good at just-in-time comprehension, where they figure out an API at the point where they have to do something with it.
Search is part of many people's creative cycle and effective search can give competitive advantage. Krugle shows that Nutch can be the basis for a useful search service. So there is a free software option if ever you need to do some private research without being geolocated up there on someone's "huge plasma screen".