Fishing for Data

by Paul A. Strassmann

Computerworld

December 4, 2000


Knowing something about fishing helps you understand what you get when you click on AltaVista, Google, Ask Jeeves or any other popular search engine. First, you never search directly for what's on the Web. Of the estimated 550 billion documents floating in cyberspace, you get a chance to explore only about 1 billion. You "fish" in compartments that each search engine has set up, making it possible to easily retrieve your "catch" in seconds.

What you can find each time you cast your line is limited by the extraction and retention techniques that give you simplicity over thoroughness, meaning any findings will be superficial. What you get is what has already been found by a search engine "spider," such as pages logically linked to other pages. If others haven't shown interest in the documents you're looking for, the chances of those pages showing up in a search aren't good.
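
The dependence on links can be illustrated with a minimal sketch. It assumes a hypothetical five-page "Web" held in a dictionary (the page names and the `crawl` function are invented for illustration, not taken from any real search engine) and follows links outward from a single starting page, roughly as a spider might.

```python
from collections import deque

# Hypothetical toy "Web": each page maps to the pages it links to.
LINKS = {
    "home": ["news", "products"],
    "news": ["home", "archive"],
    "products": ["home"],
    "archive": [],
    "orphan-report": ["home"],  # links out, but nothing links to it
}

def crawl(seeds, links):
    """Return the set of pages reachable from the seeds by following links."""
    found = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in found:
                found.add(target)
                queue.append(target)
    return found

indexed = crawl(["home"], LINKS)
missed = sorted(set(LINKS) - indexed)
print(sorted(indexed))  # pages the spider can reach
print(missed)           # pages that exist but are never found
```

In this sketch the page "orphan-report" exists but never enters the index, because no other page points to it.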

Web authors' eagerness to be noticed has led to tricks for tuning Web pages to exploit the idiosyncrasies of proprietary search methods. For example, I get offers every week to dress up my own Web pages so that some searches will favor my content over that of my competitors.

Second, you'll never find out how complete or reliable a search engine's findings are. The boasts about unique "crawling" methods are only promises, because proof is always missing. Vendor-specific Web-exploring software decides what's revealed and what remains hidden. In rare instances, a search engine returns only a catch that an editorial staff has already sorted and classified by subject matter. For instance, one of the most frequently visited Web sites, Yahoo, employs more than 1,000 indexers who place Web content in predefined categories.

Third, existing search engines work off the flawed assumption that everything must be searchable in one pass. Consequently, you're encouraged to inquire by using only a few keywords. In some of the more sophisticated searches, a two- to four-word phrase is also allowed. Occasionally, you're offered "advanced" options where you can call for some elementary logic that would sort what to include from what to exclude.
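
A one-pass keyword query with elementary include/exclude logic might look something like the sketch below. The four documents, the `build_index` and `search` functions, and the word-splitting rule are all assumptions made for illustration; real engines use their own proprietary methods.

```python
import re
from collections import defaultdict

# Hypothetical documents standing in for indexed Web pages.
DOCS = {
    1: "Web search engines index pages found by a spider",
    2: "Fly fishing on the river with a light line",
    3: "A spider spins a web to catch insects",
    4: "Fishing for data on the Web with a search engine",
}

def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(doc_id)
    return index

def search(index, include, exclude=()):
    """Return ids of documents with every `include` word and no `exclude` word."""
    if not include:
        return []
    hits = set.intersection(*(index.get(w.lower(), set()) for w in include))
    for w in exclude:
        hits -= index.get(w.lower(), set())
    return sorted(hits)

INDEX = build_index(DOCS)
print(search(INDEX, ["web"]))                      # one broad keyword
print(search(INDEX, ["web"], exclude=["spider"]))  # "advanced" exclusion
```

Even in this tiny collection, the single keyword "web" returns both the search-engine pages and the page about arachnids, and the query gets exactly one pass at sorting them out.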

Unfortunately, this doesn't work well in cases where the choice of a popular term, such as "Web," yields more than 30 million results, in no particular order. The existing search engines don't let you play a game of 20 questions, which would make it possible for an inquirer to interact sequentially with databases. In that game, when an ambiguous question is asked, it usually takes a few tries before the query's correct meaning is understood. Then, it may take repeated give-and-take before you get a meaningful response. No current search engine engages in anything that would look like an intelligent exploration of what's being sought.
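
The game of 20 questions could be sketched roughly as follows. This is only an illustration of the sequential idea, not a description of any existing product: the result set, the term lists and the `best_question` and `refine` functions are hypothetical, and the rule of asking about the term that splits the remaining results most evenly is one assumed strategy among many.

```python
# Hypothetical result set for an ambiguous one-word query such as "web".
# Each result id maps to the terms that describe it.
RESULTS = {
    "a": {"web", "spider", "silk", "insects"},
    "b": {"web", "search", "engine", "spider"},
    "c": {"web", "browser", "pages"},
    "d": {"web", "search", "engine", "index"},
}

def best_question(candidates, asked):
    """Pick the unasked term that splits the candidates most evenly."""
    terms = set().union(*candidates.values()) - asked
    half = len(candidates) / 2

    def imbalance(term):
        count = sum(term in words for words in candidates.values())
        return abs(count - half)

    # Sorting first breaks ties alphabetically rather than by set order.
    return min(sorted(terms), key=imbalance, default=None)

def refine(results, answer, max_questions=20):
    """Narrow results with yes/no questions; answer(term) returns a bool."""
    candidates, asked = dict(results), set()
    for _ in range(max_questions):
        if len(candidates) <= 1:
            break
        term = best_question(candidates, asked)
        if term is None:  # nothing left to ask about
            break
        asked.add(term)
        wanted = answer(term)
        narrowed = {k: v for k, v in candidates.items() if (term in v) == wanted}
        if narrowed:  # ignore an answer that would eliminate everything
            candidates = narrowed
    return sorted(candidates)

# Simulate an inquirer who is really looking for result "d".
print(refine(RESULTS, lambda term: term in RESULTS["d"]))
```

Here the inquirer's yes/no answers stand in for the give-and-take described above: each answer removes part of the ambiguous result set until a single result remains.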

The Web represents the most awesome accumulation of data ever. But converting it into knowledge capital requires better retrieval methods. The shallow techniques of commercial search engines may satisfy casual employee inquiries, most journalists and certainly all politicians, but they won't serve the needs of comprehensive commercial intelligence. As these engines are deployed as elements of corporate business intelligence, their simplistic inadequacies become dangerous because they may not find information on certain critical events that can make a difference in one's research.

If the goal of your knowledge management program is to retain the relative simplicity of existing keyword searches, most of your employees will waste time browsing the Web and end up with incomplete findings. Even if you invest serious money in enterprise-specific knowledge repositories such as sophisticated information-mining software, you'll still need more tools to fully explore the Web.

Start planning to buy search software that allows an ongoing discourse between those who ask questions and those who answer. Finding, correctly identifying and then deploying superior information intelligence will be a decisive weapon in tomorrow's world of business competition.


Strassmann (paul@strassmann.com) believes information competition is the extension of economic competition that has been pursued by material means.


Copyright 2000 by IDG Communications, Inc., 500 Old Connecticut Path, Framingham, MA 01701.
Reprinted by permission of Computerworld
