Reference Web vs. the Incremental Web: How the current discovery methods will break

Google searches the reference Internet. Users come to Google with a specific query and search a vast corpus of largely static information. This is a very valuable and lucrative service to provide: it’s the Yellow Pages.
Weblogs, on the other hand, may look like just another HTML page, but they are chronologically organized. Their posts are structured data: well tagged and easy to discover. That makes ranking and indexing easier for Weblogs. A search engine can give more weight to keywords that appear in the <dc:subject> or <title> tags than to keywords in the <description> tag. Thanks to this tagging, the ranking scheme almost stops being somebody’s personal algorithm. Compare this with how Google assigns its magic rank to the unstructured web: more weight is given to words that appear in the HTML <title> tag or in the link text of an <a> tag (an oversimplification; Google does a tad more). The same scheme is applied to the <H1> and <B> tags. The logic of doing this is obvious.
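To make the tag-weighting idea concrete, here is a minimal sketch in Python. It is only an illustration of the principle, not how any real search engine ranks feeds: the weights in FIELD_WEIGHTS, the sample feed, and the function names score_item and rank are all made up for this example.

```python
import xml.etree.ElementTree as ET

# Dublin Core namespace, used by <dc:subject> in many RSS feeds.
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical weights: a hit in <title> or <dc:subject> counts three
# times as much as a hit in <description>. The numbers are arbitrary.
FIELD_WEIGHTS = {"title": 3.0, "subject": 3.0, "description": 1.0}

# A made-up two-item feed, for illustration only.
SAMPLE_FEED = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<title>Example</title>
<item>
<title>Parsing dates in Java</title>
<dc:subject>java</dc:subject>
<description>Notes on SimpleDateFormat.</description>
</item>
<item>
<title>Weekend trip</title>
<dc:subject>travel</dc:subject>
<description>We drank java coffee all weekend.</description>
</item>
</channel></rss>"""


def score_item(item, keyword):
    """Sum weighted whole-word occurrences of keyword across the item's fields."""
    keyword = keyword.lower()
    subjects = item.findall(f"{{{DC}}}subject")
    fields = {
        "title": item.findtext("title", default=""),
        "subject": " ".join(s.text or "" for s in subjects),
        "description": item.findtext("description", default=""),
    }
    return sum(
        FIELD_WEIGHTS[name] * text.lower().split().count(keyword)
        for name, text in fields.items()
    )


def rank(feed_xml, keyword):
    """Return (score, title) pairs for every item, best match first."""
    root = ET.fromstring(feed_xml)
    items = root.findall("./channel/item")
    scored = [(score_item(i, keyword), i.findtext("title", default="")) for i in items]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    for score, title in rank(SAMPLE_FEED, "java"):
        print(score, title)
```

The post that is actually about Java (keyword in both <title> and <dc:subject>) outranks the one that merely mentions the word in its <description>.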
At the outset, there is not much difference between regular HTML pages and Weblogs. However, HTML can be read only with a browser, while Weblogs can be read with a browser as well as with client-based (NewsMonster, Gush, etc.) and web-based (Bloglines, Feedster, etc.) applications. Thanks to standardized delivery formats such as RSS and Atom, a Weblog can be read on any custom software or device.
Google works best on the Reference web: the web that consists primarily of HTML pages, where the content is not tagged beyond what is required to render the HTML markup. Try searching Google for the latest conversation on Java. The top site is from Sun. For a different twist, try searching for help on formatting or parsing a java.util.Date object: the search results point to discussions of the deprecated APIs. This is the reference web. Here the content does not say what it is or what it refers to. It’s the search engine’s algorithm that decides how to cut, chop and present it.
Contrast this with the incremental web: the content says what it is, what categories it belongs to and when it was published.
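As a rough illustration of content that describes itself, the Python sketch below reads those three facts straight out of a single RSS 2.0 item. The sample item and the describe function are hypothetical; the only assumption about the format is that the item carries the standard <title>, <category> and <pubDate> elements.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# A made-up RSS 2.0 item, for illustration only.
SAMPLE_ITEM = """<item>
<title>Reference Web vs. the Incremental Web</title>
<category>Search</category>
<category>Weblogs</category>
<pubDate>Wed, 16 Feb 2005 10:45:00 GMT</pubDate>
</item>"""


def describe(item_xml):
    """Return what the item says about itself: title, categories, publish time."""
    item = ET.fromstring(item_xml)
    return {
        "title": item.findtext("title", default=""),
        "categories": [c.text for c in item.findall("category") if c.text],
        # RSS 2.0 dates use the RFC 822 format, which the standard
        # library's email utilities can parse.
        "published": parsedate_to_datetime(item.findtext("pubDate")),
    }


if __name__ == "__main__":
    info = describe(SAMPLE_ITEM)
    print(info["title"], info["categories"], info["published"].isoformat())
```

No guessing is involved: a reader or search tool can filter by category or sort by date without inferring either from the page text.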
I think this is an immense opportunity, some of which is being addressed by Topix, Technorati, Feedster, etc. But Weblog searching is still in its infancy. With traditional search techniques, the wheat (the blog entries I want to read) and the chaff (the blog entries I want to avoid) go through the grinder together.
In the grand scheme of things, I think we are on the path to the Semantic Web.

This entry was posted on Wednesday, February 16th, 2005 at 10:45 am and is filed under Search.
