Archive for the ‘Search’ Category

Microsoft’s desperate acquisition of aQuantive: What they could have done instead

Tuesday, May 22nd, 2007

Are 70,000+ Microsoft employees that useless? Or does the 1,000+ strong senior management have no clue? I can’t believe that Microsoft would pay an 85% premium and fork out $6bn! They could have bought for that money and saved $500 million to acquire several smart startups in the search and online advertising space. Microsoft could have successfully negated Google’s further foray into the growing on-demand business and might have brought raw energy to Redmond.
However, by the time Microsoft figures out the monetization model for search and the online ad market, Google will have started making significant inroads into the on-demand business and productivity applications. Why don’t they focus on solving problems? Well, it’s the big-company syndrome: I’m a manager, I don’t solve problems, I get vendors or acquire companies. What a pity.
Does any B-school offer a course on intrapreneurship? Take a leaf from Google; even HP has a lesson to offer with its recent release of the NeoView business intelligence product.

Following The Pirates, er, pirates follow the market

Monday, July 17th, 2006

In 1996, on the footpaths of New Delhi near the Red Fort, I bought my copy of Bob Cringely’s Accidental Empires: How the Boys of Silicon Valley Make Their Millions... for INR 10 (around 25 cents in USD). That’s how free markets work. It looked like a local reprint (I still remember it was a Penguin reprint).
John Battelle is excited about the popularity of his book, “The Search”, reaching the streets of Mumbai. In India the legal edition of John’s book is priced at INR 728, which is close to Amazon’s price in the US. If the cost is prohibitive, people will figure out a way to get access to it. That’s where pirates fill the gap.
And yeah, I also bought albums of Mariah Carey, Europe, Scorpions, Metallica, Madonna, AC/DC, Guns N’ Roses, Bruce Springsteen, U2, etc. when none of the record companies were legally selling them in India in the late 80s/early 90s 🙂 Markets figure them out.

’s Blog Search Engine: Never too late for the fight

Saturday, June 3rd, 2006

finally rolled out its search engine for blogs and joined the race with Technorati, Feedster, Sphere, Google, PubSub, Yahoo and at least 10 other upcoming search services for blogs. While Feedster’s demise has been predicted by Jeremy Z., ’s offering is a nice refresh.
There are actually two separate announcements: the blog search and an enhancement of Bloglines’ search. The Bloglines search interface gets some obvious enhancements, including the ability to view full posts.

RSS is a Data Model not an API

Wednesday, September 7th, 2005

Just saw Nivi’s post on RSS is an API. Agreed RSS gives you access to a web site

Writing A Feed Crawler in Java

Sunday, February 27th, 2005

Thanks to my current day job, I haven’t written any serious (Java) code for a long time. A couple of weeks ago, I was mulling over how to overcome this loss of habit and, more so, how to refresh my skills. So I started writing a crawler and a harvester for RSS/Atom feeds. It may not be a big deal, but I had never written one before. It became interesting once I fired up Eclipse. The harvester returned 280 blogs within my second degree of separation. If your weblog stats have shown a sharp uptrend over the last 10 days, then you know whom to thank.
I picked ROME as the feed parser of choice. ROME has a couple of bugs, which I’ll be reporting to the developers. The nastiest one is its inability to parse Atom feeds from I may know the solution, but I need to go through the ROME source code in order to fix it or suggest a solution.
The harvester/crawler source uses simple for HTTP transfers. The next step is to make it work over Java NIO in order to “up” the performance of network I/O for frequent updates and a large set of feeds.
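For readers who want a feel for the harvesting step, here is a minimal sketch. It is not the harvester described above and it does not use ROME: it assumes plain RSS 2.0 input already fetched into a string, and it leans on the XML parser bundled with the JDK. The names FeedItemExtractor, Item and extractItems are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of ROME or of the original harvester.
class FeedItemExtractor {

    /** One entry pulled out of a feed: just a title and a link. */
    static final class Item {
        final String title;
        final String link;

        Item(String title, String link) {
            this.title = title;
            this.link = link;
        }
    }

    /** Parses RSS 2.0 text and returns its item elements in document order. */
    static List<Item> extractItems(String rssXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Feeds come from strangers, so refuse DOCTYPE declarations outright.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(rssXml)));

            List<Item> items = new ArrayList<>();
            NodeList nodes = doc.getElementsByTagName("item");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element item = (Element) nodes.item(i);
                items.add(new Item(childText(item, "title"), childText(item, "link")));
            }
            return items;
        } catch (Exception e) {
            // Malformed XML is common in the wild; surface it as one unchecked error.
            throw new IllegalArgumentException("Could not parse feed", e);
        }
    }

    /** Text of the first descendant element with the given tag, or "" if absent. */
    private static String childText(Element parent, String tag) {
        NodeList matches = parent.getElementsByTagName(tag);
        return matches.getLength() == 0 ? "" : matches.item(0).getTextContent().trim();
    }
}
```

A real harvester would also have to cope with Atom, the older RSS dialects and broken markup, which is exactly the work a library like ROME takes off your hands.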
Writing the code was easy once the hands got dirty. The big task is to figure out what to do with this code. How about a me-too of Bloglines, Technorati or Kinja?
Hello to the world of raw structured data and its various formats, viz. RSS 0.90/0.91/0.92/0.93/0.94/1.0/2.0 and Atom 0.3 (including the standard and proposed RSS modules)!

Reference Web vs. the Incremental Web: How the current discovery methods will break

Wednesday, February 16th, 2005

Google searches the reference Internet. Users come to Google with a specific query, and search a vast corpus of largely static information. This is a very valuable and lucrative service to provide: it’s the Yellow Pages.
On the other hand, weblogs (which look like yet another HTML page) are chronologically organized. The posts are structured data, well tagged, and facilitate easy discovery. Ranking and indexing become easier in the case of a weblog. A search engine may assign a higher rank to keywords appearing in the <dc:subject> or <title> tags compared to the content in the <description> tag. Thanks to this tagging, the ranking scheme hardly becomes somebody’s personal algorithm. Compare this to how Google assigns the magic rank to the unstructured web: more weight is given to words appearing in the HTML <title> tag or the text of the links in the <a> tag (an oversimplification here, Google does a tad more). The same scheme is applied to the <H1> and <B> tags. The logic of doing this is obvious.
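The field-weighting idea above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the weights (3 for <title> and <dc:subject>, 1 for <description>) are invented, the class name FieldWeightedScorer is hypothetical, and no real search engine is claimed to score this way.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical scorer: counts keyword hits per feed field and weights them.
class FieldWeightedScorer {

    // Assumed weights, chosen only for illustration: metadata beats body text.
    private static final Map<String, Integer> WEIGHTS = new LinkedHashMap<>();
    static {
        WEIGHTS.put("title", 3);
        WEIGHTS.put("dc:subject", 3);
        WEIGHTS.put("description", 1);
    }

    /** Sums weight times the number of whole-word keyword matches over the known fields. */
    static int score(Map<String, String> entryFields, String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        int total = 0;
        for (Map.Entry<String, Integer> weight : WEIGHTS.entrySet()) {
            String text = entryFields.getOrDefault(weight.getKey(), "");
            total += weight.getValue() * countWord(text, needle);
        }
        return total;
    }

    /** Counts case-insensitive whole-word occurrences of needle in text. */
    private static int countWord(String text, String needle) {
        // Split on anything that is not a letter or digit, then compare whole words.
        return (int) Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+"))
                .filter(needle::equals)
                .count();
    }
}
```

With these weights, a post that mentions a keyword once in its title outranks a post that mentions it twice only in the body, which is the whole point of trusting the tags.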
On the surface, the difference between regular HTML pages and weblogs is not much. However, HTML can be read only with a browser, while weblogs can be read with the browser and with other client-based (NewsMonster, Gush, etc.) and web-based (Bloglines, Feedster, etc.) applications. Thanks to a standardized delivery medium like RSS or Atom, a weblog can be read on any custom software or device.
Google works best on the reference web, the web that is primarily made up of HTML pages whose content is not tagged beyond what is required for rendering the HTML markup. Try searching on Google for the latest conversation on Java. The top site is from Sun. For a different twist, try searching for some help on formatting or parsing a java.util.Date object: the search results reference the discussion around the deprecated APIs. This is the reference web; here the content does not say what it is and what it refers to. It’s the search engine’s algorithm that decides how to cut, chop and present.
Contrast this with the incremental web: the content says what it is, which categories it belongs to and when it was published.
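To make that concrete, here is a rough sketch of what self-describing content buys you: once an entry carries its own categories and publication time, picking out “recent posts about X” is a plain filter rather than a ranking guess. The Entry type and the select method are hypothetical names, and the entries are assumed to be already parsed from a feed.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical filter over already-parsed feed entries.
class IncrementalFeedFilter {

    /** Assumed in-memory form of a feed entry after parsing. */
    static final class Entry {
        final String title;
        final Set<String> categories;
        final Instant published;

        Entry(String title, Set<String> categories, Instant published) {
            this.title = title;
            this.categories = categories;
            this.published = published;
        }
    }

    /** Keeps entries that carry the category and were published at or after the cutoff. */
    static List<Entry> select(List<Entry> entries, String category, Instant since) {
        List<Entry> kept = new ArrayList<>();
        for (Entry entry : entries) {
            if (entry.categories.contains(category) && !entry.published.isBefore(since)) {
                kept.add(entry);
            }
        }
        return kept;
    }
}
```

Nothing comparable is possible on the reference web without first guessing the topic and the date from the page itself.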
I think this is an immense opportunity, some of which is being addressed by Topix, Technorati, Feedster, etc. But weblog searching is still in its infancy. With the traditional search techniques, the wheat (the blog entries I want to read) and the chaff (the blog entries I want to avoid) go through the grinder together.
In the grand scheme of things, I think we are on the path to the Semantic Web.