Writing A Feed Crawler in Java

Thanks to my current day job, I haven’t written any serious (Java) code for a long time. A couple of weeks ago, I was mulling over how to overcome this loss of habit and, more to the point, refresh my skills. So I started writing a crawler and a harvester for RSS/Atom feeds. It may not be a big deal, but I had never written one before. Things got interesting once I fired up Eclipse. The harvester returned 280 blogs at my second degree of separation. If your weblog stats have shown a sharp uptrend over the last 10 days, you know whom to thank.
I picked ROME as my feed parser of choice. ROME has a couple of bugs, which I’ll be reporting to the developers. The nastiest one is its inability to parse Atom feeds from blogspot.com. I may know the solution, but I need to go through the ROME source code before I can fix it or suggest a fix.
The harvester/crawler source uses the simple java.net.HttpURLConnection for HTTP transfers. The next step is to make it work over Java NIO in order to “up” the network I/O performance for frequent updates and a large set of feeds.
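For the curious, the fetch step can be sketched roughly as follows. This is a minimal illustration of pulling one feed over java.net.HttpURLConnection, not the harvester's actual source: the class name FeedFetcher, the User-Agent string, and the 10-second timeouts are all assumptions made up for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper class; the name and structure are illustrative only.
class FeedFetcher {

    // Drain an InputStream into a byte array.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    // Fetch the raw bytes of a feed with a blocking HTTP GET.
    static byte[] fetch(String feedUrl) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(feedUrl).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(10_000); // assumed timeout, in milliseconds
        conn.setReadTimeout(10_000);    // assumed timeout, in milliseconds
        // Assumed User-Agent value, so site owners can identify the crawler.
        conn.setRequestProperty("User-Agent", "ExampleFeedHarvester/0.1");
        try {
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected HTTP status " + status + " for " + feedUrl);
            }
            try (InputStream in = conn.getInputStream()) {
                return readAll(in);
            }
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java FeedFetcher <feed-url>");
            return;
        }
        byte[] body = fetch(args[0]);
        System.out.println("Fetched " + body.length + " bytes");
    }
}
```

The bytes returned by fetch would then be handed to the feed parser. Each call blocks a thread until the transfer finishes, which is the cost a move to NIO is meant to reduce once the feed list grows.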
Writing the code was easy once I got my hands dirty. The bigger task is figuring out what to do with it. How about a me-too of Bloglines, Technorati or Kinja?
Hello to the world of raw structured data and its various formats, viz. RSS 0.90/0.91/0.92/0.93/0.94/1.0/2.0 and Atom 0.3 (including the standard and proposed RSS modules)!
