{"id":51,"date":"2005-02-27T21:39:10","date_gmt":"2005-02-28T03:39:10","guid":{"rendered":"http:\/\/www.khaitan.org\/blog\/2005\/02\/writing-a-feed-crawler-in-java\/"},"modified":"2005-02-27T21:39:10","modified_gmt":"2005-02-28T03:39:10","slug":"writing-a-feed-crawler-in-java","status":"publish","type":"post","link":"https:\/\/www.khaitan.org\/blog\/2005\/02\/writing-a-feed-crawler-in-java\/","title":{"rendered":"Writing A Feed Crawler in Java"},"content":{"rendered":"<p>Thanks to my current day job, I haven&#8217;t written any serious (Java) code for a long time. A couple of weeks ago, I was mulling over how to overcome this loss of habit, or rather how to refresh my skills. So, I started writing a crawler and a harvester for RSS\/Atom feeds. It may not be a big deal, but I had never written one before. It became interesting when I fired up Eclipse. The harvester returned 280 blogs at my second degree of separation. If your weblog stats have shown a sharp uptrend over the last 10 days, then you know who to thank.<br \/>\nI picked <a href=\"https:\/\/rome.dev.java.net\/\">ROME<\/a> as the feed parser of choice. ROME has a couple of bugs, which I&#8217;ll be reporting to the developers. The nastiest one is its inability to parse Atom feeds from blogspot.com. I may know the solution, but I need to go through the ROME source code to fix it or suggest a solution.<br \/>\nThe harvester\/crawler source uses plain <a href=\"http:\/\/java.sun.com\/j2se\/1.4.2\/docs\/api\/java\/net\/HttpURLConnection.html\">java.net.HttpURLConnection<\/a> for HTTP transfers. The next step is to make it work over <a href=\"http:\/\/java.sun.com\/j2se\/1.4.2\/docs\/guide\/nio\/\">Java NIO<\/a> in order to &#8220;up&#8221; the performance of network I\/O for frequent updates and a large set of feeds.<br \/>\nWriting the code was easy once I got my hands dirty. The big task is to figure out what to do with this code. 
How about a <i>me-too<\/i> of <a href=\"http:\/\/www.bloglines.com\/\">Bloglines<\/a>, <a href=\"http:\/\/www.technorati.com\/\">Technorati<\/a> or Kinja?<br \/>\nHello to the world of <i>raw<\/i> structured data and its various formats, viz. RSS 0.90\/0.91\/0.92\/0.93\/0.94\/1.0\/2.0 and Atom 0.3 (including the <a href=\"http:\/\/web.resource.org\/rss\/1.0\/modules\/standard.html\">standard<\/a> &#038; <a href=\"http:\/\/web.resource.org\/rss\/1.0\/modules\/proposed.html\">proposed<\/a> RSS modules)!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Thanks to my current day job, I haven&#8217;t written any serious (Java) code for a long time. A couple of weeks ago, I was mulling over how to overcome this loss of habit, or rather how to refresh my skills. So, I started writing a crawler and a harvester for RSS\/Atom feeds. It may not be a [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[16],"tags":[],"_links":{"self":[{"href":"https:\/\/www.khaitan.org\/blog\/wp-json\/wp\/v2\/posts\/51"}],"collection":[{"href":"https:\/\/www.khaitan.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.khaitan.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.khaitan.org\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.khaitan.org\/blog\/wp-json\/wp\/v2\/comments?post=51"}],"version-history":[{"count":0,"href":"https:\/\/www.khaitan.org\/blog\/wp-json\/wp\/v2\/posts\/51\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.khaitan.org\/blog\/wp-json\/wp\/v2\/media?parent=51"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.khaitan.org\/blog\/wp-json\/wp\/v2\/categories?post=51"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.khaitan.org\/blog\/wp-json\/wp\/v2\/tags?post=51"}],"curies":[{"name":"
wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}