Wednesday, April 28, 2004

Googlebot taking "random stabs" at Atom/RDF Files

In an article at WebProNews, Garrett French reported that Googlebot has lately been requesting non-existent files from root directories, leading some to suspect that Google is planning a blog search addition to its results, possibly even a new link above the search box.



The article was based on a report from Matthew Mullenweg (Photo Matt blog), who said that he has been getting random requests from Googlebot for atom.xml and index.rdf files on his own blog and on other sites. Mullenweg also reported other random activity, such as requests for non-existent subdirectories, usually /blog or similar. Since the sites run WordPress and nothing on them mentions or links to atom.xml or index.rdf, he assumes that Googlebot is guessing that these files will be there.
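
For anyone who wants to check their own site for the same behavior, here is a minimal sketch in Python. It assumes an Apache-style "combined" access log, and the sample log lines, the documentation IP address, and the function name feed_probes are made up for illustration; they are not taken from Mullenweg's report.

```python
import re

# Assumption: the server writes Apache "combined" format log lines.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

# The two file names mentioned in the report.
FEED_NAMES = ("atom.xml", "index.rdf")


def feed_probes(lines):
    """Return the paths of Googlebot requests for feed files that got a 404."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # not a combined-format line, skip it
        if "googlebot" not in match.group("agent").lower():
            continue  # some other visitor
        path = match.group("path")
        filename = path.rsplit("/", 1)[-1]
        if filename in FEED_NAMES and match.group("status") == "404":
            hits.append(path)
    return hits


# Hypothetical sample lines (192.0.2.10 is a reserved documentation address).
SAMPLE_LOG = [
    '192.0.2.10 - - [20/Apr/2004:10:00:00 -0700] "GET /atom.xml HTTP/1.0" '
    '404 209 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '192.0.2.10 - - [20/Apr/2004:10:00:05 -0700] "GET /blog/index.rdf HTTP/1.0" '
    '404 209 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '192.0.2.10 - - [20/Apr/2004:10:00:09 -0700] "GET /index.php HTTP/1.0" '
    '200 5120 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '192.0.2.55 - - [20/Apr/2004:10:01:00 -0700] "GET /atom.xml HTTP/1.0" '
    '404 209 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
]

if __name__ == "__main__":
    for path in feed_probes(SAMPLE_LOG):
        print(path)
```

Matching on the user-agent string alone is a simplification, since anyone can claim to be Googlebot; a careful check would also verify where the request came from.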



This activity comes on the heels of My Yahoo's recent inclusion of RSS feeds, which can now be added to any personal page. Yahoo Search is also featuring RSS feeds in its results, along with an easy-to-click option for adding those feeds to My Yahoo. Yahoo Search does not include Atom feeds; Atom is the XML feed format used by Blogger.com, which is owned by Google.



Dave Winer, a proponent of RSS, is crying foul, all but accusing Google of antitrust violations and warning of a PR disaster in the making. Winer says, "I never in a million years thought Google would stoop this low, even Microsoft on its worst day never played this dirty."



In a follow-up comment, Greg R. of Ten Reasons Why offers a more measured view: "A simple, non-conspiratorial explanation is that googlebot is attempting to find RSS 1.0 and Atom feeds on sites that don't have those feeds linked."



Greg adds, "Assuming Google wants to index all feeds (and it doesn't appear that Google has ceased indexing RSS 2.0 feeds), if googlebot came across a site that had an RSS 2.0 feed, it makes sense to take a random stab at the common RSS 1.0 and Atom URIs to see if those feeds are also present. (Since some tools, like MT, generate them automatically, I find that people will quite frequently still have a live Atom feed on their site, even if they've only linked the RSS 2.0 page.) This theory is supported by PhotoMatt (http://photomatt.net/archives/2004/04/20/google-cooking/) who points out that when searching for index.rdf and atom.xml, googlebot is always checking the root directory or any obvious sub-directory (like /blog). That indicates to me that they're just scouring for feeds by looking in the obvious locations."
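
To make Greg's theory concrete, here is a rough Python sketch of what "looking in the obvious locations" could amount to. This is only an illustration of the guesswork he describes, not Google's actual crawler logic, which is not public; the function name candidate_feed_urls and the example.com address are hypothetical, and the lists of file names and directories are just the ones mentioned in the reports above.

```python
from urllib.parse import urljoin

# File names reported in the Googlebot requests.
FEED_NAMES = ("index.rdf", "atom.xml")

# "Obvious" places to look: the root directory and /blog.
# Assumption: a real crawler could try other directories as well.
DIRECTORIES = ("/", "/blog/")


def candidate_feed_urls(site_root):
    """Build the list of feed URLs a crawler might guess for one site."""
    return [
        urljoin(site_root, directory + name)
        for directory in DIRECTORIES
        for name in FEED_NAMES
    ]


if __name__ == "__main__":
    for url in candidate_feed_urls("http://example.com/"):
        print(url)
```

Each guessed URL that does not exist shows up in the site's logs as a 404, which matches what Mullenweg saw.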



This sounds plausible, and I tend to agree with it (for now anyway). What is intriguing is that Google never takes "random stabs" at anything. This appears to be a unique event in Googlebot crawling history; it is, in fact, unprecedented. But Google is not in the same situation it was in a year ago, when it did not have Yahoo (and MSN) breathing down its neck. Either one of them could topple Google; both of them together just might.



That is an interesting thought to ponder for now, rather than conspiracy theories: random stabs. Now that is something to talk about.
