Sunday, May 27, 2007

Yahoo's Robots-NoContent: Another shade of NoFollow

Have you looked closely at Yahoo's new "robots-nocontent" tag? The tag was born from discussions at a recent Robots.txt Summit, where one topic was adding a way for web pages to identify the parts that are not relevant to their main content.

So what is considered non-relevant to the main content? Priyank Garg of Yahoo! Search summarizes it as navigation, menus repeated across the entire site, boilerplate text, or even advertising. So, basically, everything outside of your main content area is non-relevant.
"This tag is really about our crawler focusing on the main content of your page and targeting the right pages on your site for specific search queries."
Actually, the term "tag" is a misnomer. Yahoo's proposal is that you implement this with a class attribute (i.e., class="robots-nocontent") on any content that is extraneous to the main unique content of the page. The Slurp help page cites numerous examples of how to apply this class to your HTML -- too numerous.
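To make the mechanics concrete, here is a rough sketch of how a crawler might honor the class. It is not Slurp's actual logic, which Yahoo has not published in this form; the names NoContentFilter, indexable_text, and PAGE are invented for illustration, and the sample markup is made up. It assumes reasonably well-balanced HTML.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class NoContentFilter(HTMLParser):
    """Collects page text, skipping anything inside class="robots-nocontent".

    Hypothetical helper: assumes every non-void start tag has a matching end tag.
    """

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside an excluded element
        self.kept = []       # text fragments a crawler would treat as content

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        # Once inside an excluded element, every nested element is excluded too.
        if self.skip_depth or "robots-nocontent" in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.kept.append(data.strip())


# Made-up page: navigation and ads carry the class, the article does not.
PAGE = """
<div class="robots-nocontent"><a href="/">Home</a> <a href="/about">About</a></div>
<div><p>The article text that should be indexed.</p></div>
<p class="sidebar robots-nocontent">Sponsored links</p>
"""


def indexable_text(html):
    """Return the text fragments left after dropping robots-nocontent blocks."""
    parser = NoContentFilter()
    parser.feed(html)
    parser.close()
    return parser.kept


print(indexable_text(PAGE))
```

Note that the class sits alongside any other class names on the element, so the check has to split the attribute value rather than compare it as a whole.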

Maybe I am not getting this. Wouldn't it be simpler to identify what is the main content? Since Slurp apparently cannot detect what repeats across the pages of one domain, let's spell it out for them with a class="robots-this-is-content-dummie".

It would be a lot simpler to implement, too. Just wrap your content in a DIV element and you are done. If you have a section in the sidebar for "related" reference purposes, wrap one around that too.

Give them their own tag <robots>


The usage of the class attribute is all wrong anyway. This is more of a relation than a class; more aptly, it is meta-information that can aid robots in distinguishing relevant content passages from the fluff.

To date, we have two methods of conveying information to robots on what not to index -- the robots.txt file and the meta noindex in the head section of our pages. Why not give robots their own element for meta-information inside the body of the page?

My suggestion is to use a new tag -- <robot attr></robot>. An element that the browser does not recognize adds nothing visible of its own; the content it wraps is displayed to the viewer as usual. For all intents and purposes, the tag is invisible except to the spiders.

Each search engine could give the new robot tag an attribute of its own choosing. For instance, the Yahoo attribute would simply be slurp="directive", Google's would be googlebot="directive", etc. Directives could be "content", "content-related" and (ugggh) "no-content" -- but there would not be a need for the latter, now would there?
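As a rough sketch of how a spider could read such a tag, the snippet below groups page text by the directive addressed to one particular crawler. Nothing here is an existing standard or a real crawler's behavior: the <robot> element and its slurp/googlebot attributes are only the proposal above, and RobotTagReader, read_sections, and PAGE are names invented for the example.

```python
from collections import defaultdict
from html.parser import HTMLParser


class RobotTagReader(HTMLParser):
    """Groups text by the directive one crawler finds on enclosing <robot> tags.

    Hypothetical: <robot> is the element proposed in this post, not real HTML.
    """

    def __init__(self, bot_attr):
        super().__init__()
        self.bot_attr = bot_attr           # e.g. "slurp" or "googlebot"
        self.stack = []                    # directives of the currently open <robot> tags
        self.sections = defaultdict(list)  # directive -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag == "robot":
            # A <robot> tag lacking this crawler's attribute pushes None.
            self.stack.append(dict(attrs).get(self.bot_attr))

    def handle_endtag(self, tag):
        if tag == "robot" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        # The innermost directive addressed to this crawler wins.
        directive = next((d for d in reversed(self.stack) if d), None)
        if text and directive:
            self.sections[directive].append(text)


# Made-up page: the menu is unmarked, the article and sidebar are wrapped.
PAGE = """
<ul><li>Home</li><li>Archive</li></ul>
<robot slurp="content" googlebot="content"><p>Main article text.</p></robot>
<robot slurp="content-related"><p>Related reading.</p></robot>
"""


def read_sections(html, bot_attr):
    """Return {directive: [text, ...]} as seen by the crawler named by bot_attr."""
    reader = RobotTagReader(bot_attr)
    reader.feed(html)
    reader.close()
    return dict(reader.sections)


print(read_sections(PAGE, "slurp"))
print(read_sections(PAGE, "googlebot"))
```

Because the menu is simply left unwrapped, no "no-content" directive is needed: anything outside a <robot> element addressed to the crawler never makes it into a section.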

Another possibility is to keep in line with the other methods of communication -- attributes of Index and Follow could be used. The attribute of Rel (relation) could be used also in the form of "main", "related", or "plagiarized". I made that last one up to see if you were awake or not. But when you think about it -- quotations or full passages lifted from other sites should not qualify as content.

There is a need to have this form of control at our disposal, but the method needs to be thought out a little better. Yahoo's premature birth of their new baby was not even close; it was slipshod and messy.




3 comments:

  1. Very interesting read. Would love to hear Yahoo's take on this.

  2. Not only Yahoo, it would be great to hear from others as well.

  3. I would suggest changes to robots.txt instead. Something like:

    Noindex-class:advertising,links
    Noindex-id:nav,footer
