Tuesday, October 26, 2004

Finding mirrored hosts and duplicate content

One of the interesting technical challenges that every search engine comes up against is identifying duplicate content. Super-geek Greg Linden learned of a paper (PDF) on this very topic from his friend Jeff Dean, who co-authored it with Krishna Bharat, Andrei Z. Broder and Monika Rauch Henzinger.
"The paper analyzes the performance of several techniques for detecting mirrors, from simple approaches like the similar IP address or hostname to more complicated and quite clever analysis of the link structure of sites. The paper concludes that a content-based approach (called 'shingles' in the paper) works well but that a combination of several approaches works best."
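
For anyone who hasn't met the term, here is a rough sketch of the general shingling idea: break each document into overlapping runs of k consecutive words (the "shingles"), then compare two documents by how much their shingle sets overlap. This illustrates the basic technique only, not the paper's actual implementation; the function names, the whitespace tokenization and the window size k=4 are all my own assumptions.

```python
def shingles(text, k=4):
    """Return the set of k-word shingles (contiguous word windows) in text.

    Tokenizing on whitespace and k=4 are illustrative assumptions.
    """
    words = text.lower().split()
    if len(words) < k:
        # Too short for a full window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b, k=4):
    """Jaccard similarity of the two shingle sets: |A & B| / |A | B|."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty documents are trivially identical
    return len(sa & sb) / len(sa | sb)
```

A score near 1.0 suggests the two pages are near-duplicates, and a score near 0.0 suggests they share almost no text. At search-engine scale, comparing full shingle sets for every pair of pages would be far too expensive, so a real system would need some way to summarize or sample the sets first.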

