Monday, October 25, 2004

Google Labs MapReduce - Part of Florida Update?

Not too long ago, GoogleGuy broke his silence over at Webmaster World and pointed to a new paper:
"... if you didn't notice, Rob Pike is answering questions over at Slashdot:



http://interviews.slashdot.org/interviews/



Rob only joined Google a year or so ago, but I'm really glad we have him--I read his book a long time back. For the folks that like to read papers from Google, here's a systems one that Rob mentions:



http://labs.google.com/papers/mapreduce.html



I think it's pretty cool that Google keeps writing stuff like the GFS paper and the MapReduce paper; not too many other search engines are talking about their systems."
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.
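To make the model concrete, here is a minimal single-machine sketch of the map and reduce steps, using word counting as the task. This is not Google's implementation, which is distributed across many machines; the names run_mapreduce, map_fn and reduce_fn, along with the sample documents, are assumptions made up for illustration.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: take (document name, document text), emit (word, 1) per word."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce: merge all counts emitted for the same word."""
    return sum(values)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: group intermediate values by key,
    then apply the reduce function to each group."""
    intermediate = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            intermediate[ikey].append(ivalue)
    return {ikey: reduce_fn(ikey, ivalues)
            for ikey, ivalues in intermediate.items()}


# Made-up sample input: (document name, document text) pairs.
documents = [
    ("doc1", "the quick brown fox"),
    ("doc2", "the lazy dog and the fox"),
]
counts = run_mapreduce(documents, map_fn, reduce_fn)
print(counts)
```

The user only writes the two small functions; the grouping by intermediate key in the middle is the part a real system would spread over many machines.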



That newly released paper contains evidence that Google was using this process on its production index in August 2003. The next major update to that index, the Florida Update, appeared in November 2003.
"One of the most significant uses of MapReduce to date has been a complete rewrite of the production indexing system that produces the data structures used for the Google web search service. The indexing system takes as input a large set of documents that have been retrieved by our crawling system, stored as a set of GFS files. The raw contents for these documents are more than 20 terabytes of data."
Eric Baillargeon, who discovered this passage, wonders whether this is all coincidence.



One graph in the document shows the significant growth in the number of separate MapReduce programs checked into their primary source code management system over time, from 0 in early 2003 to almost 900 separate instances as of late September 2004.
"MapReduce has been so successful because it makes it possible to write a simple program and run it efficiently on a thousand machines in the course of half an hour, greatly speeding up the development and prototyping cycle. Furthermore, it allows programmers who have no experience with distributed and/or parallel systems to exploit large amounts of resources easily."

The graph makes it evident that Google's use of MapReduce has risen sharply since February 2004, which coincides with another update, Brandy.



Numerous backlink updates have also taken place since then. My assumption is that some of the new MapReduce programs were introduced into the parts of the source code that handle backlinks.



I don't think it is a coincidence, as Eric wondered; it is now a matter of record. MapReduce is now part of Google, and its mechanics will be discussed in many a forum from here on.
