developer expresses computations in the programming model, code is guaranteed to behave as expected. The upshot is that the developer is freed from having to worry about system-level details (e.g., no more debugging race conditions and addressing lock contention) and can instead focus on algorithm or application design.

Seamless scalability. For data-intensive processing, it goes without saying that scalable algorithms are highly desirable. As an aspiration, let us sketch the behavior of an ideal algorithm. We can define scalability along at least two dimensions. First, in terms of data: given twice the amount of data, the same algorithm should take at most twice as long to run, all else being equal. Second, in terms of resources: given a cluster twice the size, the same algorithm should take no more than half as long to run. Furthermore, an ideal algorithm would maintain these desirable scaling characteristics across a wide range of settings: on data ranging from gigabytes to petabytes, on clusters consisting of a few to a few thousand machines. Finally, the ideal algorithm would exhibit these desired behaviors without requiring any modifications whatsoever, not even tuning of parameters.
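To state the two criteria compactly (in notation of our own choosing, not the book's), let T(n, m) denote the running time of an algorithm on an input of size n using a cluster of m machines. The ideal algorithm sketched above would then satisfy, roughly:

```latex
% Hypothetical notation: T(n, m) is the running time on an input of
% size n with m machines, all else being equal.
\begin{align*}
  T(2n,\, m) &\le 2\,T(n,\, m)            && \text{(data: twice the data, at most twice the time)}\\
  T(n,\, 2m) &\le \tfrac{1}{2}\,T(n,\, m) && \text{(resources: twice the cluster, at most half the time)}
\end{align*}
```

and would do so across many orders of magnitude of n and m, with no parameter tuning.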
Other than for embarrassingly parallel problems, algorithms with the characteristics sketched above are, of course, unobtainable. One of the fundamental assertions in Fred Brooks's classic The Mythical Man-Month [19] is that adding programmers to a project behind schedule will only make it fall further behind. This is because complex tasks cannot be chopped into smaller pieces and allocated in a linear fashion, and is often illustrated with a cute quote: "nine women cannot have a baby in one month". Although Brooks's observations are primarily about software engineers and the software development process, the same is also true of algorithms: increasing the degree of parallelization also increases communication costs. The algorithm designer is faced with diminishing returns, and beyond a certain point, greater efficiencies gained by parallelization are entirely offset by increased communication requirements.

Nevertheless, these fundamental limitations shouldn't prevent us from at least striving for the unobtainable. The truth is that most current algorithms are far from the ideal. In the domain of text processing, for example, most algorithms today assume that data fits in memory on a single machine. For the most part, this is a fair assumption. But what happens when the amount of data doubles in the near future, and then doubles again shortly thereafter? Simply buying more memory is not a viable solution, as the amount of data is growing faster than the price of memory is falling. Furthermore, the price of a machine does not scale linearly with the amount of available memory beyond a certain point (once again, the scaling "up" vs. scaling "out" argument). Quite simply, algorithms that require holding intermediate data in memory on a single machine will simply break on sufficiently large datasets—moving from a single machine to a cluster architecture requires fundamentally different algorithms (and reimplementations).
Perhaps the most exciting aspect of MapReduce is that it represents a small step toward algorithms that behave in the ideal manner discussed above. Recall that the programming model maintains a clear separation between what computations need to occur and how those computations are actually orchestrated on a cluster. As a result, a MapReduce algorithm remains fixed, and it is the responsibility of the execution framework to execute the algorithm. Amazingly, the MapReduce programming model is simple enough that it is actually possible, in many circumstances, to approach the ideal scaling characteristics discussed above. We introduce the idea of the "tradeable machine hour", as a play on Brooks's classic title. If running an algorithm on a particular dataset takes 100 machine hours, then we should be able to finish in an hour on a cluster of 100 machines, or use a cluster of 10 machines to complete the same task in ten hours.12 With MapReduce, this isn't so far from the truth, at least for some applications.

12 Note that this idea meshes well with utility computing, where a 100-machine cluster running for one hour would cost the same as a 10-machine cluster running for ten hours.
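The arithmetic behind the tradeable machine hour is simple; the sketch below (ours, with illustrative numbers and hypothetical names, not code from the text) makes it concrete, including the footnote's utility-computing observation that total cost comes out the same either way.

```python
def wall_clock_hours(machine_hours: float, num_machines: int) -> float:
    """Under ideal linear scaling, total work (in machine hours) is
    conserved: wall-clock time is work divided by cluster size."""
    return machine_hours / num_machines

# A 100 machine-hour job under ideal scaling:
assert wall_clock_hours(100, 100) == 1.0   # 100 machines finish in 1 hour
assert wall_clock_hours(100, 10) == 10.0   # 10 machines need 10 hours

# Utility computing: with a flat hourly price per machine (a made-up
# figure here), both configurations cost exactly the same.
PRICE_PER_MACHINE_HOUR = 0.50
for machines in (100, 10):
    cost = machines * wall_clock_hours(100, machines) * PRICE_PER_MACHINE_HOUR
    print(f"{machines} machines: ${cost:.2f}")   # $50.00 in both cases
```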
1.3 WHY IS THIS DIFFERENT?

"Due to the rapidly decreasing cost of processing, memory, and communication, it has appeared inevitable for at least two decades that parallel machines will eventually displace sequential ones in computationally intensive domains. This, however, has not happened." — Leslie Valiant [104]13

13 Guess when this was written? You may be surprised.

For several decades, computer scientists have predicted that the dawn of the age of parallel computing was "right around the corner" and that sequential processing would soon fade into obsolescence (consider, for example, the above quote). Yet, until very recently, they have been wrong. The relentless progress of Moore's Law for several decades has ensured that most of the world's problems could be solved by single-processor machines, save the needs of a few (scientists simulating molecular interactions or nuclear explosions, for example). Couple that with the inherent challenges of concurrency, and the result has been that parallel processing and distributed systems have largely been confined to a small segment of the market and esoteric upper-level electives in the computer science curriculum.

However, all of that changed around the middle of the first decade of this century. The manner in which the semiconductor industry had been exploiting Moore's Law simply ran out of opportunities for improvement: faster clocks, deeper pipelines, superscalar architectures, and other tricks of the trade reached a point of diminishing returns that did not justify continued investment. This marked the beginning of an entirely new strategy and the dawn of the multi-core era [81]. Unfortunately, this radical shift in hardware architecture was not matched at that time by corresponding advances in how software could be easily designed for these new processors (but not for
lack of trying [75]). Nevertheless, parallel processing became an important issue at the forefront of everyone's mind—it represented the only way forward.

At around the same time, we witnessed the growth of large-data problems. In the late 1990s and even during the beginning of the first decade of this century, relatively few organizations had data-intensive processing needs that required large clusters: a handful of internet companies and perhaps a few dozen large corporations. But then, everything changed. Through a combination of many different factors (falling prices of disks, rise of user-generated web content, etc.), large-data problems began popping up everywhere. Data-intensive processing needs became widespread, which drove innovations in distributed computing such as MapReduce—first by Google, and then by Yahoo and the open source community. This in turn created more demand: when organizations learned about the availability of effective data analysis tools for large datasets, they began instrumenting various business processes to gather even more data—driven by the belief that more data leads to deeper insights and greater competitive advantages. Today, not only are large-data problems ubiquitous, but technological solutions for addressing them are widely accessible. Anyone can download the open source Hadoop implementation of MapReduce, pay a modest fee to rent a cluster from a utility cloud provider, and be happily processing terabytes upon terabytes of data within the week. Finally, the computer scientists are right—the age of parallel computing has begun, both in terms of multiple cores in a chip and multiple machines in a cluster.

Why is MapReduce important? In practical terms, it provides a very effective tool for tackling large-data problems. But beyond that, MapReduce is important in how it has changed the way we organize computations at a massive scale. MapReduce represents the first step away from the von Neumann model that has served as the foundation of computer science for more than half a century. Valiant called this a bridging model [104], a conceptual bridge between the physical implementation of a machine and the software that is to be executed on that machine. Until recently, the von Neumann model has served us well: hardware designers focused on efficient implementations of the von Neumann model and didn't have to think much about the actual software that would run on the machines. Similarly, the software industry developed software targeted at the model without worrying about the hardware details. The result was extraordinary growth: chip designers churned out successive generations of increasingly powerful processors, and software engineers were able to develop applications in high-level languages that exploited those processors.

Today, however, the von Neumann model isn't sufficient anymore: we can't treat a multi-core processor or a large cluster as an agglomeration of many von Neumann machine instances communicating over some interconnect. Such a view places too much burden on the software developer to effectively take advantage of available computational resources—it simply is the wrong level of abstraction. MapReduce can be viewed
as the first breakthrough in the quest for new abstractions that allow us to organize computations, not over individual machines, but over entire clusters. As Barroso puts it, the datacenter is the computer [83].

As anyone who has taken an introductory computer science course knows, abstractions manage complexity by hiding details and presenting well-defined behaviors to users of those abstractions. They are, inevitably, imperfect—making certain tasks easier but others more difficult, and sometimes impossible (in the case where the detail suppressed by the abstraction is exactly what the user cares about). This critique applies to MapReduce: it makes certain large-data problems easier, but suffers from limitations as well. This means that MapReduce is not the final word, but rather the first in a new class of models that will allow us to more effectively organize computations at a massive scale.

So if MapReduce is only the beginning, what's next beyond MapReduce? We're getting ahead of ourselves, as we can't meaningfully answer this question before thoroughly understanding what MapReduce can and cannot do well. This is exactly the purpose of this book: let us now begin our exploration.

1.4 WHAT THIS BOOK IS NOT

Actually, not quite yet... A final word before we get started. This book is about MapReduce algorithm design, particularly for text processing applications. Although our presentation most closely follows the Hadoop open-source implementation of MapReduce, this book is explicitly not about Hadoop programming. We don't, for example, discuss APIs, driver programs for composing jobs, command-line invocations for running jobs, etc. For those aspects, we refer the reader to Tom White's excellent book, "Hadoop: The Definitive Guide", published by O'Reilly [107].
CHAPTER 2

MapReduce Basics

The only feasible approach to tackling large-data problems today is to divide and conquer, a fundamental concept in computer science that is introduced very early in typical undergraduate curricula. The basic idea is to partition a large problem into smaller pieces, each of which is tackled in parallel by different workers—which may be threads in a processor core, cores in a multi-core processor, multiple processors in a machine, or many machines in a cluster. Intermediate results from each individual worker are then combined to yield the final output.1

1 We note that promising technologies such as quantum or biological computing could potentially induce a paradigm shift, but unfortunately they are far from being sufficiently mature to solve real-world problems.
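As a minimal, self-contained illustration of this pattern (our sketch, not code from the text), the fragment below counts word frequencies by partitioning the input, handing each partition to a worker process, and combining the partial tallies at the end; it uses only the Python standard library.

```python
# Divide and conquer in miniature: partition the problem, process the
# pieces in parallel, combine the intermediate results.
from collections import Counter
from multiprocessing import Pool

def count_words(lines):
    """Worker: tally word frequencies in one partition of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def parallel_word_count(lines, num_workers=4):
    # Split the large problem into smaller pieces...
    chunk = max(1, len(lines) // num_workers)
    partitions = [lines[i:i + chunk] for i in range(0, len(lines), chunk)]
    # ...tackle each piece in parallel with a different worker...
    with Pool(num_workers) as pool:
        partial_counts = pool.map(count_words, partitions)
    # ...and combine the intermediate results into the final output.
    return sum(partial_counts, Counter())

if __name__ == "__main__":
    text = ["a rose is a rose", "is a rose"]
    print(parallel_word_count(text, num_workers=2))
```

The same three steps (partition, parallel processing, combination) recur whether the workers are threads, cores, or machines; what varies is how the coordination issues listed below are handled.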
The general principles behind divide-and-conquer algorithms are broadly applicable to a wide range of problems. However, the details of their implementations are varied and complex. For example, the following are just some of the issues that need to be addressed:

• How do we split up a large problem into smaller tasks?
• How do we assign tasks to workers distributed across a potentially large number of machines?
• How do we ensure that workers get the data they need?
• How do we coordinate synchronization among the different workers?
• How do we share partial results from one worker that are needed by another?
• How do we accomplish all of the above in the face of software errors and hardware faults?

In traditional parallel or distributed programming environments, the developer needs to explicitly address all of the above issues. In shared-memory programming, the programmer needs to explicitly coordinate access to shared data structures through synchronization primitives such as mutexes (see the sketch at the end of this section), to explicitly handle process synchronization through devices such as barriers, and to remain ever vigilant for common problems such as deadlocks and race conditions. Language extensions, like OpenMP for shared memory parallelism,2 or libraries implementing the Message Passing Interface (MPI) for cluster-level parallelism,3 provide logical abstractions that hide details of operating system synchronization and communications primitives. However, even with these

2 http://www.openmp.org/
3 http://www.mcs.anl.gov/mpi/
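As a glimpse of the explicit coordination just described, here is a small shared-memory sketch (ours, not from the text): without the mutex, the concurrent increments can interleave and updates may be lost, which is precisely the kind of race condition the programmer must remain vigilant about.

```python
# Shared-memory programming in miniature: shared state must be guarded
# with a mutex, or concurrent updates can be lost.
import threading

counter = 0
lock = threading.Lock()

def worker(iterations):
    global counter
    for _ in range(iterations):
        with lock:          # remove this guard and increments can race
            counter += 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000 with the lock; possibly less without it
```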