extensions, developers are still burdened with keeping track of how resources are made available to workers. Additionally, these frameworks are designed to scale out computational resources and have only rudimentary support for dealing with very large amounts of input data. When using existing parallel computing approaches for large-data computation, the programmer must devote a significant amount of attention to low-level system details, which detracts from higher-level problem solving.

One of the most significant advantages of MapReduce is that it provides an abstraction that hides many system-level details from the programmer. A developer can therefore focus on what computations need to be performed, as opposed to how those computations are actually carried out or how to get the data to the processes that depend on them. Like OpenMP and MPI, MapReduce provides a means to distribute computation without burdening the programmer with the details of distributed computing (but at a different level of granularity). However, organizing and coordinating large amounts of computation is only part of the challenge. Large-data processing by definition requires bringing data and code together for computation to occur: no small feat for datasets that are terabytes and perhaps petabytes in size! MapReduce addresses this challenge by providing a simple abstraction for the developer, transparently handling most of the details behind the scenes in a scalable, robust, and efficient manner. As we mentioned in Chapter 1, instead of moving large amounts of data around, it is far more efficient, if possible, to move the code to the data. This is operationally realized by spreading data across the local disks of machines in a cluster and running processes on machines that hold the data. The complex task of managing storage in such a processing environment is handled by the distributed file system that underlies MapReduce.

This chapter introduces the MapReduce programming model and the underlying distributed file system. We start in Section 2.1 with an overview of functional programming, from which MapReduce draws its inspiration. Section 2.2 introduces the basic programming model, focusing on mappers and reducers. Section 2.3 discusses the role of the execution framework in actually running MapReduce programs (called jobs). Section 2.4 fills in additional details by introducing partitioners and combiners, which provide greater control over data flow. MapReduce would not be practical without a tightly-integrated distributed file system that manages the data being processed; Section 2.5 covers this in detail. Tying everything together, a complete cluster architecture is described in Section 2.6 before the chapter ends with a summary.
Figure 2.1: Illustration of map and fold, two higher-order functions commonly used together in functional programming: map takes a function f and applies it to every element in a list, while fold iteratively applies a function g to aggregate results.

2.1 FUNCTIONAL PROGRAMMING ROOTS

MapReduce has its roots in functional programming, which is exemplified in languages such as Lisp and ML.4 A key feature of functional languages is the concept of higher-order functions, or functions that can accept other functions as arguments. Two commonly built-in higher-order functions are map and fold, illustrated in Figure 2.1. Given a list, map takes as an argument a function f (that takes a single argument) and applies it to all elements in the list (the top part of the diagram). Given a list, fold takes as arguments a function g (that takes two arguments) and an initial value: g is first applied to the initial value and the first item in the list, the result of which is stored in an intermediate variable. This intermediate variable and the next item in the list serve as the arguments to a second application of g, the result of which is stored in the intermediate variable. This process repeats until all items in the list have been consumed; fold then returns the final value of the intermediate variable. Typically, map and fold are used in combination. For example, to compute the sum of squares of a list of integers, one could map a function that squares its argument (i.e., λx.x²) over the input list, and then fold the resulting list with the addition function (more precisely, λxλy.x + y) using an initial value of zero.

4 However, there are important characteristics of MapReduce that make it non-functional in nature; this will become apparent later.
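As a concrete illustration, the sum-of-squares example can be written with Java streams, whose map and reduce operations correspond directly to the map and fold just described (this sketch and its class name are ours, not from the original text):

    import java.util.List;

    public class SumOfSquares {
        public static void main(String[] args) {
            List<Integer> xs = List.of(3, 1, 4, 1, 5);
            int sumOfSquares = xs.stream()
                    .map(x -> x * x)             // map with f = λx.x²
                    .reduce(0, (x, y) -> x + y); // fold with g = λxλy.x + y, initial value 0
            System.out.println(sumOfSquares);    // prints 52
        }
    }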
We can view map as a concise way to represent the transformation of a dataset (as defined by the function f). In the same vein, we can view fold as an aggregation operation, as defined by the function g. One immediate observation is that the application of f to each item in a list (or more generally, to elements in a large dataset) can be parallelized in a straightforward manner, since each functional application happens in isolation. In a cluster, these operations can be distributed across many different machines. The fold operation, on the other hand, has more restrictions on data locality: elements in the list must be "brought together" before the function g can be applied. However, many real-world applications do not require g to be applied to all elements of the list. To the extent that elements in the list can be divided into groups, the fold aggregations can proceed in parallel. Furthermore, for operations that are commutative and associative, significant efficiencies can be gained in the fold operation through local aggregation and appropriate reordering.

In a nutshell, we have described MapReduce. The map phase in MapReduce roughly corresponds to the map operation in functional programming, whereas the reduce phase in MapReduce roughly corresponds to the fold operation in functional programming. As we will discuss in detail shortly, the MapReduce execution framework coordinates the map and reduce phases of processing over large amounts of data on large clusters of commodity machines.

Viewed from a slightly different angle, MapReduce codifies a generic "recipe" for processing large datasets that consists of two stages. In the first stage, a user-specified computation is applied over all input records in a dataset. These operations occur in parallel and yield intermediate output that is then aggregated by another user-specified computation. The programmer defines these two types of computations, and the execution framework coordinates the actual processing (very loosely, MapReduce provides a functional abstraction). Although such a two-phase processing structure may appear to be very restrictive, many interesting algorithms can actually be expressed quite concisely, especially if one decomposes complex algorithms into a sequence of MapReduce jobs. Subsequent chapters in this book will focus on how a number of algorithms can be implemented in MapReduce.

2.2 MAPPERS AND REDUCERS

Key-value pairs form the basic data structure in MapReduce. For a collection of web pages, keys may be URLs and values may be the actual HTML content. For a graph, keys may represent nodes and values may represent the adjacency lists of those nodes (see Chapter 5 for more details). Part of the design of MapReduce algorithms involves imposing this structure on arbitrary datasets. It is not necessary for the keys to be meaningful, but keys are often used to uniquely identify input data.

In MapReduce, the programmer defines a mapper and a reducer with the following signatures:

map: (k₁, v₁) → [(k₂, v₂)]
reduce: (k₂, [v₂]) → [(k₃, v₃)]
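To make these signatures concrete, one could spell them out as generic Java interfaces. The following is purely illustrative (these interfaces are our own, not part of Hadoop or any other MapReduce implementation): a key-value pair is modeled as a Map.Entry, and the list notation [...] as a List.

    import java.util.List;
    import java.util.Map;

    // Hypothetical interfaces mirroring the signatures above.
    interface Mapper<K1, V1, K2, V2> {
        List<Map.Entry<K2, V2>> map(K1 key, V1 value);
    }

    interface Reducer<K2, V2, K3, V3> {
        List<Map.Entry<K3, V3>> reduce(K2 key, List<V2> values);
    }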
Figure 2.2: Simplified view of MapReduce. Mappers are applied to all input key-value pairs, which generate an arbitrary number of intermediate key-value pairs. Reducers are applied to all values associated with the same key. Between the map and reduce phases lies a barrier that involves a large distributed sort and group by.

The input to a MapReduce job starts as data stored on the underlying distributed file system (see Section 2.5). The mapper is applied to every input key-value pair (split across an arbitrary number of files) to generate an arbitrary number of intermediate key-value pairs (the convention [...] is used throughout this book to denote a list). The reducer is applied to all values associated with the same intermediate key to generate output key-value pairs.5 Implicit between the map and reduce phases is a distributed "group by" operation on intermediate keys. Intermediate data arrive at each reducer in order, sorted by the key. However, no ordering relationship is guaranteed for keys across different reducers. Output key-value pairs from each reducer are written persistently back onto the distributed file system (whereas intermediate key-value pairs are transient and not preserved). The output ends up in r files on the distributed file system, where r is the number of reducers. For the most part, there is no need to consolidate reducer output, since the r files often serve as input to yet another MapReduce job. Figure 2.2 illustrates this two-stage processing structure.

5 This characterization, while conceptually accurate, is a slight simplification. See Section 2.6 for more details.
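To illustrate just the data flow (not the distributed execution), here is a toy, single-process simulation of this two-stage structure; all names are ours, and nothing beyond the Java standard library is assumed. A TreeMap stands in for the sort/shuffle barrier, grouping intermediate values by key in sorted order before the reducer runs:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.SortedMap;
    import java.util.TreeMap;
    import java.util.function.BiFunction;
    import java.util.stream.Collectors;

    // Toy, single-process simulation of the MapReduce data flow.
    public class MiniMapReduce {

        public static <K1, V1, K2 extends Comparable<K2>, V2, V3> Map<K2, V3> run(
                Map<K1, V1> input,
                BiFunction<K1, V1, List<Map.Entry<K2, V2>>> mapper,
                BiFunction<K2, List<V2>, V3> reducer) {
            // "Shuffle and sort": group intermediate values by key; a TreeMap
            // keeps keys in sorted order, as a reducer would see them.
            SortedMap<K2, List<V2>> groups = new TreeMap<>();
            for (Map.Entry<K1, V1> in : input.entrySet()) {
                for (Map.Entry<K2, V2> pair : mapper.apply(in.getKey(), in.getValue())) {
                    groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                          .add(pair.getValue());
                }
            }
            // Reduce phase: one reducer call per intermediate key.
            Map<K2, V3> output = new LinkedHashMap<>();
            for (Map.Entry<K2, List<V2>> group : groups.entrySet()) {
                output.put(group.getKey(), reducer.apply(group.getKey(), group.getValue()));
            }
            return output;
        }

        public static void main(String[] args) {
            // Word count over two tiny "documents".
            Map<String, String> docs = Map.of("doc1", "a b b", "doc2", "b c");
            BiFunction<String, String, List<Map.Entry<String, Integer>>> mapper =
                    (docid, text) -> Arrays.stream(text.split("\\s+"))
                                           .map(w -> Map.entry(w, 1))
                                           .collect(Collectors.toList());
            BiFunction<String, List<Integer>, Integer> reducer =
                    (word, ones) -> ones.size();
            System.out.println(run(docs, mapper, reducer)); // {a=1, b=3, c=2}
        }
    }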
A simple word count algorithm in MapReduce is shown in Figure 2.3. This algorithm counts the number of occurrences of every word in a text collection, which may be the first step in, for example, building a unigram language model (i.e., probability distribution over words in a collection).

    class Mapper
        method Map(docid a, doc d)
            for all term t ∈ doc d do
                Emit(term t, count 1)

    class Reducer
        method Reduce(term t, counts [c₁, c₂, ...])
            sum ← 0
            for all count c ∈ counts [c₁, c₂, ...] do
                sum ← sum + c
            Emit(term t, count sum)

Figure 2.3: Pseudo-code for the word count algorithm in MapReduce. The mapper emits an intermediate key-value pair for each word in a document. The reducer sums up all counts for each word.

Input key-value pairs take the form of (docid, doc) pairs stored on the distributed file system, where the former is a unique identifier for the document and the latter is the text of the document itself. The mapper takes an input key-value pair, tokenizes the document, and emits an intermediate key-value pair for every word: the word itself serves as the key, and the integer one serves as the value (denoting that we've seen the word once). The MapReduce execution framework guarantees that all values associated with the same key will be brought together in the reducer. Therefore, in our word count algorithm, we simply need to sum up all counts (ones) associated with each word. The reducer does exactly this, and emits final key-value pairs with the word as the key and the count as the value. Final output is written to the distributed file system, one file per reducer. Words within each file will be sorted alphabetically, and each file will contain roughly the same number of words. The partitioner, which we discuss later in Section 2.4, controls the assignment of words to reducers. The output can be examined by the programmer or used as input to another MapReduce program.

To provide a bit more implementation detail: the pseudo-code provided in this book roughly mirrors how MapReduce programs are written in Hadoop, the open-source Java implementation. Mappers and reducers are objects that implement the Map and Reduce methods, respectively. In Hadoop, a mapper object is initialized for each map task (associated with a particular sequence of key-value pairs called an input split) and the Map method is called on each key-value pair by the execution framework. In configuring a MapReduce job, the programmer provides a hint on the number of map tasks to run, but the execution framework (see next section) makes the final determination based on the physical layout of the data (more details in Section 2.5).
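For reference, here is roughly what Figure 2.3 looks like in Hadoop's Java API. This sketch is patterned on the well-known Hadoop WordCount example; the class names are ours, and the driver code that configures and submits the job is omitted:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper: called once per input key-value pair (here, a byte offset and
    // one line of text); emits (word, 1) for every token.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: called once per intermediate key with all of that key's
    // values; sums the counts and emits (word, total).
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }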