Assigning Tasks for Efficiency in Hadoop

[Extended Abstract]

Michael J. Fischer
Computer Science, Yale University
P.O. Box 208285, New Haven, CT, USA
michael.fischer@yale.edu

Xueyuan Su ∗
Computer Science, Yale University
P.O. Box 208285, New Haven, CT, USA
xueyuan.su@yale.edu

Yitong Yin †
State Key Laboratory for Novel Software Technology
Nanjing University, China
yinyt@nju.edu.cn

∗ Supported by the Kempner Fellowship from the Department of Computer Science at Yale University.
† Supported by the National Science Foundation of China under Grant No. 60721002. This work was done when Yitong Yin was at Yale University.

ABSTRACT

In recent years Google's MapReduce has emerged as a leading large-scale data processing architecture. Adopted by companies such as Amazon, Facebook, Google, IBM and Yahoo! in daily use, and more recently put in use by several universities, it allows parallel processing of huge volumes of data over clusters of machines. Hadoop is a free Java implementation of MapReduce. In Hadoop, files are split into blocks that are replicated and spread over all servers in a network. Each job is also split into many small pieces called tasks. Several tasks are processed on a single server, and a job is not completed until all the assigned tasks are finished. A crucial factor that affects the completion time of a job is the particular assignment of tasks to servers. Given a placement of the input data over servers, one wishes to find the assignment that minimizes the completion time. In this paper, an idealized Hadoop model is proposed to investigate the Hadoop task assignment problem. It is shown that there is no feasible algorithm to find the optimal Hadoop task assignment unless P = NP. Assignments that are computed by the round-robin algorithm inspired by the current Hadoop scheduler are shown to deviate from optimum by a multiplicative factor in the worst case. A flow-based algorithm is presented that computes assignments that are optimal to within an additive constant.

Categories and Subject Descriptors

D.3.2 [Programming Languages]: Language Classifications—concurrent, distributed, and parallel languages; F.1.2 [Computation by Abstract Devices]: Modes of Computation—parallelism and concurrency; F.1.3 [Computation by Abstract Devices]: Complexity Measures and Classes—reducibility and completeness; F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems—sequencing and scheduling

General Terms

Algorithms, Performance, Theory

Keywords

task assignment, load balancing, NP-completeness, approximation algorithm, MapReduce, Hadoop

1. INTRODUCTION

1.1 Background

The cloud computing paradigm has recently received significant attention in the media. The cloud is a metaphor for the Internet, which is an abstraction for the complex infrastructure it conceals. Cloud computing refers to both the applications delivered as services over the Internet and the hardware and software that provide such services.
It envisions shifting data storage and computing power away from local servers, across the network cloud, and into large clusters of machines hosted by companies such as Amazon, Google, IBM, Microsoft, and Yahoo!.

Google's MapReduce [8, 9, 16] parallel computing architecture, for example, splits workload over large clusters of commodity PCs and enables automatic parallelization. By exploiting parallel processing, it provides a software platform that lets one easily write and run applications that process vast amounts of data.

Apache Hadoop [4] is a free Java implementation of MapReduce in the open source software community. It was originally designed to efficiently process large volumes of data by parallel processing over commodity computers in local networks. In academia, researchers have adapted Hadoop to several different architectures. For example, Ranger et al. [18] evaluate MapReduce in multi-core and multi-processor systems, Kruijf et al. [7] implement MapReduce on the Cell B.E. processor architecture, and He et al. [14] propose a MapReduce framework on graphics processors. Many related applications using Hadoop have also been developed to solve various practical problems.
1.2 The MapReduce Framework

A Hadoop system runs on top of a distributed file system, called the Hadoop Distributed File System (HDFS). HDFS usually runs on networked commodity PCs, where data are replicated and locally stored on the hard disks of each machine. To store and process huge volumes of data, HDFS typically uses a block size of 64 MB. Therefore, moving computation close to the data is a design goal in the MapReduce framework.

In the MapReduce framework, an application is specified as a set of jobs. A MapReduce job splits the input data into independent blocks, which are processed by the map tasks in parallel. Each map task processes a single block¹ consisting of some number of records. Each record in turn consists of a key/value pair. A map task applies the user-defined map function to each input key/value pair and produces intermediate key/value pairs. The framework then sorts the intermediate data and forwards them to the reduce tasks via the interconnection network. After receiving all intermediate key/value pairs with the same key, a reduce task executes the user-defined reduce function and produces the output data. Finally, these output data are written back to HDFS.

In such a framework, there is a single server, called the master, that keeps track of all jobs in the whole distributed system. The master runs a special process, called the jobtracker, that is responsible for task assignment and scheduling for the whole system. Each of the remaining servers, called the slaves, runs a process called the tasktracker. The tasktracker schedules the tasks assigned to its server in a way similar to a normal operating system.

The map task assignment is a vital part that affects the completion time of the whole job. First, a reduce task cannot begin until it has received the required intermediate data from all finished map tasks. Second, the assignment determines the location of intermediate data and the pattern of the communication traffic. Therefore, some algorithms should be in place to optimize the task assignment.

¹ Strictly speaking, a map task in Hadoop sometimes processes data that comes from two successive file blocks. This occurs because file blocks do not respect logical record boundaries, so the last logical record processed by a map task might lie partly in the current data block and partly in the succeeding block, requiring the map task to access the succeeding block in order to fetch the tail end of its last logical record.
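To make the map/sort/reduce flow just described concrete, here is a minimal single-machine sketch in Python. It is our own illustration under simplified assumptions (real Hadoop jobs are written against its Java API, and the shuffle happens across the network); all function names are ours.

```python
# A toy word-count "job": user-defined map and reduce functions, with the
# framework sorting and grouping intermediate key/value pairs by key.
from itertools import groupby
from operator import itemgetter

def map_fn(key, value):
    """User-defined map: emit (word, 1) for each word in a line of text."""
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    """User-defined reduce: sum the counts collected for one word."""
    yield key, sum(values)

def run_job(records):
    # Map phase: apply map_fn to every input key/value pair.
    intermediate = [kv for k, v in records for kv in map_fn(k, v)]
    # Shuffle/sort phase: bring pairs with the same key together.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one reduce_fn call per distinct intermediate key.
    out = {}
    for key, group in groupby(intermediate, key=itemgetter(0)):
        out.update(reduce_fn(key, (v for _, v in group)))
    return out

assert run_job([(0, 'a b a')]) == {'a': 2, 'b': 1}
```

The framework supplies everything except map_fn and reduce_fn, which is what the user-defined map and reduce functions of the preceding paragraph model.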
1.3 Related Work

Since Kuhn [15] proposed the first method for the classic assignment problem in 1955, variations of the assignment problem have been under extensive study in many areas [5]. In the classic assignment problem, there are equal numbers of jobs and persons. An assignment is a one-to-one mapping from jobs to persons. Each job introduces a cost when it is assigned to a person. An optimal assignment minimizes the total cost over all persons.

In the area of parallel and distributed computing, when jobs are processed in parallel over several machines, one is interested in minimizing the maximum processing time of any machine. This problem is sometimes called the minimum makespan scheduling problem and is in general known to be NP-complete [11]. Under the identical-machine model, there are some well-known approximation algorithms. For example, Graham [12] proposed a (2 − 1/n)-approximation algorithm in 1966, where n is the total number of machines, and another 4/3-approximation algorithm in 1969 [13]. However, under the unrelated-machine model, this problem is known to be APX-hard, both in terms of its offline [17] and online [1, 2] approximability.

As some researchers [3, 4] pointed out, the scheduling mechanisms and policies that assign tasks to servers within the MapReduce framework can have a profound effect on efficiency. An early version of Hadoop uses a simple heuristic algorithm that greedily exploits data locality. Zaharia, Konwinski and Joseph [19] proposed some heuristic refinements based on experimental results.

1.4 Our Contributions

We investigate task assignment in Hadoop. In Section 2, we propose an idealized Hadoop model to evaluate the cost of task assignments. Based on this model, we show in Section 3 that there is no feasible algorithm to find the optimal assignment unless P = NP. In Section 4, we show that task assignments computed by a simple greedy round-robin algorithm might deviate from the optimum by a multiplicative factor. In Section 5, we present an algorithm that employs maximum flow and increasing threshold techniques to compute task assignments that are optimal to within an additive constant.

2. PROBLEM FORMALIZATION

Definition 1. A Map-Reduce schema (MR-schema) is a pair (T, S), where T is a set of tasks and S is a set of servers. Let m = |T| and n = |S|. A task assignment is a function $A: T \to S$ that assigns each task t to a server A(t).² Let $\mathcal{A} = \{T \to S\}$ be the set of all possible task assignments. An MR-system is a triple (T, S, w), where (T, S) is an MR-schema and $w : T \times \mathcal{A} \to \mathbb{Q}^+$ is a cost function.

² In an MR-schema, it is common that |T| ≥ |S|. Therefore in this paper, unlike the classic assignment problem, where an assignment refers to a one-to-one mapping or a permutation [5, 15], we instead use the notion of a many-to-one mapping.

Intuitively, w(t, A) is the time to perform task t on server A(t) in the context of the complete assignment A. The motivation for this level of generality is that the time to execute a task t in Hadoop depends not only on the task and the server speed, but also on possible network congestion, which in turn is influenced by the other tasks running on the cluster.

Definition 2. The load of server s under assignment A is defined as $L^A_s = \sum_{t:A(t)=s} w(t, A)$. The maximum load under assignment A is defined as $L^A = \max_s L^A_s$. The total load under assignment A is defined as $H^A = \sum_s L^A_s$.

An MR-system models a cloud computer where all servers work in parallel. Tasks assigned to the same server are processed sequentially, whereas tasks assigned to different servers run in parallel. Thus, the total completion time of the cloud under task assignment A is given by the maximum load $L^A$.
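As a concrete reading of Definition 2, the following Python sketch (our own names, not from the paper) computes the per-server loads, the maximum load $L^A$, and the total load $H^A$ for an arbitrary cost function w.

```python
# Loads under an assignment A, per Definition 2.
def loads(tasks, servers, A, w):
    """A maps each task to a server; w(t, A) is the cost of task t under A."""
    L = {s: 0 for s in servers}
    for t in tasks:
        L[A[t]] += w(t, A)          # L^A_s = sum of w(t, A) over t with A(t) = s
    max_load = max(L.values())      # L^A: the completion time of the job
    total_load = sum(L.values())    # H^A
    return L, max_load, total_load

# Toy instance: three tasks on two servers, unit cost per task.
tasks, servers = ['t1', 't2', 't3'], ['s1', 's2']
A = {'t1': 's1', 't2': 's1', 't3': 's2'}
L, Lmax, H = loads(tasks, servers, A, lambda t, A: 1)
assert Lmax == 2 and H == 3
```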
Our notion of an MR-system is very general and admits The definition of remote cost under a partial assignment B arbitrary cost functions.To usefully model Hadoop as an is pessimistic.It assumes that tasks not assigned by B will MR-system,we need a realistic but simplified cost model. eventually become remote,and each remote task will even- In Hadoop,the cost of a map task depends frequently on tually have costrem(+).This definition agrees with the location of its data.If the data is on the server's local the definition of remote cost under a complete assignment A. disk,then the cost (execution time)is considerably lower because u4 =0 and thus wrem =Wrem(rA+u4)=Wrem(rA) than if the data is located remotely and must be fetched Since p is encoded by mn bits,wloe is encoded by one across the network before being processed. rational number,and wrem()is encoded by m+1 ratio- We make several simplifying assumptions.We assume nal numbers,the Hadoop cost function w(p,wloe,Wrem())is that all tasks and all servers are identical,so that for any encoded by mn bits plus m+2 rational numbers. particular assignment of tasks to servers,all tasks whose data is locally available take the same amount of time wloe, Definition 8.A Hadoop MR-system (HMR-system)is the and all tasks whose data is remote take the same amount MR-system (T,S,w),where w is the Hadoop cost function of time wrem.However,we do not assume that wrem is con- with parameters p,wloc,and wrem().A HMR-system is stant over all assignments.Rather,we let it grow with the defined by (T,S,p,whoe,Wrem()). total number of tasks whose data is remote.This reflects the increased data fetch time due to overall network con- gestion.Thus,wrem(r)is the cost of each remote task in Problem 1 Hadoop Task Assignment Problem (HTA) every assignment with exactly r remote tasks.We assume that wrem(r)>wioe for all r and that wrem(r)is (weakly) 1.Instance:An HMR-system (T,S,p,whoc,Wrem()). monotone increasing in r. We formalize these concepts below.In each of the follow- 2.Objective:Find an assignment A that minimizes L4. ing,(T,S)is an MR-schema. Definition 3.A data placement is a relation p C T x S Sometimes the cost of running a task on a server only such that for every task t E T,there exists at least one server depends on the placement relation and its data locality,but sES such that p(t,s)holds. not on the assignment of other tasks. The placement relation describes where the input data blocks are placed.If p(t,s)holds,then server s locally stores Definition 9.A Hadoop cost function w is called uniform if wrem(r)=c for some constant c and all r E N.A a replica of the data block that task t needs. uniform HMR-system (UHMR-system)is an HMR-system Definition 4.We represent the placement relation p by an (T,S,p,wloe,Wrem()),where w is uniform. unweighted bipartite graph,called the placement graph.In the placement graph Ge =((T,S),E),T consists of m task nodes and S consists of n server nodes.There is an edge Problem 2 Uniform Hadoop Task Assignment Problem (t,s)E iff p(t,s)holds. (UHTA) Definition 5.A partial assignment a is a partial function 1.Instance:A UHMR-system (T,S,p,wioe,Wrem()) from T to S.We regard a partial assignment as a set of 2.Objective:Find an assignment A that minimizes L4. 
ordered pairs with pairwise distinct first elements,so for partial assignments B and o,B a means B ertends a.If s ES,the restriction of a to s is the partial assignment als=an(T x {s}).Thus,als agrees with a for those tasks The number of replicas of each data block may be that a assigns to s,but all other tasks are unassigned in l. bounded,often by a small number such as 2 or 3. Definition 6.Let p be a data placement and 3 be a partial Definition 10.Call a placement graph G=((T,S),E)j- assignment.A task t E T is local in B if B(t)is defined and replica-bounded if the degree of t is at most j for all t E T. p(t,B(t)).A task tT is remote in a if B(t)is defined A j-replica-bounded-UHMR-system (j-UHMR-system)is a and p(t,B(t)).Otherwise t is unassigned in B.Let e UHMR-system(T,S,p,wloc,Wrem()),where Ge is j-replica- r3 and u3 be the number of local tasks,remote tasks,and bounded. unassigned tasks in B,respectively.For any s E S,let es be the number of local tasks assigned to s by B.Let k= maxs∈sg. Problem 3 j-Uniform Hadoop Task Assignment Problem (j-UHTA) Definition 7.Let p be a data placement,B be a partial assignment,hoe∈Q+,and Wrem:N一Q+such that wioe≤ 1.Instance:A j-UHMR-system (T,S,p,Wioe,Wrem()). wrem(O)≤rem(1)≤wrem(②).…Let wrem=urem(r3 2.Objective:Find an assignment A that minimizes L4. u).The Hadoop cost function with parameters p,wloe, and wrem()is the function w defined by w(t,B)= 了oe if t is local in B, 3.HARDNESS OF TASK ASSIGNMENT wrem otherwise. In this section,we analyze the hardness of the various We call p the placement of w,and wloe and wrem()the local HTA optimization problems by showing the corresponding and remote costs of w,respectively.Let K=k.woc. decision problems to be AP-complete
Our notion of an MR-system is very general and admits arbitrary cost functions. To usefully model Hadoop as an MR-system, we need a realistic but simplified cost model. In Hadoop, the cost of a map task depends heavily on the location of its data. If the data is on the server's local disk, then the cost (execution time) is considerably lower than if the data is located remotely and must be fetched across the network before being processed.

We make several simplifying assumptions. We assume that all tasks and all servers are identical, so that for any particular assignment of tasks to servers, all tasks whose data is locally available take the same amount of time $w_{loc}$, and all tasks whose data is remote take the same amount of time $w_{rem}$. However, we do not assume that $w_{rem}$ is constant over all assignments. Rather, we let it grow with the total number of tasks whose data is remote. This reflects the increased data fetch time due to overall network congestion. Thus, $w_{rem}(r)$ is the cost of each remote task in every assignment with exactly r remote tasks. We assume that $w_{rem}(r) \geq w_{loc}$ for all r and that $w_{rem}(r)$ is (weakly) monotone increasing in r.

We formalize these concepts below. In each of the following, (T, S) is an MR-schema.

Definition 3. A data placement is a relation $\rho \subseteq T \times S$ such that for every task $t \in T$, there exists at least one server $s \in S$ such that $\rho(t, s)$ holds.

The placement relation describes where the input data blocks are placed. If $\rho(t, s)$ holds, then server s locally stores a replica of the data block that task t needs.

Definition 4. We represent the placement relation ρ by an unweighted bipartite graph, called the placement graph. In the placement graph $G_\rho = ((T, S), E)$, T consists of m task nodes and S consists of n server nodes. There is an edge $(t, s) \in E$ iff $\rho(t, s)$ holds.

Definition 5. A partial assignment α is a partial function from T to S. We regard a partial assignment as a set of ordered pairs with pairwise distinct first elements, so for partial assignments β and α, $\beta \supseteq \alpha$ means β extends α. If $s \in S$, the restriction of α to s is the partial assignment $\alpha|_s = \alpha \cap (T \times \{s\})$. Thus, $\alpha|_s$ agrees with α for those tasks that α assigns to s, but all other tasks are unassigned in $\alpha|_s$.

Definition 6. Let ρ be a data placement and β be a partial assignment. A task $t \in T$ is local in β if $\beta(t)$ is defined and $\rho(t, \beta(t))$ holds. A task $t \in T$ is remote in β if $\beta(t)$ is defined and $\neg\rho(t, \beta(t))$. Otherwise t is unassigned in β. Let $\ell^\beta$, $r^\beta$ and $u^\beta$ be the numbers of local tasks, remote tasks, and unassigned tasks in β, respectively. For any $s \in S$, let $\ell^\beta_s$ be the number of local tasks assigned to s by β. Let $k^\beta = \max_{s \in S} \ell^\beta_s$.

Definition 7. Let ρ be a data placement, β be a partial assignment, $w_{loc} \in \mathbb{Q}^+$, and $w_{rem} : \mathbb{N} \to \mathbb{Q}^+$ be such that $w_{loc} \leq w_{rem}(0) \leq w_{rem}(1) \leq w_{rem}(2) \leq \cdots$. Let $w^\beta_{rem} = w_{rem}(r^\beta + u^\beta)$. The Hadoop cost function with parameters ρ, $w_{loc}$, and $w_{rem}(\cdot)$ is the function w defined by
$$
w(t, \beta) =
\begin{cases}
w_{loc} & \text{if } t \text{ is local in } \beta, \\
w^{\beta}_{rem} & \text{otherwise.}
\end{cases}
$$
We call ρ the placement of w, and $w_{loc}$ and $w_{rem}(\cdot)$ the local and remote costs of w, respectively. Let $K^\beta = k^\beta \cdot w_{loc}$.

The definition of the remote cost under a partial assignment β is pessimistic. It assumes that tasks not assigned by β will eventually become remote, and that each remote task will eventually have cost $w_{rem}(r^\beta + u^\beta)$. This definition agrees with the definition of the remote cost under a complete assignment A, because $u^A = 0$ and thus $w^A_{rem} = w_{rem}(r^A + u^A) = w_{rem}(r^A)$.
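Definitions 6 and 7 translate directly into code. The sketch below is our own encoding (a placement ρ as a set of (task, server) pairs, a partial assignment β as a dict), an illustration rather than code from the paper.

```python
# The Hadoop cost function w(t, beta) of Definition 7.
def hadoop_cost(t, beta, rho, tasks, w_loc, w_rem):
    """Return w_loc for a local task, else the pessimistic remote cost
    w_rem(r^beta + u^beta), which also covers unassigned tasks."""
    r = sum(1 for x in tasks if x in beta and (x, beta[x]) not in rho)
    u = sum(1 for x in tasks if x not in beta)
    if t in beta and (t, beta[t]) in rho:
        return w_loc                # t is local in beta
    return w_rem(r + u)             # t is remote or unassigned in beta

# For a complete assignment A we have u^A = 0, so w^A_rem = w_rem(r^A).
tasks = ['t1', 't2']
rho = {('t1', 's1'), ('t2', 's2')}
beta = {'t1': 's1'}                 # t2 unassigned: r^beta = 0, u^beta = 1
assert hadoop_cost('t1', beta, rho, tasks, 1, lambda r: 3) == 1
assert hadoop_cost('t2', beta, rho, tasks, 1, lambda r: 3) == 3
```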
Since ρ is encoded by mn bits, $w_{loc}$ is encoded by one rational number, and $w_{rem}(\cdot)$ is encoded by m + 1 rational numbers, the Hadoop cost function $w(\rho, w_{loc}, w_{rem}(\cdot))$ is encoded by mn bits plus m + 2 rational numbers.

Definition 8. A Hadoop MR-system (HMR-system) is the MR-system (T, S, w), where w is the Hadoop cost function with parameters ρ, $w_{loc}$, and $w_{rem}(\cdot)$. An HMR-system is thus specified by $(T, S, \rho, w_{loc}, w_{rem}(\cdot))$.

Problem 1 Hadoop Task Assignment Problem (HTA)
1. Instance: An HMR-system $(T, S, \rho, w_{loc}, w_{rem}(\cdot))$.
2. Objective: Find an assignment A that minimizes $L^A$.

Sometimes the cost of running a task on a server depends only on the placement relation and its data locality, but not on the assignment of the other tasks.

Definition 9. A Hadoop cost function w is called uniform if $w_{rem}(r) = c$ for some constant c and all $r \in \mathbb{N}$. A uniform HMR-system (UHMR-system) is an HMR-system $(T, S, \rho, w_{loc}, w_{rem}(\cdot))$ where w is uniform.

Problem 2 Uniform Hadoop Task Assignment Problem (UHTA)
1. Instance: A UHMR-system $(T, S, \rho, w_{loc}, w_{rem}(\cdot))$.
2. Objective: Find an assignment A that minimizes $L^A$.

The number of replicas of each data block may be bounded, often by a small number such as 2 or 3.

Definition 10. Call a placement graph G = ((T, S), E) j-replica-bounded if the degree of t is at most j for all $t \in T$. A j-replica-bounded UHMR-system (j-UHMR-system) is a UHMR-system $(T, S, \rho, w_{loc}, w_{rem}(\cdot))$ where $G_\rho$ is j-replica-bounded.

Problem 3 j-Uniform Hadoop Task Assignment Problem (j-UHTA)
1. Instance: A j-UHMR-system $(T, S, \rho, w_{loc}, w_{rem}(\cdot))$.
2. Objective: Find an assignment A that minimizes $L^A$.

3. HARDNESS OF TASK ASSIGNMENT

In this section, we analyze the hardness of the various HTA optimization problems by showing the corresponding decision problems to be NP-complete.
3.1 Task Assignment Decision Problems

Definition 11. Given a server capacity k, a task assignment A is k-feasible if $L^A \leq k$. An HMR-system is k-admissible if there exists a k-feasible task assignment.

The decision problem corresponding to a class of HMR-systems and capacity k asks whether a given HMR-system in the class is k-admissible. Thus, the k-HTA problem asks about arbitrary HMR-systems, the k-UHTA problem asks about arbitrary UHMR-systems, and the k-j-UHTA problem (which we write (j, k)-UHTA) asks about arbitrary j-UHMR-systems.

3.2 NP-completeness of (2,3)-UHTA

The (2,3)-UHTA problem is a very restricted subclass of the general k-admissibility problem for HMR-systems. In this section, we restrict even further by taking $w_{loc} = 1$ and $w_{rem} = 3$. This problem represents a simple scenario where the cost function assumes only the two possible values 1 and 3, each data block has at most 2 replicas, and each server has capacity 3. Despite its obvious simplicity, we show that (2,3)-UHTA is NP-complete. It follows that all of the less restrictive decision problems are also NP-complete, and the corresponding optimization problems do not have feasible solutions unless P = NP.

Theorem 3.1. (2,3)-UHTA with costs $w_{loc} = 1$ and $w_{rem} = 3$ is NP-complete.

The proof method is to construct a polynomial-time reduction from 3SAT to (2,3)-UHTA. Let $\mathcal{G}$ be the set of all 2-replica-bounded placement graphs. Given $G_\rho \in \mathcal{G}$, we define the HMR-system $M_G = (T, S, \rho, w_{loc}, w_{rem}(\cdot))$, where $w_{loc} = 1$ and $w_{rem}(r) = 3$ for all r. We say that a placement graph G is 3-admissible if $M_G$ is 3-admissible. We construct a polynomial-time computable mapping $f : \mathrm{3CNF} \to \mathcal{G}$ and show that a 3CNF formula φ is satisfiable iff f(φ) is 3-admissible. We shorten "3-admissible" to "admissible" in the following discussion.

We first describe the construction of f. Let $\phi = C_1 \wedge C_2 \wedge \cdots \wedge C_\alpha$ be a 3CNF formula, where each $C_u = (l_{u1} \vee l_{u2} \vee l_{u3})$ is a clause and each $l_{uv}$ is a literal. Let $x_1, \ldots, x_\beta$ be the variables that appear in φ. Therefore, φ contains exactly 3α instances of literals, each of which is either $x_i$ or $\neg x_i$, where $i \in [1, \beta]$.³ Let ω be the maximum number of occurrences of any literal in φ. Table 1 summarizes the parameters of φ.

³ The notation [a, b] in our discussion represents the set of integers $\{a, a+1, \ldots, b-1, b\}$.

Table 1: Parameters of the 3CNF φ
  clauses ($C_u$): α
  variables ($x_i$): β
  literals ($l_{uv}$): 3α
  max occurrences of any literal: ω

For example, in $\phi = (x_1 \vee x_2 \vee x_3) \wedge (x_1 \vee \neg x_4 \vee x_5) \wedge (\neg x_1 \vee x_4 \vee \neg x_6)$, we have α = 3, β = 6, and ω = 2, since $x_1$ occurs twice.

Given φ, we construct the corresponding placement graph G, which comprises several disjoint copies of the three types of gadget described below, connected together with additional edges.

The first type of gadget is called a clause gadget. Each clause gadget u contains a clause server $C_u$, three literal tasks $l_{u1}, l_{u2}, l_{u3}$, and an auxiliary task $a_u$. There is an edge between each of these tasks and the clause server. Since φ contains α clauses, G contains α clause gadgets. Thus, G contains α clause servers, 3α literal tasks and α auxiliary tasks. Figure 1 describes the structure of the u-th clause gadget. We use circles and boxes to represent tasks and servers, respectively.

Figure 1: The structure of the u-th clause gadget.

The second type of gadget is called a variable gadget. Each variable gadget contains 2ω ring servers placed around a circle. Let $R^{(i)}_j$ denote the server at position $j \in [1, 2\omega]$ in ring i. Define the set $T_i$ to be the servers in odd-numbered positions.
Similarly, define the set $F_i$ to be the servers in even-numbered positions. Between each pair of ring servers $R^{(i)}_j$ and $R^{(i)}_{j+1}$, we place a ring task $r^{(i)}_j$ connected to its two neighboring servers. To complete the circle, $r^{(i)}_{2\omega}$ is connected to $R^{(i)}_{2\omega}$ and $R^{(i)}_1$. There are also ω variable tasks $v^{(i)}_j$, $j \in [1, \omega]$, in ring i, but they do not connect to any ring server. Since φ contains β variables, G contains β variable gadgets. Thus, G contains 2βω ring servers, 2βω ring tasks and βω variable tasks. Figure 2 describes the structure of the i-th variable gadget.

Figure 2: The structure of the i-th variable gadget.

The third type of gadget is called a sink gadget. The sink gadget contains a sink server P and three sink tasks $p_1, p_2, p_3$. Each sink task is connected to the sink server. G contains only one sink gadget. Figure 3 describes the structure of the sink gadget.

Figure 3: The structure of the sink gadget.

There are also some inter-gadget edges in G. We connect each variable task $v^{(i)}_j$ to the sink server P. We also connect each literal task $l_{uv}$ to a unique ring server $R^{(i)}_j$. To be more precise, if literal $l_{uv}$ is the j-th occurrence of $x_i$ in φ, we connect the literal task $l_{uv}$ to ring server $R^{(i)}_{2j-1} \in T_i$; if literal $l_{uv}$ is the j-th occurrence of $\neg x_i$ in φ, we connect the literal task $l_{uv}$ to ring server $R^{(i)}_{2j} \in F_i$. These inter-gadget edges complete the graph G. Table 2 summarizes the parameters of G.

Table 2: Parameters of the HMR-graph G
  clause servers ($C_u$): α
  literal tasks ($l_{uv}$): 3α
  auxiliary tasks ($a_u$): α
  ring servers ($R^{(i)}_j$): 2βω
  ring tasks ($r^{(i)}_j$): 2βω
  variable tasks ($v^{(i)}_j$): βω
  sink server (P): 1
  sink tasks ($p_j$): 3
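To make the reduction concrete, the following Python sketch (our own illustration with hypothetical names; the paper gives no code) builds the placement relation ρ of G = f(φ), representing a formula as lists of signed variable indices, DIMACS-style.

```python
from collections import defaultdict

def build_placement_graph(phi, beta):
    """Build rho for G = f(phi) as {task: [servers]}; every task node
    ends up with degree at most 2, as Lemma 3.2 below requires."""
    occ = defaultdict(int)                        # omega = max occurrences
    for clause in phi:                            # of any literal in phi
        for lit in clause:
            occ[lit] += 1
    omega = max(occ.values())

    rho = defaultdict(list)
    seen = defaultdict(int)                       # per-literal occurrence counter
    for u, clause in enumerate(phi, start=1):     # clause gadgets
        rho[('a', u)].append(('C', u))            # auxiliary task a_u
        for v, lit in enumerate(clause, start=1):
            rho[('l', u, v)].append(('C', u))     # literal task l_uv
            seen[lit] += 1
            i, j = abs(lit), seen[lit]            # j-th occurrence of x_i / -x_i
            pos = 2 * j - 1 if lit > 0 else 2 * j # odd -> T_i, even -> F_i
            rho[('l', u, v)].append(('R', i, pos))
    for i in range(1, beta + 1):                  # variable gadgets
        for j in range(1, 2 * omega + 1):
            nxt = j % (2 * omega) + 1             # ring wraps around to R_1
            rho[('r', i, j)] += [('R', i, j), ('R', i, nxt)]
        for j in range(1, omega + 1):
            rho[('v', i, j)].append('P')          # variable tasks -> sink server
    for j in (1, 2, 3):                           # sink gadget
        rho[('p', j)].append('P')
    return rho, omega

# The paper's example: phi = (x1 v x2 v x3)(x1 v -x4 v x5)(-x1 v x4 v -x6).
phi = [[1, 2, 3], [1, -4, 5], [-1, 4, -6]]
rho, omega = build_placement_graph(phi, beta=6)
assert omega == 2 and all(len(srv) <= 2 for srv in rho.values())
```

On this running example the sketch confirms ω = 2 and the 2-replica bound, and the task counts match Table 2: 3α = 9 literal tasks, 2βω = 24 ring tasks, βω = 12 variable tasks, plus α = 3 auxiliary and 3 sink tasks.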
Lemma 3.2. For any $\phi \in \mathrm{3CNF}$, the graph f(φ) is 2-replica-bounded.

Proof. We count the number of edges from each task node in f(φ). Each literal task has 2 edges, each auxiliary task has 1 edge, each ring task has 2 edges, each variable task has 1 edge, and each sink task has 1 edge. Therefore, f(φ) is 2-replica-bounded.

The following lemma is immediate.

Lemma 3.3. The mapping $f : \mathrm{3CNF} \to \mathcal{G}$ is polynomial-time computable.

Lemma 3.4. If φ is satisfiable, then G = f(φ) is admissible.

Proof. Let σ be a satisfying truth assignment for φ; we construct a feasible assignment A in G = f(φ). First of all, assign each sink task to the sink server, i.e., let $A(p_i) = P$ for all $i \in [1, 3]$. Then assign each auxiliary task $a_u$ to the clause server $C_u$, i.e., let $A(a_u) = C_u$ for all $u \in [1, \alpha]$. If $\sigma(x_i) = \mathrm{true}$, then assign the ring tasks $r^{(i)}_j$, $j \in [1, 2\omega]$, to ring servers in $T_i$ and the variable tasks $v^{(i)}_j$, $j \in [1, \omega]$, to ring servers in $F_i$. If $\sigma(x_i) = \mathrm{false}$, then assign the ring tasks $r^{(i)}_j$, $j \in [1, 2\omega]$, to ring servers in $F_i$ and the variable tasks $v^{(i)}_j$, $j \in [1, \omega]$, to ring servers in $T_i$. If literal $l_{uv} = x_i$ and $\sigma(x_i) = \mathrm{true}$, then assign task $l_{uv}$ to its local ring server in $T_i$. If literal $l_{uv} = \neg x_i$ and $\sigma(x_i) = \mathrm{false}$, then assign task $l_{uv}$ to its local ring server in $F_i$. Otherwise, assign task $l_{uv}$ to its local clause server $C_u$.

We now check that this task assignment is feasible. Each ring server is assigned either at most three local tasks (two ring tasks and one literal task) or one remote variable task. In either case, the load does not exceed the capacity 3. The number of tasks assigned to each clause server $C_u$ is exactly the number of false literals in $C_u$ under σ plus one (the auxiliary task), and each such task is local to $C_u$. Thus, the load is at most 3. The sink server is assigned three local sink tasks, and its load is exactly 3. Therefore, all constraints are satisfied and A is feasible. This completes the proof of Lemma 3.4.
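The construction in the proof of Lemma 3.4 mechanizes easily. Continuing the encoding of the earlier sketch (again our own illustration, not the authors' code), the code below builds A from a satisfying truth assignment σ; each ring task goes to its unique neighboring server on the chosen side of the ring, and each variable task goes, remotely, to a distinct server on the opposite side.

```python
from collections import Counter, defaultdict

def lemma_3_4_assignment(phi, beta, sigma):
    """Build the feasible assignment A of Lemma 3.4 from a satisfying
    truth assignment sigma: {variable index: bool}."""
    occ = defaultdict(int)
    for clause in phi:
        for lit in clause:
            occ[lit] += 1
    omega = max(occ.values())

    A = {('p', j): 'P' for j in (1, 2, 3)}          # sink tasks -> sink server
    seen = defaultdict(int)
    for u, clause in enumerate(phi, start=1):
        A[('a', u)] = ('C', u)                      # auxiliary task -> clause server
        for v, lit in enumerate(clause, start=1):
            seen[lit] += 1
            i, j = abs(lit), seen[lit]
            if (lit > 0) == sigma[i]:               # true literal: go to its
                pos = 2 * j - 1 if lit > 0 else 2 * j   # local ring server
                A[('l', u, v)] = ('R', i, pos)
            else:                                   # false literal: stay local
                A[('l', u, v)] = ('C', u)           # at the clause server
    for i in range(1, beta + 1):
        off = 0 if sigma[i] else 1                  # 0: ring tasks to T_i (odd),
        for j in range(1, 2 * omega + 1):           # 1: ring tasks to F_i (even)
            target = j if j % 2 != off else j % (2 * omega) + 1
            A[('r', i, j)] = ('R', i, target)
        for j in range(1, omega + 1):               # variable tasks go remotely to
            A[('v', i, j)] = ('R', i, 2 * j - off)  # distinct opposite-side servers
    return A

phi = [[1, 2, 3], [1, -4, 5], [-1, 4, -6]]
sigma = {1: True, 2: False, 3: False, 4: False, 5: False, 6: False}
A = lemma_3_4_assignment(phi, beta=6, sigma=sigma)
# Necessary condition for 3-feasibility here: no server gets more than 3 tasks.
assert max(Counter(A.values()).values()) <= 3
```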
The proof of the converse of Lemma 3.4 is more involved. The method is as follows: given a feasible assignment A in G = f(φ), we first construct a feasible assignment B in G such that $B(t) \neq P$ for all $t \in T - \{p_1, p_2, p_3\}$. Then we remove the sink tasks and the sink server from further consideration and consider the resulting graph $G'$. After that, we partition $G'$ into two subgraphs and construct a feasible assignment $B'$ such that no tasks from one partition are remotely assigned to servers in the other partition. This step involves a case analysis. Finally, a natural way of constructing the satisfying truth assignment for φ follows.

Lemma 3.5. Let A be a feasible task assignment. Then there exists a feasible task assignment B such that $B(t) \neq P$ for all $t \in T - \{p_1, p_2, p_3\}$.

Proof. If A already satisfies $A(t) \neq P$ for all $t \in T - \{p_1, p_2, p_3\}$, let B = A. Otherwise, assume there exists a task $t'$ such that $A(t') = P$ and $t' \in T - \{p_1, p_2, p_3\}$. Since the capacity of P is 3, at least one sink task, say $p_1$, is not assigned to P. Let $A(p_1) = Q$. Since $\rho(p_1, Q)$ does not hold, $p_1$ is remote on Q; hence Q has been assigned only $p_1$, and $L^A_Q = 3$. Let $B(p_1) = P$ and $B(t') = Q$. Repeat the same process for all tasks other than $p_1, p_2, p_3$ that are assigned to P in A. Then let B(t) = A(t) for the remaining tasks $t \in T$. To see that B is feasible, note that $L^B_s \leq L^A_s \leq 3$ for all servers $s \in S$.

Let $G'$ be the subgraph induced by $(T - \{p_1, p_2, p_3\}, S - \{P\}) = (T', S')$. We have the following lemma.

Lemma 3.6. Let A be a feasible task assignment in G. Then there exists a feasible task assignment $A'$ in $G'$.

Proof. Given A, Lemma 3.5 tells us that there exists another feasible assignment B in G such that $B(t) \neq P$ for all $t \in T'$. Let $A'(t) = B(t)$ for all $t \in T'$. Then $A'$ is an assignment in $G'$, since $A'(t) \in S - \{P\}$ for all $t \in T'$. To see that $A'$ is feasible, note that $L^{A'}_s \leq L^B_s \leq 3$ for all servers $s \in S'$.

We further partition $G'$ into two subgraphs $G_C$ and $G_R$. $G_C$ is induced by the nodes $\{C_u : u \in [1, \alpha]\} \cup \{a_u : u \in [1, \alpha]\} \cup \{l_{uv} : u \in [1, \alpha], v \in [1, 3]\}$, and $G_R$ is induced by the nodes $\{R^{(i)}_j : i \in [1, \beta], j \in [1, 2\omega]\} \cup \{r^{(i)}_j : i \in [1, \beta], j \in [1, 2\omega]\} \cup \{v^{(i)}_j : i \in [1, \beta], j \in [1, \omega]\}$. In other words, $G_C$ consists of all clause gadgets, while $G_R$ consists of all variable gadgets. If a task in one partition is remotely assigned to a server in the other partition, we call this task a cross-boundary task. Let $n^A_c$ be the number of cross-boundary tasks that are in $G_C$ and assigned to servers in $G_R$ by A, and let $n^A_r$ be the number of cross-boundary tasks that are in $G_R$ and assigned to servers in $G_C$ by A. We have the following lemmas.

Lemma 3.7. Let A be a feasible assignment in $G'$ such that $n^A_c > 0$ and $n^A_r > 0$. Then there exists a feasible assignment B in $G'$ such that one of $n^B_c$ and $n^B_r$ equals $|n^A_c - n^A_r|$ and the other equals 0.