Here we offer up our own thoughts and attempt to explain how cloud computing relates to MapReduce and data-intensive processing.

At the most superficial level, everything that used to be called web applications has been rebranded as “cloud applications”, which includes what we have previously called web 2.0 sites. In fact, anything running inside a browser that gathers and stores user-generated content now qualifies as an example of cloud computing. This includes social-networking services such as Facebook, video-sharing sites such as YouTube, web-based email services such as Gmail, and applications such as Google Docs. In this context, the cloud simply refers to the servers that power these sites, and user data is said to reside “in the cloud”. The accumulation of vast quantities of user data creates large-data problems, many of which are suitable for MapReduce. To give two concrete examples: a social-networking site analyzes connections in the enormous globe-spanning graph of friendships to recommend new connections, and an online email service analyzes messages and user behavior to optimize ad selection and placement. These are all large-data problems that have been tackled with MapReduce.9

9 The first example is Facebook, a well-known user of Hadoop, in exactly the manner described [46]. The second is, of course, Google, which uses MapReduce to continuously improve existing algorithms and to devise new algorithms for ad selection and placement.

Another important facet of cloud computing is what’s more precisely known as utility computing [91, 21]. As the name implies, the idea behind utility computing is to treat computing resources as a metered service, like electricity or natural gas. The idea harkens back to the days of time-sharing machines, and in truth isn’t very different from this antiquated form of computing. Under this model, a “cloud user” can dynamically provision any amount of computing resources from a “cloud provider” on demand and pay only for what is consumed. In practical terms, the user is paying for access to virtual machine instances that run a standard operating system such as Linux. Virtualization technology (e.g., [9]) is used by the cloud provider to allocate available physical resources and enforce isolation between multiple users who may be sharing the same hardware. Once one or more virtual machine instances have been provisioned, the user has full control over the resources and can use them for arbitrary computation. Virtual machines that are no longer needed are destroyed, thereby freeing up physical resources that can be redirected to other users. Resource consumption is measured in some equivalent of machine-hours, and users are charged in increments thereof.

Both users and providers benefit in the utility computing model. Users are freed from the upfront capital investments necessary to build datacenters and the substantial recurring costs of maintaining them. They also gain the important property of elasticity: as demand for computing resources grows, for example from an unpredicted spike in customers, more resources can be seamlessly allocated from the cloud without an interruption in service. As demand falls, provisioned resources can be released. Prior to the advent of utility computing, coping with unexpected spikes in demand was fraught with challenges: under-provision and run the risk of service interruptions, or over-provision and tie up precious capital in idle machines that are depreciating.
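To make the pay-per-use arithmetic concrete, the following minimal sketch compares the cost of provisioning machines on demand against maintaining a fixed cluster sized for peak load. The hourly rate, the demand profile, and the whole-hour billing granularity are hypothetical values chosen purely for illustration; they are not drawn from any particular provider’s pricing.

    import math

    # Hypothetical utility-computing rate: dollars per machine-hour,
    # billed in whole-hour increments (illustrative values only).
    RATE_PER_MACHINE_HOUR = 0.10

    def on_demand_cost(demand_by_hour):
        """Pay only for the machines actually provisioned in each hour."""
        return sum(math.ceil(machines) * RATE_PER_MACHINE_HOUR
                   for machines in demand_by_hour)

    def fixed_capacity_cost(demand_by_hour):
        """Own enough machines to cover the peak, whether busy or idle."""
        peak = max(demand_by_hour)
        return peak * len(demand_by_hour) * RATE_PER_MACHINE_HOUR

    # One day of demand with an unpredicted spike in customers at hour 18.
    demand = [10] * 18 + [200] + [10] * 5

    print(f"on demand:      ${on_demand_cost(demand):.2f}")
    print(f"fixed capacity: ${fixed_capacity_cost(demand):.2f}")

Under these made-up numbers, the on-demand strategy costs roughly a tenth of what a cluster sized for the peak would, which is the elasticity argument in miniature: capacity expands to cover the spike and is released afterwards.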
From the utility provider’s point of view, this business also makes sense because large datacenters benefit from economies of scale and can be run more efficiently than smaller datacenters. In the same way that insurance works by aggregating risk and redistributing it, utility providers aggregate the computing demands of a large number of users. Although demand may fluctuate significantly for each user, overall trends in aggregate demand should be smooth and predictable, which allows the cloud provider to adjust capacity over time with less risk of either offering too much (resulting in inefficient use of capital) or too little (resulting in unsatisfied customers). In the world of utility computing, Amazon Web Services currently leads the way and remains the dominant player, but a number of other cloud providers populate a market that is becoming increasingly crowded. Most systems are based on proprietary infrastructure, but there is at least one, Eucalyptus [78], that is available as open source. Increased competition will benefit cloud users, but what direct relevance does this have for MapReduce? The connection is quite simple: processing large amounts of data with MapReduce requires access to clusters with sufficient capacity. However, not everyone with large-data problems can afford to purchase and maintain clusters. This is where utility computing comes in: clusters of sufficient size can be provisioned only when the need arises, and users pay only as much as is required to solve their problems. This lowers the barrier to entry for data-intensive processing and makes MapReduce much more accessible.

A generalization of the utility computing concept is “everything as a service”, which is itself a new take on the age-old idea of outsourcing. A cloud provider offering customers access to virtual machine instances is said to be offering infrastructure as a service, or IaaS for short. However, this may be too low level for many users. Enter platform as a service (PaaS), which is a rebranding of what used to be called hosted services in the “pre-cloud” era. Platform is used generically to refer to any set of well-defined services on top of which users can build applications, deploy content, etc. This class of services is best exemplified by Google App Engine, which provides the backend datastore and API for anyone to build highly scalable web applications. Google maintains the infrastructure, freeing the user from having to back up, upgrade, patch, or otherwise maintain basic services such as the storage layer or the programming environment. At an even higher level, cloud providers can offer software as a service (SaaS), as exemplified by Salesforce, a leader in customer relationship management (CRM) software. Other examples include outsourcing an entire organization’s email to a third party, which is commonplace today. What does this proliferation of services have to do with MapReduce? No doubt “everything as a service” is driven by the desire for greater business efficiency, but scale and elasticity play important roles as well. The cloud allows seamless expansion of operations without the need for careful planning and supports scales that may otherwise be difficult or cost-prohibitive for an organization to achieve.
Finally, cloud services, just like MapReduce, represent the search for an appropriate level of abstraction and beneficial divisions of labor. IaaS is an abstraction over raw physical hardware: an organization might lack the capital, expertise, or interest in running datacenters, and therefore pays a cloud provider to do so on its behalf. The argument applies similarly to PaaS and SaaS. In the same vein, the MapReduce programming model is a powerful abstraction that separates the what from the how of data-intensive processing.

1.2 BIG IDEAS

Tackling large-data problems requires a distinct approach that sometimes runs counter to traditional models of computing. In this section, we discuss a number of “big ideas” behind MapReduce. To be fair, all of these ideas have been discussed in the computer science literature for some time (some for decades), and MapReduce is certainly not the first to adopt these approaches. Nevertheless, the engineers at Google deserve tremendous credit for pulling these various threads together and demonstrating the power of these ideas on a scale previously unheard of.

Scale “out”, not “up”. For data-intensive workloads, a large number of commodity low-end servers (i.e., the scaling “out” approach) is preferred over a small number of high-end servers (i.e., the scaling “up” approach). The latter approach of purchasing symmetric multi-processing (SMP) machines with a large number of processor sockets (dozens, even hundreds) and a large amount of shared memory (hundreds or even thousands of gigabytes) is not cost effective, since the costs of such machines do not scale linearly (i.e., a machine with twice as many processors is often significantly more than twice as expensive). On the other hand, the low-end server market overlaps with the high-volume desktop computing market, which has the effect of keeping prices low due to competition, interchangeable components, and economies of scale.

Barroso and Hölzle’s recent treatise on what they dubbed “warehouse-scale computers” [10] contains a thoughtful analysis of the two approaches. The Transaction Processing Performance Council (TPC) is a neutral, non-profit organization whose mission is to establish objective database benchmarks. Benchmark data submitted to that organization are probably the closest one can get to a fair “apples-to-apples” comparison of cost and performance for specific, well-defined relational processing applications. Based on TPC-C benchmark results from late 2007, a low-end server platform is about four times more cost efficient than a high-end shared-memory platform from the same vendor. Excluding storage costs, the price/performance advantage of the low-end server increases to about a factor of twelve.
What if we take into account the fact that communication between nodes in a high-end SMP machine is orders of magnitude faster than communication between nodes in a commodity network-based cluster? Since workloads today are beyond the capability of any single machine (no matter how powerful), the comparison is more accurately between a smaller cluster of high-end machines and a larger cluster of low-end machines (network communication is unavoidable in both cases). Barroso and Hölzle model these two approaches under workloads that demand more or less communication, and conclude that a cluster of low-end servers approaches the performance of the equivalent cluster of high-end servers: the small performance gap is insufficient to justify the price premium of the high-end servers. For data-intensive applications, the conclusion is clear: scaling “out” is superior to scaling “up”, and MapReduce is explicitly designed around clusters of commodity low-end servers.

Move processing to the data. In traditional high-performance computing (HPC) applications (e.g., for climate or nuclear simulations), it is commonplace for a supercomputer to have “processing nodes” and “storage nodes” linked together by a high-capacity interconnect. Many data-intensive workloads are not very processor-demanding, which means that the separation of compute and storage creates a bottleneck in the network. As an alternative to moving data around, it is more efficient to move the processing around. That is, MapReduce assumes an architecture where processors and storage (disk) are co-located. In such a setup, we can take advantage of data locality by running code on the processor directly attached to the block of data we need. The distributed file system is responsible for managing the data over which MapReduce operates.

Process data sequentially and avoid random access. Data-intensive processing by definition means that the relevant datasets are too large to fit in memory and must be held on disk. Seek times for random disk access are fundamentally limited by the mechanical nature of the devices: read heads can only move so fast, and platters can only spin so rapidly. As a result, it is desirable to avoid random data access and instead organize computations so that data is processed sequentially. A simple scenario10 poignantly illustrates the large performance gap between sequential operations and random seeks: assume a 1 terabyte database containing 10^10 100-byte records. Given reasonable assumptions about disk latency and throughput, a back-of-the-envelope calculation will show that updating 1% of the records (by accessing and then mutating each record) will take about a month on a single machine. On the other hand, if one simply reads the entire database and rewrites all the records (mutating those that need updating), the process would finish in under a work day on a single machine. Sequential data access is, literally, orders of magnitude faster than random data access.11

10 Adapted from a post by Ted Dunning on the Hadoop mailing list.
11 For more detail, Jacobs [53] provides real-world benchmarks in his discussion of large-data problems.
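The back-of-the-envelope numbers above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only; the disk parameters (roughly 10 ms per random seek and 100 MB/s of sustained sequential throughput) are assumed values typical of commodity drives of the time, not figures taken from the text.

    # Back-of-the-envelope comparison: random updates vs. a full sequential
    # rewrite of a 1 TB database holding 10**10 100-byte records.
    # Assumed (illustrative) disk parameters, not from the text:
    SEEK_SECONDS = 0.010               # ~10 ms per random access
    THROUGHPUT_BYTES_PER_SEC = 100e6   # ~100 MB/s sustained sequential transfer

    RECORD_BYTES = 100
    NUM_RECORDS = 10**10
    DB_BYTES = NUM_RECORDS * RECORD_BYTES   # 1 terabyte

    # Random access: update 1% of the records, with one seek to read each
    # record and another to write it back.
    updates = NUM_RECORDS // 100
    random_seconds = updates * 2 * SEEK_SECONDS

    # Sequential: stream the entire database once to read it and once to
    # rewrite it, mutating records along the way.
    sequential_seconds = 2 * DB_BYTES / THROUGHPUT_BYTES_PER_SEC

    print(f"random updates:     {random_seconds / 86400:.1f} days")
    print(f"sequential rewrite: {sequential_seconds / 3600:.1f} hours")

Under these assumptions the random-update strategy takes a few weeks while the sequential rewrite finishes in a few hours, consistent with the month versus work-day figures quoted above.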
The development of solid-state drives is unlikely to change this balance for at least two reasons. First, the cost differential between traditional magnetic disks and solid-state disks remains substantial: large datasets will for the most part remain on mechanical drives, at least in the near future. Second, although solid-state disks have substantially faster seek times, order-of-magnitude differences in performance between sequential and random access still remain.

MapReduce is primarily designed for batch processing over large datasets. To the extent possible, all computations are organized into long streaming operations that take advantage of the aggregate bandwidth of many disks in a cluster. Many aspects of MapReduce’s design explicitly trade latency for throughput.

Hide system-level details from the application developer. According to many guides on the practice of software engineering written by experienced industry professionals, one of the key reasons writing code is difficult is that the programmer must simultaneously keep track of many details in short-term memory, ranging from the mundane (e.g., variable names) to the sophisticated (e.g., a corner case of an algorithm that requires special treatment). This imposes a high cognitive load and requires intense concentration, which leads to a number of recommendations about a programmer’s environment (e.g., quiet office, comfortable furniture, large monitors, etc.). The challenges in writing distributed software are greatly compounded: the programmer must manage details across several threads, processes, or machines. Of course, the biggest headache in distributed programming is that code runs concurrently in unpredictable orders, accessing data in unpredictable patterns. This gives rise to race conditions, deadlocks, and other well-known problems. Programmers are taught to use low-level devices such as mutexes and to apply high-level “design patterns” such as producer–consumer queues to tackle these challenges, but the truth remains: concurrent programs are notoriously difficult to reason about and even harder to debug.

MapReduce addresses the challenges of distributed programming by providing an abstraction that isolates the developer from system-level details (e.g., locking of data structures, data starvation issues in the processing pipeline, etc.). The programming model specifies simple and well-defined interfaces between a small number of components, and therefore is easy for the programmer to reason about. MapReduce maintains a separation of what computations are to be performed and how those computations are actually carried out on a cluster of machines. The first is under the control of the programmer, while the second is exclusively the responsibility of the execution framework or “runtime”. The advantage is that the execution framework only needs to be designed once and verified for correctness; thereafter, as long as the