Data-Intensive Text Processing with MapReduce

Jimmy Lin and Chris Dyer
University of Maryland, College Park

Draft of March 7, 2010

This is the manuscript of a book that is in preparation for Morgan & Claypool Synthesis Lectures on Human Language Technologies. Anticipated publication date is mid-2010. Comments and feedback are welcome!
Contents

1 Introduction
    1.1 Computing in the Clouds
    1.2 Big Ideas
    1.3 Why is this different?
    1.4 What this book is not
2 MapReduce Basics
    2.1 Functional programming roots
    2.2 Mappers and reducers
    2.3 The Execution Framework
    2.4 Partitioners and combiners
    2.5 The distributed file system
    2.6 Hadoop Cluster Architecture
    2.7 Summary
3 MapReduce algorithm design
    3.1 Local Aggregation
    3.2 Pairs and Stripes
    3.3 Computing Relative Frequencies
    3.4 Secondary Sorting
    3.5 Relational Joins
    3.6 Summary
4 Inverted Indexing for Text Retrieval
    4.1 Inverted Indexes
    4.2 Inverted Indexing: Baseline Implementation
    4.3 Inverted Indexing: Revised Implementation
    4.4 Index Compression
        4.4.1 Byte-Aligned Codes
        4.4.2 Bit-Aligned Codes
        4.4.3 Postings Compression
    4.5 What about retrieval?
    4.6 Chapter Summary
5 Graph Algorithms
    5.1 Graph Representations
    5.2 Parallel Breadth-First Search
    5.3 PageRank
    5.4 Issues with Graph Processing
    5.5 Summary
6 EM Algorithms for Text Processing
    6.1 Expectation maximization
        6.1.1 Maximum likelihood estimation
        6.1.2 A latent variable marble game
        6.1.3 MLE with latent variables
        6.1.4 Expectation maximization
        6.1.5 An EM example
    6.2 Hidden Markov models
        6.2.1 Three questions for hidden Markov models
        6.2.2 The forward algorithm
        6.2.3 The Viterbi algorithm
        6.2.4 Parameter estimation for HMMs
        6.2.5 Forward-backward training: summary
    6.3 EM in MapReduce
        6.3.1 HMM training in MapReduce
    6.4 Case study: word alignment for statistical machine translation
        6.4.1 Statistical phrase-based translation
        6.4.2 Brief Digression: Language Modeling with MapReduce
        6.4.3 Word alignment
        6.4.4 Experiments
    6.5 EM-like algorithms
        6.5.1 Gradient-based optimization and log-linear models
    6.6 Summary
7 Closing Remarks
    7.1 Limitations of MapReduce
    7.2 Alternative Computing Paradigms
    7.3 MapReduce and Beyond
CHAPTER 1

Introduction

MapReduce [31] is a programming model for expressing distributed computations on massive amounts of data and an execution framework for large-scale data processing on clusters of commodity servers. It was originally developed by Google and built on well-known principles in parallel and distributed processing dating back several decades. MapReduce has since enjoyed widespread adoption via an open-source implementation called Hadoop (now an Apache project), whose development was led by Yahoo. Today, a vibrant software ecosystem has sprung up around Hadoop, with significant activity in both industry and academia.

This book is about scalable approaches to processing large amounts of text with MapReduce. Given this focus, it makes sense to start with the most basic question: Why? There are many answers to this question, but we focus on two. First, "big data" is a fact of the world, and therefore an issue that real-world systems must grapple with. Second, across a wide range of text processing applications, more data translates into more effective algorithms, and thus it makes sense to take advantage of the plentiful amounts of data that surround us.

Modern information societies are defined by vast repositories of data, both public and private. Therefore, any practical application must be able to scale up to datasets of interest. For many, this means scaling up to the web, or at least a non-trivial fraction thereof. Any organization built around gathering, analyzing, monitoring, filtering, searching, or organizing web content must tackle large-data problems: "web-scale" processing is practically synonymous with data-intensive processing. This observation applies not only to well-established internet companies, but also to countless startups and niche players. Just think: how many companies do you know that start their pitch with "we're going to harvest information on the web and..."?

Another strong area of growth is the analysis of user behavior data. Any operator of a moderately successful website can record user activity and in a matter of weeks (or sooner) be drowning in a torrent of log data. In fact, logging user behavior generates so much data that many organizations simply can't cope with the volume, and either turn the functionality off or throw away data after some time. This represents lost opportunities, as there is a broadly-held belief that great value lies in insights derived from mining such data. Knowing what users look at, what they click on, how much time they spend, etc. leads to better business decisions and competitive advantages. Broadly, this is known as business intelligence, which encompasses a wide range of technologies including data warehousing, data mining, and analytics.
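To make the programming model concrete before the detailed treatment in Chapter 2, here is a minimal sketch of the canonical word count example against the Hadoop (0.20-era) Java API. This listing is illustrative rather than taken from the book: class names such as TokenizerMapper are our own, and the input/output paths are supplied on the command line. The programmer writes only the two functions below; partitioning, shuffling, sorting, and grouping are handled by the execution framework.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: for each input line, emit a (word, 1) pair for every token.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(line.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE); // the framework groups these pairs by word
      }
    }
  }

  // Reducer: receives each word together with all of its 1s, and sums them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }

  // Driver: wires the mapper and reducer into a job and submits it to the cluster.
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g., an HDFS directory of text
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The division of labor here is the point of the model: the mapper and reducer contain all of the application logic, while everything between them (moving intermediate data across the cluster and grouping values by key) belongs to the framework, which is what lets the same two functions run unchanged on one machine or a thousand.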