Topic 1: MapReduce: Concepts, Principles, and Applications
Dr. Xie Lei, Department of Computer Science and Technology, Nanjing University
Outline:
1. Application background of MapReduce
2. Concepts of MapReduce
3. Principles of MapReduce
4. Implementation of MapReduce
5. Performance of MapReduce
6. References
Application Background of MapReduce - 1
• Google has implemented hundreds of special-purpose computations that process large amounts of raw data,
  – such as crawled documents, web request logs, etc.
Application Background of MapReduce - 1 (continued)
• Google's data centers compute various kinds of derived data:
  – Inverted indices
  – Various representations of the graph structure of web documents
  – Summaries of the number of pages crawled per host
  – The set of most frequent queries in a given …
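To make two of these derived-data types concrete, here is a minimal single-machine sketch in Python. It is not Google's implementation; the `crawled` dictionary, the `*.example` URLs, and the function names `build_inverted_index` and `pages_per_host` are all hypothetical, chosen only to show how simple each computation is when the data fits on one machine.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit

# Hypothetical crawled documents: URL -> page text (made-up sample data).
crawled = {
    "http://a.example/index": "map reduce cluster",
    "http://a.example/about": "cluster scheduling",
    "http://b.example/home": "map function",
}


def build_inverted_index(docs):
    """Map each word to the sorted list of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in docs.items():
        for word in text.split():
            index[word].add(url)
    return {word: sorted(urls) for word, urls in index.items()}


def pages_per_host(docs):
    """Count how many crawled pages belong to each host."""
    return Counter(urlsplit(url).netloc for url in docs)


inverted_index = build_inverted_index(crawled)
host_counts = pages_per_host(crawled)
```

Each function is a single pass over the input with a dictionary update, which is the sense in which such computations are conceptually simple.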
Application Background of MapReduce - 2
• Most such computations are conceptually straightforward. However,
  – the input data is usually large
  – and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time.
• The issues of
  – how to parallelize the computation,
  – distribute the data,
  – and handle failures
• conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.
• Data scale (from the slide figure):
  – Total data volume: 100~1000 PB
  – Data processed: 10~100 PB/day
  – Web pages: hundreds of billions to trillions
  – Index entries: tens of billions to hundreds of billions
  – Updates: billions to tens of billions per day
  – Requests: billions to tens of billions per day
  – Logs: 100 TB~1 PB/day
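The point about simple logic being buried under distribution code can be illustrated with a small Python sketch. It is only a simulation under stated assumptions: the shards are processed one after another in a single process as a stand-in for separate machines, the failure is injected artificially, and every name (`count_hosts`, `partition`, `run_with_retries`, `flaky_worker`, `distributed_count`) is hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_hosts(urls):
    """The original simple computation: pages crawled per host."""
    return Counter(urlsplit(u).netloc for u in urls)


def partition(urls, num_workers):
    """Distribute the data: split the input round-robin, one shard per worker."""
    shards = [[] for _ in range(num_workers)]
    for i, url in enumerate(urls):
        shards[i % num_workers].append(url)
    return shards


def run_with_retries(task, shard, max_attempts=3):
    """Handle failures: re-run a shard whose worker raised an error."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(shard, attempt)
        except RuntimeError:
            if attempt == max_attempts:
                raise


def flaky_worker(shard, attempt):
    """Simulated worker that always fails on its first attempt (an assumption)."""
    if attempt == 1:
        raise RuntimeError("simulated machine failure")
    return count_hosts(shard)


def distributed_count(urls, num_workers=2):
    """Parallelize: process shards independently, then merge the partial counts.

    Shards run sequentially here; a real system would run them on
    different machines.
    """
    total = Counter()
    for shard in partition(urls, num_workers):
        total += run_with_retries(flaky_worker, shard)
    return total


urls = [
    "http://a.example/1", "http://a.example/2",
    "http://b.example/1", "http://a.example/3",
]
result = distributed_count(urls)
```

Only `count_hosts` expresses the actual computation; the rest exists purely to split the data, survive failures, and merge results, and a real deployment would need far more of it.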