Source: Ricardo Guimarães Herrmann
Design Principles of Hadoop
• Automatic parallelization & distribution
  • Hidden from the end-user
• Fault tolerance and automatic recovery
  • Nodes/tasks will fail and will recover automatically
• Clean and simple programming abstraction
  • Users only provide two functions, “map” and “reduce” (see the sketch below)
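The canonical example is word counting. A minimal sketch of the two user-supplied functions, assuming Hadoop's org.apache.hadoop.mapreduce API (the class names TokenizerMapper and IntSumReducer are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // "map": emit (word, 1) for every word in the input line
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // "reduce": sum the counts collected for each word
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}

Everything else (splitting the input, scheduling tasks, shuffling intermediate (word, count) pairs to the reducers, and retrying failed tasks) is handled by the framework, which is the "hidden from the end-user" part.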
Distributed File System
• Don’t move data to workers… move workers to the data!
  • Store data on the local disks of nodes in the cluster
  • Start up the workers on the node that has the data local
• Why?
  • Not enough RAM to hold all the data in memory
  • Disk access is slow, but disk throughput is reasonable
• A distributed file system is the answer
  • GFS (Google File System)
  • HDFS for Hadoop (= GFS clone); see the client sketch below
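From the application's point of view, HDFS looks like an ordinary file system behind a small client API. A minimal read sketch, assuming the org.apache.hadoop.fs API; the NameNode address hdfs://namenode:9000 and the path /data/input.txt are hypothetical:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Connect to the cluster; blocks are streamed from whichever
    // DataNodes physically store them.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
    try (FSDataInputStream in = fs.open(new Path("/data/input.txt"));
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}

When a MapReduce job runs, the scheduler uses the same block-location metadata to place map tasks on nodes that already hold their input split, which is what "moving workers to the data" means in practice.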
Hadoop: How it Works
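A job ties the two user-supplied functions together and submits them to the cluster. A minimal driver sketch, assuming the WordCount classes above and Hadoop's org.apache.hadoop.mapreduce.Job API (the input and output paths are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class); // map-side pre-aggregation
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/data/input"));     // illustrative paths
    FileOutputFormat.setOutputPath(job, new Path("/data/output"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Reusing the reducer as a combiner works here because summing is associative and commutative; it collapses counts on the map side and cuts the data shuffled across the network.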
[Diagram: the Hadoop ecosystem: Pig, Hive, Chukwa, and HBase layered over MapReduce, HDFS, and ZooKeeper, with Core and Avro underneath]

Hadoop’s stated mission (Doug Cutting interview): Commoditize infrastructure for web-scale, data-intensive applications