Open Discussion: PDBMS vs. MR
◼ PDBMS community:
  1. MapReduce: A major step backwards
  2. A Comparison of Approaches to Large-Scale Data Analysis
  3. MapReduce and Parallel DBMSs: Friends or Foes?
◼ MR community:
  1. MapReduce: A Flexible Data Processing Tool
2021/1/30 School of Software Engineering, Tongji University

PDBMS vs. MR

                      PDBMS                        MR
Schema Support                                     Not out of the box
Indexing
Programming Model     Declarative (SQL)            Imperative (C/C++, Java, …);
                                                   extensions through Pig and Hive
Query Optimization
Flexibility
Fault Tolerance       Coarse-grained techniques

Single Node Architecture
[Figure: a single node made up of CPU, Memory, and Disk, annotated with "Machine Learning, Statistics" and "Classical Data Mining".]

Motivation: Google Example
◼ 20+ billion web pages × 20 KB = 400+ TB
◼ 1 computer reads 30-35 MB/sec from disk
  ◆ ~4 months to read the web
◼ It takes even longer to do something useful with the data!
◼ A standard architecture for such problems has recently emerged:
  ◆ Cluster of commodity Linux nodes
  ◆ Commodity network (Ethernet) to connect them

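The arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch that assumes decimal units (1 KB = 1,000 bytes, 1 MB = 10^6 bytes) and a single machine reading sequentially at a constant rate; the function names `total_size_tb` and `read_time_days` are illustrative and do not come from the slides.

```python
# Back-of-the-envelope check of the slide's numbers.
# Assumptions (not stated on the slide): decimal units, constant read rate.
PAGES = 20e9             # 20+ billion web pages
PAGE_SIZE_BYTES = 20e3   # 20 KB per page
SECONDS_PER_DAY = 86_400


def total_size_tb(pages=PAGES, page_size=PAGE_SIZE_BYTES):
    """Total corpus size in terabytes (10**12 bytes)."""
    return pages * page_size / 1e12


def read_time_days(rate_mb_per_s, pages=PAGES, page_size=PAGE_SIZE_BYTES):
    """Days one machine needs to read the whole corpus at the given rate."""
    total_bytes = pages * page_size
    seconds = total_bytes / (rate_mb_per_s * 1e6)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    print(f"corpus size: {total_size_tb():.0f} TB")
    for rate in (30, 35):
        days = read_time_days(rate)
        print(f"{rate} MB/s: {days:.0f} days (about {days / 30:.1f} months)")
```

At 30-35 MB/sec the read takes roughly 132 to 154 days, which is in line with the slide's "~4 months" and shows why a single machine is not enough for this workload.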
Cluster Architecture
[Figure: nodes, each with CPU, memory (Mem), and disk, grouped into racks; each rack has its own switch, and the rack switches are connected through a backbone switch.]
◼ 2-10 Gbps backbone between racks
◼ 1 Gbps between any pair of nodes in a rack
◼ Each rack contains 16-64 nodes
◼ In 2011 it was guesstimated that Google had 1M machines, http:/bit.l/shHoro
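To get a feel for the bandwidth figures quoted for this architecture, the sketch below estimates how much cross-rack bandwidth each node gets when every node in a rack sends traffic out of the rack at once. It rests on two assumptions that the slides do not state: that the 2-10 Gbps backbone figure is the uplink available to one rack, and that the uplink is shared evenly. The function name `cross_rack_share_gbps` is hypothetical.

```python
# Rough estimate of per-node cross-rack bandwidth.
# Assumptions (not from the slides): the backbone figure is one rack's
# uplink, and all nodes in the rack share it evenly.
IN_RACK_GBPS = 1.0  # bandwidth between any pair of nodes in a rack


def cross_rack_share_gbps(backbone_gbps, nodes_per_rack):
    """Even per-node share of the rack's backbone link, in Gbps."""
    if nodes_per_rack <= 0:
        raise ValueError("nodes_per_rack must be positive")
    return backbone_gbps / nodes_per_rack


if __name__ == "__main__":
    for backbone in (2, 10):
        for nodes in (16, 64):
            share = cross_rack_share_gbps(backbone, nodes)
            print(f"{backbone} Gbps backbone, {nodes} nodes/rack: "
                  f"{share:.3f} Gbps per node "
                  f"(in-rack: {IN_RACK_GBPS:.0f} Gbps)")
```

Under these assumptions the per-node cross-rack share stays below the 1 Gbps available inside a rack for every combination in the quoted ranges, which suggests why keeping computation close to the data matters in such clusters.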