2021/1/30 26 School of Software Engineering, Tongji University
Google server room in Council Bluffs, Iowa. Data centers consume up to 1.5 percent of all the world's electricity. The huge fans sound like jet engines jacked through Marshall amps.
2021/1/30 27 A central cooling plant in Google's Douglas County, Georgia, data center
http://www.wired.com/wiredenterprise/2012/10/ff-inside-google-data-center/all/
2021/1/30 28 Large-scale Computing
◼ Large-scale computing for data mining problems on commodity hardware
◼ Challenges:
◆ How do you distribute computation?
◆ How can we make it easy to write distributed programs?
◆ Machines fail (fault tolerance):
➢ One server may stay up 3 years (~1,000 days)
➢ If you have 1,000 servers, expect to lose 1 per day
➢ With 1M machines, 1,000 machines fail every day!
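The failure numbers above follow from a simple back-of-the-envelope calculation. A minimal sketch (the helper name and the simplifying assumption that failures are uniform over a ~1,000-day mean lifetime are ours, not from the slides):

```python
# Expected daily machine failures, assuming each server lives ~1,000 days
# on average, i.e. roughly a 1/1000 chance of failing on any given day.
def expected_daily_failures(num_servers, mean_lifetime_days=1000):
    return num_servers / mean_lifetime_days

print(expected_daily_failures(1_000))      # 1,000 servers  -> ~1 failure/day
print(expected_daily_failures(1_000_000))  # 1M machines    -> ~1,000 failures/day
```

At Google scale, failure is therefore the normal case, which is why fault tolerance must be built into the software rather than treated as an exception.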
2021/1/30 29 Basic Idea
◼ Issue: Copying data over a network takes time
◼ Idea:
◆ Bring computation to data
◆ Store files multiple times for reliability
◼ MapReduce addresses these problems:
◆ Storage infrastructure – a distributed file system (Google: GFS; Hadoop: HDFS)
◆ Programming model – MapReduce
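The MapReduce programming model named above can be sketched in a few lines. This is a single-machine toy for word count, not the real distributed runtime: in Hadoop, map tasks run on the nodes that hold the data splits, the framework shuffles intermediate pairs by key across the network, and reduce tasks aggregate each key's group. The function names here are illustrative:

```python
from collections import defaultdict

def map_fn(document):
    # Map: emit an intermediate (word, 1) pair for every word in the split.
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: aggregate all values seen for one key.
    return (word, sum(counts))

def map_reduce(documents):
    groups = defaultdict(list)            # stand-in for the shuffle phase
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)     # group intermediate pairs by key
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(map_reduce(["to be or not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The programmer only writes `map_fn` and `reduce_fn`; the framework handles distribution, shuffling, and recovery from the machine failures described on the previous slide.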
2021/1/30 30 Storage Infrastructure
◼ Problem: If nodes fail, how do we store data persistently?
◼ Answer: a distributed file system
➢ Provides a global file namespace
◼ Typical usage pattern (the key assumption):
◆ Huge files (100s of GB to TB)
◆ Data is rarely updated in place
◆ Reads and appends are common
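A rough sketch of why a distributed file system stores each chunk multiple times. Assuming node failures are independent and ignoring repair time (both simplifications we introduce here, not claims from the slides), a chunk replicated on r nodes is lost only if all r replicas fail together:

```python
# Probability that a chunk is lost, given each replica's node fails
# independently with probability p in the window of interest.
def chunk_loss_probability(p, replicas):
    return p ** replicas

p = 1 / 1000  # the ~1/1000 daily failure rate from the earlier slide
print(chunk_loss_probability(p, 1))  # ~1e-3: a single copy is fragile
print(chunk_loss_probability(p, 3))  # ~1e-9 with GFS/HDFS-style 3x replication
```

Combined with the append-mostly workload above, this is why GFS and HDFS favor large, replicated, rarely-rewritten chunks.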