Motivation
▪ Very large volumes of data being collected
  • Driven by growth of web, social media, and more recently internet-of-things
  • Web logs were an early source of data
    ▪ Analytics on web logs has great value for advertisements, web site structuring, what posts to show to a user, etc.
▪ Big Data: differentiated from data handled by earlier generation databases
  • Volume: much larger amounts of data stored
  • Velocity: much higher rates of insertions
  • Variety: many types of data, beyond relational data
Database System Concepts - 7th Edition 10.2 ©Silberschatz, Korth and Sudarshan
Querying Big Data
▪ Transaction processing systems that need very high scalability
  • Many applications are willing to sacrifice ACID properties and other database features if they can get very high scalability
▪ Query processing systems that
  • Need very high scalability, and
  • Need to support non-relational data
Big Data Storage Systems
▪ Distributed file systems
▪ Sharding across multiple databases
▪ Key-value storage systems
▪ Parallel and distributed databases
Distributed File Systems
▪ A distributed file system stores data across a large collection of machines, but provides a single file-system view
▪ Highly scalable distributed file system for large data-intensive applications
  • E.g., 10K nodes, 100 million files, 10 PB
▪ Provides redundant storage of massive amounts of data on cheap and unreliable computers
  • Files are replicated to handle hardware failure
  • Detects failures and recovers from them
▪ Examples:
  • Google File System (GFS)
  • Hadoop File System (HDFS)
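The replication and recovery bullets above can be illustrated with a toy model. This is only a sketch under stated assumptions, not how GFS or HDFS is implemented: the replication factor of 3, the node and block names, and the function re_replicate are all hypothetical and do not come from the slides.

```python
# Toy model of replica recovery in a distributed file system.
# Node names, block names, and the replication factor are illustrative assumptions.

REPLICATION_FACTOR = 3  # assumed; the slides do not give a number


def re_replicate(block_locations, live_nodes, failed_node):
    """Drop failed_node from every block's replica list and assign
    under-replicated blocks to other live nodes."""
    live = sorted(n for n in live_nodes if n != failed_node)
    repaired = {}
    for block, nodes in block_locations.items():
        survivors = [n for n in nodes if n != failed_node]
        # Candidate targets: live nodes that do not already hold this block.
        candidates = [n for n in live if n not in survivors]
        missing = max(0, REPLICATION_FACTOR - len(survivors))
        repaired[block] = survivors + candidates[:missing]
    return repaired


block_locations = {
    "blk_1": ["node1", "node2", "node3"],
    "blk_2": ["node2", "node3", "node4"],
}
live_nodes = {"node1", "node2", "node3", "node4", "node5"}

# Simulate the loss of node2 and restore three replicas per block.
repaired = re_replicate(block_locations, live_nodes, failed_node="node2")
for block, nodes in repaired.items():
    print(block, nodes)
```

In this model every block that lost a replica on the failed node gets a new copy on a surviving node, so the data stays available even though individual machines are unreliable.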
Hadoop File System Architecture
▪ Single Namespace for entire cluster
▪ Files are broken up into blocks
  • Typically 64 MB block size
  • Each block replicated on multiple DataNodes
▪ Client
  • Finds location of blocks from NameNode
  • Accesses data directly from DataNode
[Figure: HDFS architecture. Labels: NameNode and BackupNode, each with Metadata (name, replicas, ...); Clients; Metadata Ops; Block Read; Block Write; Replication; DataNodes holding Blocks; Rack 1; Rack 2.]
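The read path described above (split a file into blocks, ask the NameNode where the blocks live, then read from DataNodes directly) can be sketched as a small in-memory model. This is not the Hadoop API: ToyNameNode, plan_read, the round-robin placement, and the replication factor of 3 are assumptions made for illustration; only the 64 MB block size comes from the slide.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical block size on the slide
REPLICATION = 3                # assumed; the slide only says "multiple DataNodes"


class ToyNameNode:
    """Holds metadata only: which blocks make up a file and where replicas live."""

    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.files = {}  # file name -> list of (block_id, [datanode names])

    def add_file(self, name, size_bytes):
        # Ceiling division: the last block may be only partly full.
        num_blocks = (size_bytes + BLOCK_SIZE - 1) // BLOCK_SIZE
        blocks = []
        for i in range(num_blocks):
            # Round-robin placement (an assumption; real policies are rack-aware).
            replicas = [self.datanodes[(i + r) % len(self.datanodes)]
                        for r in range(REPLICATION)]
            blocks.append((f"{name}#blk{i}", replicas))
        self.files[name] = blocks

    def get_block_locations(self, name):
        return self.files[name]


def plan_read(namenode, name):
    """Client side: one metadata lookup, then one direct DataNode read per block."""
    return [(block_id, replicas[0])
            for block_id, replicas in namenode.get_block_locations(name)]


nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
nn.add_file("weblog", 200 * 1024 * 1024)  # 200 MB needs 4 blocks of 64 MB
for block_id, datanode in plan_read(nn, "weblog"):
    print(block_id, "read from", datanode)
```

The point of the design is that the NameNode serves only small metadata requests, while the bulk data transfer goes straight between the client and the DataNodes.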