Hadoop Distributed File System(HDFS) NameNode Maps a filename to list of Block IDs Maps each Block ID to DataNodes containing a replica of the block DataNode:Maps a Block ID to a physical location on disk Data Coherency Write-once-read-many access model Client can only append to existing files Distributed file systems good for millions of large files But have very high overheads and poor performance with billions of smaller tuples Database System Concepts-7th Edition 10.7 ©Silberscha乜,Korth and Sudarshan
Database System Concepts - 7 10.7 ©Silberschatz, Korth and Sudarshan th Edition Hadoop Distributed File System (HDFS) ▪ NameNode • Maps a filename to list of Block IDs • Maps each Block ID to DataNodes containing a replica of the block ▪ DataNode: Maps a Block ID to a physical location on disk ▪ Data Coherency • Write-once-read-many access model • Client can only append to existing files ▪ Distributed file systems good for millions of large files • But have very high overheads and poor performance with billions of smaller tuples
Sharding Sharding:partition data across multiple databases Partitioning usually done on some partitioning attributes(also known as partitioning keys or shard keys e.g.user ID E.g.,records with key values from 1 to 100,000 on database 1, records with key values from 100,001 to 200,000 on database 2,etc. Application must track which records are on which database and send queries/updates to that database Positives:scales well,easy to implement Drawbacks: Not transparent:application has to deal with routing of queries, queries that span multiple databases When a database is overloaded,moving part of its load out is not easy Chance of failure more with more databases need to keep replicas to ensure availability,which is more work for application Database System Concepts-7th Edition 10.8 @Silberschatz,Korth and Sudarshan
Database System Concepts - 7 10.8 ©Silberschatz, Korth and Sudarshan th Edition Sharding ▪ Sharding: partition data across multiple databases ▪ Partitioning usually done on some partitioning attributes (also known as partitioning keys or shard keys e.g. user ID • E.g., records with key values from 1 to 100,000 on database 1, records with key values from 100,001 to 200,000 on database 2, etc. ▪ Application must track which records are on which database and send queries/updates to that database ▪ Positives: scales well, easy to implement ▪ Drawbacks: • Not transparent: application has to deal with routing of queries, queries that span multiple databases • When a database is overloaded, moving part of its load out is not easy • Chance of failure more with more databases ▪ need to keep replicas to ensure availability, which is more work for application
Key Value Storage Systems Key-value storage systems store large numbers(billions or even more)of small (KB-MB)sized records Records are partitioned across multiple machines and Queries are routed by the system to appropriate machine Records are also replicated across multiple machines,to ensure availability even if a machine fails Key-value stores ensure that updates are applied to all replicas,to ensure that their values are consistent Database System Concepts-7th Edition 10.10 @Silberschatz,Korth and Sudarshan
Database System Concepts - 7 10.10 ©Silberschatz, Korth and Sudarshan th Edition Key Value Storage Systems ▪ Key-value storage systems store large numbers (billions or even more) of small (KB-MB) sized records ▪ Records are partitioned across multiple machines and ▪ Queries are routed by the system to appropriate machine ▪ Records are also replicated across multiple machines, to ensure availability even if a machine fails • Key-value stores ensure that updates are applied to all replicas, to ensure that their values are consistent
Key Value Storage Systems Key-value stores may store uninterpreted bytes,with an associated key E.g.,Amazon S3,Amazon Dynamo Wide-table(can have arbitrarily many attribute names)with associated key Google BigTable,Apache Cassandra,Apache Hbase, Amazon DynamoDB Allows some operations(e.g.,filtering)to execute on storage node ·JSON MongoDB,CouchDB(document model) Document stores store semi-structured data,typically JSON Some key-value stores support multiple versions of data,with timestamps/version numbers Database System Concepts-7th Edition 10.11 ©Silberscha乜,Korth and Sudarshan
Database System Concepts - 7 10.11 ©Silberschatz, Korth and Sudarshan th Edition Key Value Storage Systems ▪ Key-value stores may store • uninterpreted bytes, with an associated key ▪ E.g., Amazon S3, Amazon Dynamo • Wide-table (can have arbitrarily many attribute names) with associated key • Google BigTable, Apache Cassandra, Apache Hbase, Amazon DynamoDB • Allows some operations (e.g., filtering) to execute on storage node • JSON ▪ MongoDB, CouchDB (document model) ▪ Document stores store semi-structured data, typically JSON ▪ Some key-value stores support multiple versions of data, with timestamps/version numbers
Data Representation An example of a JSON object is: { "ID":"22222", "name": "firstname:"Albert", "lastname:"Einstein" }, "deptname":"Physics", "children": {"firstname":"Hans","lastname":"Einstein"}, "firstname":"Eduard","lastname":"Einstein"} } Database System Concepts-7th Edition 10.12 ©Silberscha乜,Korth and Sudarshan
Database System Concepts - 7 10.12 ©Silberschatz, Korth and Sudarshan th Edition Data Representation ▪ An example of a JSON object is: { "ID": "22222", "name": { "firstname: "Albert", "lastname: "Einstein" }, "deptname": "Physics", "children": [ { "firstname": "Hans", "lastname": "Einstein" }, { "firstname": "Eduard", "lastname": "Einstein" } ] }