2021/1/30, School of Software Engineering, Tongji University (同济大学软件学院)

Distributed File System
◼ Chunk servers
  ◆ File is split into contiguous chunks
  ◆ Typically each chunk is 16-64 MB
  ◆ Each chunk is replicated (usually 2x or 3x)
  ◆ Try to keep replicas in different racks
◼ Master node
  ◆ a.k.a. NameNode in Hadoop's HDFS
  ◆ Stores metadata about where files are stored
  ◆ Might be replicated
◼ Client library for file access
  ◆ Talks to master to find chunk servers
  ◆ Connects directly to chunk servers to access data
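The division of labor between master, chunk servers, and client can be sketched in a few lines of Python. This is only a toy model, not GFS or HDFS code: the `Master` class, its `add_file`, `_place`, and `locate` methods, the server and rack names, and the round-robin placement policy are all assumptions made for illustration. The 64 MB chunk size and 3x replication are taken from the ranges on the slide.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, upper end of the 16-64 MB range
REPLICATION = 3                # "usually 2x or 3x"


@dataclass
class Master:
    """Toy master node: holds metadata only, never file contents."""
    servers: dict                                   # server name -> rack name
    chunk_map: dict = field(default_factory=dict)   # (path, chunk index) -> [servers]

    def add_file(self, path, size_bytes):
        """Split a file into contiguous chunks and place replicas for each."""
        n_chunks = max(1, -(-size_bytes // CHUNK_SIZE))  # ceiling division
        names = sorted(self.servers)
        for i in range(n_chunks):
            self.chunk_map[(path, i)] = self._place(names, start=i)
        return n_chunks

    def _place(self, names, start):
        """Hypothetical policy: rotate the start server, prefer distinct racks."""
        k = start % len(names)
        rotated = names[k:] + names[:k]
        chosen, racks = [], set()
        for s in rotated:                      # first pass: one replica per rack
            if len(chosen) == REPLICATION:
                break
            if self.servers[s] not in racks:
                chosen.append(s)
                racks.add(self.servers[s])
        for s in rotated:                      # second pass: fill up if racks ran out
            if len(chosen) == REPLICATION:
                break
            if s not in chosen:
                chosen.append(s)
        return chosen

    def locate(self, path, offset):
        """What a client asks the master: which servers hold this byte offset?"""
        return self.chunk_map[(path, offset // CHUNK_SIZE)]
```

A client would call `locate` once to learn the replica list, then read the bytes directly from one of those chunk servers, so the master stays out of the data path.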
Distributed File System
◼ Reliable distributed file system
  ◆ Data kept in "chunks" spread across machines
  ◆ Each chunk replicated on different machines
  ◆ Seamless recovery from disk or machine failure

[Figure: chunk replicas spread across Chunk server 1, Chunk server 2, Chunk server 3, ..., Chunk server N]

◼ Bring computation directly to the data
  ◆ Chunk servers also serve as compute servers
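Recovery from a machine failure can be illustrated with a small sketch. It assumes the master keeps a map from chunk id to the list of servers holding a replica. The function name `handle_server_failure`, the chunk ids, and the "pick the first eligible live server" policy are hypothetical, chosen only to show the idea of re-replicating from a surviving copy.

```python
def handle_server_failure(chunk_map, failed, live_servers, replication=3):
    """Drop `failed` from every replica list and schedule new copies.

    chunk_map:    chunk id -> list of servers holding a replica (mutated in place)
    live_servers: servers that are still reachable
    Returns a list of (chunk id, source server, target server) copy operations.
    """
    copies = []
    for chunk_id, replicas in chunk_map.items():
        if failed not in replicas:
            continue
        replicas.remove(failed)
        if not replicas:
            # Every replica was on the failed machine: the data is gone.
            raise RuntimeError(f"chunk {chunk_id} lost: no surviving replica")
        # Servers that are alive and do not already hold this chunk.
        candidates = [s for s in live_servers if s not in replicas and s != failed]
        while len(replicas) < replication and candidates:
            target = candidates.pop(0)
            copies.append((chunk_id, replicas[0], target))  # copy from a survivor
            replicas.append(target)
    return copies
```

Because each chunk already lives on several machines, reads can continue from the surviving replicas while the new copies are made in the background.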
Basic Idea
◼ Issue: Copying data over a network takes time
◼ Idea:
  ◆ Bring computation to data
  ◆ Store files multiple times for reliability
◼ MapReduce addresses these problems
  ◆ Storage infrastructure: file system
      Google: GFS
      Hadoop: HDFS
  ◆ Programming model: MapReduce (covered next)
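"Bring computation to data" can be made concrete with a greedy scheduling sketch: run each task on a server that already stores the chunk it needs, and fall back to a remote server only when no replica holder has a free slot. The function `assign_tasks`, the slot counts, and the fallback rule are assumptions for illustration. Real schedulers are considerably more elaborate.

```python
def assign_tasks(chunk_locations, free_slots):
    """Greedy locality-aware assignment.

    chunk_locations: chunk id -> list of servers holding a replica
    free_slots:      server -> number of free task slots (mutated in place)
    Returns chunk id -> (server, is_local).
    """
    assignment = {}
    for chunk_id, holders in chunk_locations.items():
        # Prefer any replica holder that still has a free slot.
        local = next((s for s in holders if free_slots.get(s, 0) > 0), None)
        if local is not None:
            server, is_local = local, True
        else:
            # No local slot: pick the least-loaded server; the chunk must
            # then be read over the network.
            server = max(free_slots, key=free_slots.get)
            if free_slots[server] == 0:
                raise RuntimeError("no free task slots on any server")
            is_local = False
        free_slots[server] -= 1
        assignment[chunk_id] = (server, is_local)
    return assignment
```

Storing each chunk on several machines helps here as well: more replicas mean more servers on which a task can run without any network copy.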
What is HDFS (Hadoop Distributed File System)?
◼ HDFS is a distributed file system
  ◆ Makes some unique tradeoffs that are good for MapReduce
◼ What HDFS does well:
  ◆ Very large read-only or append-only files (individual files may contain gigabytes or terabytes of data)
  ◆ Sequential access patterns
◼ What HDFS does not do well:
  ◆ Storing lots of small files
  ◆ Low-latency access
  ◆ Multiple writers
  ◆ Writing to arbitrary offsets in the file

University of Pennsylvania
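One way to see why lots of small files are a poor fit is to count the metadata entries the master has to track. The sketch below assumes one entry per file plus one per block, and a 64 MB block size. Both the accounting model and the helper name `namespace_objects` are illustrative assumptions, not HDFS internals.

```python
import math

MiB = 1024 * 1024


def namespace_objects(file_sizes, block_size=64 * MiB):
    """Count metadata entries: one per file plus one per block (assumed model)."""
    blocks = sum(max(1, math.ceil(size / block_size)) for size in file_sizes)
    return len(file_sizes) + blocks


# The same 1 GiB of data, stored two different ways.
one_large = namespace_objects([1024 * MiB])    # a single 1 GiB file
many_small = namespace_objects([MiB] * 1024)   # 1024 files of 1 MiB each

print(one_large, many_small)  # 17 2048
```

Under these assumptions the same amount of data costs over a hundred times more metadata when stored as small files, and every one of those entries is bookkeeping on the single master node.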
HDFS versus NFS

Network File System (NFS)
◼ Single machine makes part of its file system available to other machines
◼ Sequential or random access
◼ PRO: Simplicity, generality, transparency
◼ CON: Storage capacity and throughput limited by single server

Hadoop Distributed File System (HDFS)
◼ Single virtual file system spread over many machines
◼ Optimized for sequential read and local accesses
◼ PRO: High throughput, high capacity
◼ CON: Specialized for particular types of applications