The Google File System

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
Google∗

ABSTRACT

We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.

While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points.

The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.

In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.

Categories and Subject Descriptors
D [4]: 3—Distributed file systems

General Terms
Design, reliability, performance, measurement

Keywords
Fault tolerance, scalability, data storage, clustered storage

∗The authors can be reached at the following addresses: {sanjay,hgobioff,shuntak}@google.com.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SOSP'03, October 19-22, 2003, Bolton Landing, New York, USA.
Copyright 2003 ACM 1-58113-757-5/03/0010 ...$5.00.

1. INTRODUCTION

We have designed and implemented the Google File System (GFS) to meet the rapidly growing demands of Google's data processing needs. GFS shares many of the same goals as previous distributed file systems such as performance, scalability, reliability, and availability. However, its design has been driven by key observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system design assumptions. We have reexamined traditional choices and explored radically different points in the design space.

First, component failures are the norm rather than the exception. The file system consists of hundreds or even thousands of storage machines built from inexpensive commodity parts and is accessed by a comparable number of client machines. The quantity and quality of the components virtually guarantee that some are not functional at any given time and some will not recover from their current failures. We have seen problems caused by application bugs, operating system bugs, human errors, and the failures of disks, memory, connectors, networking, and power supplies. Therefore, constant monitoring, error detection, fault tolerance, and automatic recovery must be integral to the system.
Second, files are huge by traditional standards. Multi-GB files are common. Each file typically contains many application objects such as web documents. When we are regularly working with fast growing data sets of many TBs comprising billions of objects, it is unwieldy to manage billions of approximately KB-sized files even when the file system could support it. As a result, design assumptions and parameters such as I/O operation and block sizes have to be revisited.

Third, most files are mutated by appending new data rather than overwriting existing data. Random writes within a file are practically non-existent. Once written, the files are only read, and often only sequentially. A variety of data share these characteristics. Some may constitute large repositories that data analysis programs scan through. Some may be data streams continuously generated by running applications. Some may be archival data. Some may be intermediate results produced on one machine and processed on another, whether simultaneously or later in time. Given this access pattern on huge files, appending becomes the focus of performance optimization and atomicity guarantees, while caching data blocks in the client loses its appeal.

Fourth, co-designing the applications and the file system API benefits the overall system by increasing our flexibility.
For example, we have relaxed GFS's consistency model to vastly simplify the file system without imposing an onerous burden on the applications. We have also introduced an atomic append operation so that multiple clients can append concurrently to a file without extra synchronization between them. These will be discussed in more detail later in the paper.

Multiple GFS clusters are currently deployed for different purposes. The largest ones have over 1000 storage nodes, over 300 TB of disk storage, and are heavily accessed by hundreds of clients on distinct machines on a continuous basis.

2. DESIGN OVERVIEW

2.1 Assumptions

In designing a file system for our needs, we have been guided by assumptions that offer both challenges and opportunities. We alluded to some key observations earlier and now lay out our assumptions in more detail.

• The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.

• The system stores a modest number of large files. We expect a few million files, each typically 100 MB or larger in size. Multi-GB files are the common case and should be managed efficiently. Small files must be supported, but we need not optimize for them.

• The workloads primarily consist of two kinds of reads: large streaming reads and small random reads. In large streaming reads, individual operations typically read hundreds of KBs, more commonly 1 MB or more. Successive operations from the same client often read through a contiguous region of a file. A small random read typically reads a few KBs at some arbitrary offset. Performance-conscious applications often batch and sort their small reads to advance steadily through the file rather than go back and forth.

• The workloads also have many large, sequential writes that append data to files. Typical operation sizes are similar to those for reads. Once written, files are seldom modified again. Small writes at arbitrary positions in a file are supported but do not have to be efficient.

• The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. Our files are often used as producer-consumer queues or for many-way merging. Hundreds of producers, running one per machine, will concurrently append to a file. Atomicity with minimal synchronization overhead is essential. The file may be read later, or a consumer may be reading through the file simultaneously.

• High sustained bandwidth is more important than low latency. Most of our target applications place a premium on processing data in bulk at a high rate, while few have stringent response time requirements for an individual read or write.

2.2 Interface

GFS provides a familiar file system interface, though it does not implement a standard API such as POSIX. Files are organized hierarchically in directories and identified by pathnames. We support the usual operations to create, delete, open, close, read, and write files.

Moreover, GFS has snapshot and record append operations. Snapshot creates a copy of a file or a directory tree at low cost. Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client's append. It is useful for implementing multi-way merge results and producer-consumer queues that many clients can simultaneously append to without additional locking.
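Since the paper describes these operations but not the client library's actual signatures, the following is a minimal sketch of what such a client-facing interface might look like; all names, signatures, and the FileHandle type are assumptions for illustration.

```python
from abc import ABC, abstractmethod


class FileHandle:
    """Opaque handle returned by open(); purely illustrative."""


class GFSClientInterface(ABC):
    """Hypothetical client-facing interface for the operations named in
    Section 2.2. Names and signatures are assumptions for illustration,
    not the actual GFS client library API."""

    # The usual file operations.
    @abstractmethod
    def create(self, path: str) -> None: ...

    @abstractmethod
    def delete(self, path: str) -> None: ...

    @abstractmethod
    def open(self, path: str) -> FileHandle: ...

    @abstractmethod
    def close(self, handle: FileHandle) -> None: ...

    @abstractmethod
    def read(self, handle: FileHandle, offset: int, length: int) -> bytes: ...

    @abstractmethod
    def write(self, handle: FileHandle, offset: int, data: bytes) -> None: ...

    # GFS-specific operations.
    @abstractmethod
    def snapshot(self, source_path: str, target_path: str) -> None:
        """Create a low-cost copy of a file or directory tree."""

    @abstractmethod
    def record_append(self, handle: FileHandle, record: bytes) -> int:
        """Append the record atomically at least once, at an offset of
        GFS's choosing, and return that offset (see Section 3.3)."""
```

Because GFS chooses the append offset and returns it to the caller, many producers can append to the same file without coordinating offsets among themselves.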
We have found such files to be invaluable in building large distributed applications. Snapshot and record append are discussed further in Sections 3.4 and 3.3 respectively.

2.3 Architecture

A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients, as shown in Figure 1. Each of these is typically a commodity Linux machine running a user-level server process. It is easy to run both a chunkserver and a client on the same machine, as long as machine resources permit and the lower reliability caused by running possibly flaky application code is acceptable.

Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64 bit chunk handle assigned by the master at the time of chunk creation. Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range. For reliability, each chunk is replicated on multiple chunkservers. By default, we store three replicas, though users can designate different replication levels for different regions of the file namespace.

The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.

GFS client code linked into each application implements the file system API and communicates with the master and chunkservers to read or write data on behalf of the application. Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers. We do not provide the POSIX API and therefore need not hook into the Linux vnode layer.

Neither the client nor the chunkserver caches file data. Client caches offer little benefit because most applications stream through huge files or have working sets too large to be cached. Not having them simplifies the client and the overall system by eliminating cache coherence issues. (Clients do cache metadata, however.) Chunkservers need not cache file data because chunks are stored as local files and so Linux's buffer cache already keeps frequently accessed data in memory.

2.4 Single Master

Having a single master vastly simplifies our design and enables the master to make sophisticated chunk placement
and replication decisions using global knowledge. However, we must minimize its involvement in reads and writes so that it does not become a bottleneck. Clients never read and write file data through the master. Instead, a client asks the master which chunkservers it should contact. It caches this information for a limited time and interacts with the chunkservers directly for many subsequent operations.

Figure 1: GFS Architecture. (The figure shows an application linked with the GFS client sending (file name, chunk index) to the GFS master, which holds the file namespace, e.g. /foo/bar, and replies with (chunk handle, chunk locations). The client then sends (chunk handle, byte range) directly to a GFS chunkserver and receives chunk data; chunkservers, such as the one holding chunk 2ef0, store chunks in the Linux file system. The legend distinguishes data messages from control messages such as instructions to chunkservers and chunkserver state reports.)

Let us explain the interactions for a simple read with reference to Figure 1. First, using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file. Then, it sends the master a request containing the file name and chunk index. The master replies with the corresponding chunk handle and locations of the replicas. The client caches this information using the file name and chunk index as the key.

The client then sends a request to one of the replicas, most likely the closest one. The request specifies the chunk handle and a byte range within that chunk. Further reads of the same chunk require no more client-master interaction until the cached information expires or the file is reopened. In fact, the client typically asks for multiple chunks in the same request and the master can also include the information for chunks immediately following those requested. This extra information sidesteps several future client-master interactions at practically no extra cost.

2.5 Chunk Size

Chunk size is one of the key design parameters. We have chosen 64 MB, which is much larger than typical file system block sizes. Each chunk replica is stored as a plain Linux file on a chunkserver and is extended only as needed. Lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest objection against such a large chunk size.

A large chunk size offers several important advantages. First, it reduces clients' need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information. The reduction is especially significant for our workloads because applications mostly read and write large files sequentially. Even for small random reads, the client can comfortably cache all the chunk location information for a multi-TB working set. Second, since a client is more likely to perform many operations on a given large chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time. Third, it reduces the size of the metadata stored on the master. This allows us to keep the metadata in memory, which in turn brings other advantages that we will discuss in Section 2.6.1.

On the other hand, a large chunk size, even with lazy space allocation, has its disadvantages. A small file consists of a small number of chunks, perhaps just one. The chunkservers storing those chunks may become hot spots if many clients are accessing the same file. In practice, hot spots have not been a major issue because our applications mostly read large multi-chunk files sequentially.
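Returning to the read interaction of Section 2.4, the client-side steps can be sketched as follows: translate the application's byte offset into a chunk index using the fixed 64 MB chunk size, consult the location cache, then read directly from a replica. The helper names (master.lookup, replica.read, pick_closest) and the single-chunk simplification are assumptions for illustration, not the real client library.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunk size (Section 2.5)

# Cache of (file_name, chunk_index) -> (chunk_handle, replica_locations);
# the real client keeps such entries only for a limited time.
chunk_cache = {}


def read(master, file_name, offset, length):
    """Sketch of the simple read path of Figure 1; assumes the requested
    range lies within a single chunk."""
    # 1. Translate the byte offset specified by the application into a
    #    chunk index within the file.
    chunk_index = offset // CHUNK_SIZE

    # 2. Ask the master for the chunk handle and replica locations, unless
    #    the answer is already cached. (The real client typically asks for
    #    several chunks at once; omitted here.)
    key = (file_name, chunk_index)
    if key not in chunk_cache:
        chunk_cache[key] = master.lookup(file_name, chunk_index)
    chunk_handle, replicas = chunk_cache[key]

    # 3. Read the byte range directly from one replica, most likely the
    #    closest one; no file data flows through the master.
    replica = pick_closest(replicas)
    return replica.read(chunk_handle, offset % CHUNK_SIZE, length)


def pick_closest(replicas):
    # Placeholder for a network-proximity heuristic.
    return replicas[0]
```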
However, hot spots did develop when GFS was first used by a batch-queue system: an executable was written to GFS as a single-chunk file and then started on hundreds of machines at the same time. The few chunkservers storing this executable were overloaded by hundreds of simultaneous requests. We fixed this problem by storing such executables with a higher replication factor and by making the batch-queue system stagger application start times. A potential long-term solution is to allow clients to read data from other clients in such situations.

2.6 Metadata

The master stores three major types of metadata: the file and chunk namespaces, the mapping from files to chunks, and the locations of each chunk's replicas. All metadata is kept in the master's memory. The first two types (namespaces and file-to-chunk mapping) are also kept persistent by logging mutations to an operation log stored on the master's local disk and replicated on remote machines. Using a log allows us to update the master state simply, reliably, and without risking inconsistencies in the event of a master crash. The master does not store chunk location information persistently. Instead, it asks each chunkserver about its chunks at master startup and whenever a chunkserver joins the cluster.

2.6.1 In-Memory Data Structures

Since metadata is stored in memory, master operations are fast. Furthermore, it is easy and efficient for the master to periodically scan through its entire state in the background. This periodic scanning is used to implement chunk garbage collection, re-replication in the presence of chunkserver failures, and chunk migration to balance load and disk space
usage across chunkservers. Sections 4.3 and 4.4 will discuss these activities further.

One potential concern for this memory-only approach is that the number of chunks and hence the capacity of the whole system is limited by how much memory the master has. This is not a serious limitation in practice. The master maintains less than 64 bytes of metadata for each 64 MB chunk. Most chunks are full because most files contain many chunks, only the last of which may be partially filled. Similarly, the file namespace data typically requires less than 64 bytes per file because it stores file names compactly using prefix compression.

If necessary to support even larger file systems, the cost of adding extra memory to the master is a small price to pay for the simplicity, reliability, performance, and flexibility we gain by storing the metadata in memory.

2.6.2 Chunk Locations

The master does not keep a persistent record of which chunkservers have a replica of a given chunk. It simply polls chunkservers for that information at startup. The master can keep itself up-to-date thereafter because it controls all chunk placement and monitors chunkserver status with regular HeartBeat messages.

We initially attempted to keep chunk location information persistently at the master, but we decided that it was much simpler to request the data from chunkservers at startup, and periodically thereafter. This eliminated the problem of keeping the master and chunkservers in sync as chunkservers join and leave the cluster, change names, fail, restart, and so on. In a cluster with hundreds of servers, these events happen all too often.

Another way to understand this design decision is to realize that a chunkserver has the final word over what chunks it does or does not have on its own disks. There is no point in trying to maintain a consistent view of this information on the master because errors on a chunkserver may cause chunks to vanish spontaneously (e.g., a disk may go bad and be disabled) or an operator may rename a chunkserver.

2.6.3 Operation Log

The operation log contains a historical record of critical metadata changes. It is central to GFS. Not only is it the only persistent record of metadata, but it also serves as a logical time line that defines the order of concurrent operations. Files and chunks, as well as their versions (see Section 4.5), are all uniquely and eternally identified by the logical times at which they were created.

Since the operation log is critical, we must store it reliably and not make changes visible to clients until metadata changes are made persistent. Otherwise, we effectively lose the whole file system or recent client operations even if the chunks themselves survive. Therefore, we replicate it on multiple remote machines and respond to a client operation only after flushing the corresponding log record to disk both locally and remotely. The master batches several log records together before flushing, thereby reducing the impact of flushing and replication on overall system throughput.

The master recovers its file system state by replaying the operation log. To minimize startup time, we must keep the log small.
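A minimal sketch of this logging discipline, under assumed names: batch several records, flush them to local disk and to the remote replicas, and only then acknowledge the corresponding client operations. The paper does not show the master's implementation, so the class and methods below are hypothetical.

```python
import threading


class OperationLog:
    """Hypothetical write-ahead log for master metadata mutations
    (Section 2.6.3): a change is acknowledged to clients only after its
    record is durable both locally and on the remote log replicas."""

    def __init__(self, local_log_file, remote_replicas):
        self.local_log_file = local_log_file    # file object on local disk
        self.remote_replicas = remote_replicas  # stubs for remote log copies
        self.pending = []                       # records awaiting a flush
        self.lock = threading.Lock()

    def append(self, record: bytes) -> None:
        """Queue a log record; the change is not yet visible to clients."""
        with self.lock:
            self.pending.append(record)

    def flush(self) -> None:
        """Group commit: several records are flushed together to reduce
        the impact of flushing and replication on throughput."""
        with self.lock:
            batch, self.pending = self.pending, []
        for record in batch:
            self.local_log_file.write(record)
        self.local_log_file.flush()             # a real master would also fsync
        for replica in self.remote_replicas:
            replica.replicate(batch)            # hypothetical replication RPC
        # Only now may the master apply these mutations and reply to the
        # clients that issued them.
```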
The master checkpoints its state whenever the log grows beyond a certain size so that it can recover by loading the latest checkpoint from local disk and replaying only the limited number of log records after that. The checkpoint is in a compact B-tree like form that can be directly mapped into memory and used for namespace lookup without extra parsing. This further speeds up recovery and improves availability.

Because building a checkpoint can take a while, the master's internal state is structured in such a way that a new checkpoint can be created without delaying incoming mutations. The master switches to a new log file and creates the new checkpoint in a separate thread. The new checkpoint includes all mutations before the switch. It can be created in a minute or so for a cluster with a few million files. When completed, it is written to disk both locally and remotely.

Recovery needs only the latest complete checkpoint and subsequent log files. Older checkpoints and log files can be freely deleted, though we keep a few around to guard against catastrophes. A failure during checkpointing does not affect correctness because the recovery code detects and skips incomplete checkpoints.

2.7 Consistency Model

GFS has a relaxed consistency model that supports our highly distributed applications well but remains relatively simple and efficient to implement. We now discuss GFS's guarantees and what they mean to applications. We also highlight how GFS maintains these guarantees but leave the details to other parts of the paper.

2.7.1 Guarantees by GFS

File namespace mutations (e.g., file creation) are atomic. They are handled exclusively by the master: namespace locking guarantees atomicity and correctness (Section 4.1); the master's operation log defines a global total order of these operations (Section 2.6.3).

The state of a file region after a data mutation depends on the type of mutation, whether it succeeds or fails, and whether there are concurrent mutations. Table 1 summarizes the result.

Table 1: File Region State After Mutation
                          Write                       Record Append
  Serial success          defined                     defined interspersed
                                                      with inconsistent
  Concurrent successes    consistent but undefined    defined interspersed
                                                      with inconsistent
  Failure                 inconsistent                inconsistent

A file region is consistent if all clients will always see the same data, regardless of which replicas they read from. A region is defined after a file data mutation if it is consistent and clients will see what the mutation writes in its entirety. When a mutation succeeds without interference from concurrent writers, the affected region is defined (and by implication consistent): all clients will always see what the mutation has written. Concurrent successful mutations leave the region undefined but consistent: all clients see the same data, but it may not reflect what any one mutation has written. Typically, it consists of mingled fragments from multiple mutations. A failed mutation makes the region inconsistent (hence also undefined): different clients may see different data at different times. We describe below how our applications can distinguish defined regions from undefined
regions. The applications do not need to further distinguish between different kinds of undefined regions.

Data mutations may be writes or record appends. A write causes data to be written at an application-specified file offset. A record append causes data (the "record") to be appended atomically at least once even in the presence of concurrent mutations, but at an offset of GFS's choosing (Section 3.3). (In contrast, a "regular" append is merely a write at an offset that the client believes to be the current end of file.) The offset is returned to the client and marks the beginning of a defined region that contains the record. In addition, GFS may insert padding or record duplicates in between. They occupy regions considered to be inconsistent and are typically dwarfed by the amount of user data.

After a sequence of successful mutations, the mutated file region is guaranteed to be defined and contain the data written by the last mutation. GFS achieves this by (a) applying mutations to a chunk in the same order on all its replicas (Section 3.1), and (b) using chunk version numbers to detect any replica that has become stale because it has missed mutations while its chunkserver was down (Section 4.5). Stale replicas will never be involved in a mutation or given to clients asking the master for chunk locations. They are garbage collected at the earliest opportunity.

Since clients cache chunk locations, they may read from a stale replica before that information is refreshed. This window is limited by the cache entry's timeout and the next open of the file, which purges from the cache all chunk information for that file. Moreover, as most of our files are append-only, a stale replica usually returns a premature end of chunk rather than outdated data. When a reader retries and contacts the master, it will immediately get current chunk locations.

Long after a successful mutation, component failures can of course still corrupt or destroy data. GFS identifies failed chunkservers by regular handshakes between master and all chunkservers and detects data corruption by checksumming (Section 5.2). Once a problem surfaces, the data is restored from valid replicas as soon as possible (Section 4.3). A chunk is lost irreversibly only if all its replicas are lost before GFS can react, typically within minutes. Even in this case, it becomes unavailable, not corrupted: applications receive clear errors rather than corrupt data.

2.7.2 Implications for Applications

GFS applications can accommodate the relaxed consistency model with a few simple techniques already needed for other purposes: relying on appends rather than overwrites, checkpointing, and writing self-validating, self-identifying records.

Practically all our applications mutate files by appending rather than overwriting. In one typical use, a writer generates a file from beginning to end. It atomically renames the file to a permanent name after writing all the data, or periodically checkpoints how much has been successfully written. Checkpoints may also include application-level checksums. Readers verify and process only the file region up to the last checkpoint, which is known to be in the defined state. Regardless of consistency and concurrency issues, this approach has served us well. Appending is far more efficient and more resilient to application failures than random writes.
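The two writer-side techniques just described (write everything and then atomically rename to the permanent name, or periodically record a checkpoint that bounds how far readers may go) can be illustrated with the local file system standing in for GFS; the single-integer checkpoint format below is invented for the example.

```python
import os


def publish_atomically(tmp_path: str, final_path: str, produce) -> None:
    """Write a file from beginning to end under a temporary name, then
    atomically rename it so readers only ever see a complete file."""
    with open(tmp_path, "wb") as f:
        produce(f)                   # caller writes all the data
        f.flush()
        os.fsync(f.fileno())
    os.rename(tmp_path, final_path)  # atomic rename to the permanent name


def read_defined_prefix(data_path: str, checkpoint_path: str) -> bytes:
    """Read only up to the last checkpoint, which records how many bytes
    have been successfully written and are known to be in the defined
    state (hypothetical single-integer checkpoint format)."""
    with open(checkpoint_path, "r") as f:
        valid_length = int(f.read().strip())
    with open(data_path, "rb") as f:
        return f.read(valid_length)
```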
Checkpointing allows writers to restart incrementally and keeps readers from processing successfully written file data that is still incomplete from the application's perspective.

In the other typical use, many writers concurrently append to a file for merged results or as a producer-consumer queue. Record append's append-at-least-once semantics preserves each writer's output. Readers deal with the occasional padding and duplicates as follows. Each record prepared by the writer contains extra information like checksums so that its validity can be verified. A reader can identify and discard extra padding and record fragments using the checksums. If it cannot tolerate the occasional duplicates (e.g., if they would trigger non-idempotent operations), it can filter them out using unique identifiers in the records, which are often needed anyway to name corresponding application entities such as web documents. These functionalities for record I/O (except duplicate removal) are in library code shared by our applications and applicable to other file interface implementations at Google. With that, the same sequence of records, plus rare duplicates, is always delivered to the record reader.

3. SYSTEM INTERACTIONS

We designed the system to minimize the master's involvement in all operations. With that background, we now describe how the client, master, and chunkservers interact to implement data mutations, atomic record append, and snapshot.

3.1 Leases and Mutation Order

A mutation is an operation that changes the contents or metadata of a chunk such as a write or an append operation. Each mutation is performed at all the chunk's replicas. We use leases to maintain a consistent mutation order across replicas. The master grants a chunk lease to one of the replicas, which we call the primary. The primary picks a serial order for all mutations to the chunk. All replicas follow this order when applying mutations. Thus, the global mutation order is defined first by the lease grant order chosen by the master, and within a lease by the serial numbers assigned by the primary.

The lease mechanism is designed to minimize management overhead at the master. A lease has an initial timeout of 60 seconds. However, as long as the chunk is being mutated, the primary can request and typically receive extensions from the master indefinitely. These extension requests and grants are piggybacked on the HeartBeat messages regularly exchanged between the master and all chunkservers. The master may sometimes try to revoke a lease before it expires (e.g., when the master wants to disable mutations on a file that is being renamed). Even if the master loses communication with a primary, it can safely grant a new lease to another replica after the old lease expires.

In Figure 2, we illustrate this process by following the control flow of a write through these numbered steps.

1. The client asks the master which chunkserver holds the current lease for the chunk and the locations of the other replicas. If no one has a lease, the master grants one to a replica it chooses (not shown).

2. The master replies with the identity of the primary and the locations of the other (secondary) replicas. The client caches this data for future mutations. It needs to contact the master again only when the primary