The Google File System

wangfans · posted on 2013-6-26 17:04

Another way to understand this design decision is to realize
that a chunkserver has the final word over what chunks
it does or does not have on its own disks. There is no point
in trying to maintain a consistent view of this information
on the master because errors on a chunkserver may cause
chunks to vanish spontaneously (e.g., a disk may go bad
and be disabled) or an operator may rename a chunkserver.
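As a rough illustration of "the chunkserver has the final word", the sketch below shows a master-side view that simply overwrites its beliefs with whatever a chunkserver reports. The class name `MasterChunkView`, the method `apply_report`, and the report format are hypothetical and are not taken from the paper.

```python
class MasterChunkView:
    """Toy in-memory map of chunk handle -> set of chunkserver names."""

    def __init__(self):
        self.locations = {}

    def apply_report(self, server, chunks_on_disk):
        """Treat the chunkserver's report as authoritative for that server."""
        reported = set(chunks_on_disk)
        # Forget every chunk this server no longer claims to hold.
        for handle in list(self.locations):
            holders = self.locations[handle]
            if server in holders and handle not in reported:
                holders.discard(server)
                if not holders:
                    del self.locations[handle]
        # Record every chunk the server does claim to hold.
        for handle in reported:
            self.locations.setdefault(handle, set()).add(server)
```

If a disk fails and a chunk disappears, the next report from that server no longer lists it, and the master's view corrects itself without any persistent bookkeeping.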

wangfans · posted on 2013-6-30 14:57

2.6.3 Operation Log
The operation log contains a historical record of critical
metadata changes. It is central to GFS. Not only is it the
only persistent record of metadata, but it also serves as a
logical time line that defines the order of concurrent operations.
Files and chunks, as well as their versions (see
Section 4.5), are all uniquely and eternally identified by the
logical times at which they were created.

wangfans · posted on 2013-6-30 14:57

Since the operation log is critical, we must store it reliably
and not make changes visible to clients until metadata
changes are made persistent. Otherwise, we effectively lose
the whole file system or recent client operations even if the
chunks themselves survive.

wangfans · posted on 2013-6-30 14:57

Therefore, we replicate it on
multiple remote machines and respond to a client operation
only after flushing the corresponding log record to disk
both locally and remotely. The master batches several log
records together before flushing, thereby reducing the impact
of flushing and replication on overall system throughput.
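The flush-then-acknowledge rule with batching can be sketched as follows. This is a minimal sketch, not the master's actual code: `ReplicaLog`, `BatchingOperationLog`, the size-based `batch_size` trigger, and the use of local files to stand in for remote machines are all assumptions made for illustration.

```python
import os
import tempfile


class ReplicaLog:
    """Hypothetical stand-in for one copy of the log (local or remote)."""

    def __init__(self, path):
        self.path = path

    def append_and_flush(self, records):
        # Append the whole batch, then force it to stable storage once.
        with open(self.path, "a", encoding="utf-8") as f:
            for record in records:
                f.write(record + "\n")
            f.flush()
            os.fsync(f.fileno())


class BatchingOperationLog:
    def __init__(self, replicas, batch_size=3):
        self.replicas = replicas
        self.batch_size = batch_size  # assumed trigger; the paper gives no number
        self.pending = []             # (record, callback) pairs not yet durable
        self.flush_count = 0

    def submit(self, record, on_durable):
        self.pending.append((record, on_durable))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        batch, self.pending = self.pending, []
        records = [record for record, _ in batch]
        for replica in self.replicas:
            replica.append_and_flush(records)
        self.flush_count += 1
        # Clients are answered only after every replica has flushed the batch.
        for record, on_durable in batch:
            on_durable(record)


def demo():
    with tempfile.TemporaryDirectory() as directory:
        replicas = [ReplicaLog(os.path.join(directory, name))
                    for name in ("local.log", "remote.log")]
        log = BatchingOperationLog(replicas, batch_size=3)
        acked = []
        log.submit("create /a", acked.append)
        log.submit("create /b", acked.append)
        acked_before_flush = list(acked)
        log.submit("create /c", acked.append)  # third record triggers the flush
        with open(replicas[1].path, encoding="utf-8") as f:
            remote_lines = f.read().splitlines()
        return acked_before_flush, acked, log.flush_count, remote_lines
```

In `demo`, the first two operations stay unacknowledged until the third fills the batch, so three client operations cost a single flush per replica.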

wangfans · posted on 2013-6-30 14:58

The master recovers its file system state by replaying the
operation log. To minimize startup time, we must keep the
log small. The master checkpoints its state whenever the log
grows beyond a certain size so that it can recover by loading
the latest checkpoint from local disk and replaying only the

wangfans · posted on 2013-6-30 14:58

limited number of log records after that. The checkpoint is
in a compact B-tree like form that can be directly mapped
into memory and used for namespace lookup without extra
parsing. This further speeds up recovery and improves
availability.

Table 1: File Region State After Mutation

wangfans · posted on 2013-7-1 17:02

Because building a checkpoint can take a while, the master’s
internal state is structured in such a way that a new
checkpoint can be created without delaying incoming mutations.
The master switches to a new log file and creates the
new checkpoint in a separate thread. The new checkpoint
includes all mutations before the switch. It can be created
in a minute or so for a cluster with a few million files. When
completed, it is written to disk both locally and remotely.
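The log switch followed by a background checkpoint might look roughly like this. It is a sketch under stated assumptions: `CheckpointingMaster` and its fields are hypothetical, in-memory lists stand in for log files and disks, and a deep copy under a lock stands in for whatever internal structure lets the real master snapshot its state without delaying mutations.

```python
import copy
import threading


class CheckpointingMaster:
    def __init__(self):
        self.lock = threading.Lock()
        self.state = {}        # path -> metadata (toy namespace)
        self.log_files = [[]]  # each log file is a list of (path, value) records
        self.checkpoints = []  # finished checkpoints, oldest first

    def mutate(self, path, value):
        with self.lock:
            # Log first, then apply, so the log always covers the state.
            self.log_files[-1].append((path, value))
            self.state[path] = value

    def start_checkpoint(self):
        with self.lock:
            # Assumption: the deep copy is a simplification of the real snapshot.
            snapshot = copy.deepcopy(self.state)
            # Switch to a new log file; later mutations land there.
            self.log_files.append([])
            first_uncovered_log = len(self.log_files) - 1
        worker = threading.Thread(
            target=self._write_checkpoint,
            args=(snapshot, first_uncovered_log),
        )
        worker.start()
        return worker

    def _write_checkpoint(self, snapshot, first_uncovered_log):
        # The slow serialization work would happen here, outside the lock.
        checkpoint = {"state": snapshot, "next_log": first_uncovered_log}
        with self.lock:
            self.checkpoints.append(checkpoint)
```

The checkpoint contains exactly the mutations made before the switch, and `next_log` records which log file recovery should start replaying from.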

wangfans · posted on 2013-7-1 17:03

Recovery needs only the latest complete checkpoint and
subsequent log files. Older checkpoints and log files can
be freely deleted, though we keep a few around to guard
against catastrophes. A failure during checkpointing does
not affect correctness because the recovery code detects and
skips incomplete checkpoints.
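Recovery from the latest complete checkpoint plus later log files can be sketched as below. The paper does not say how incomplete checkpoints are detected, so the trailing SHA-256 digest here is purely an assumption, as are the function names and the JSON encoding.

```python
import hashlib
import json


def encode_checkpoint(state, next_log_index):
    """Serialize a checkpoint as a JSON body followed by a digest line."""
    body = json.dumps({"state": state, "next_log": next_log_index}, sort_keys=True)
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return body + "\n" + digest


def decode_checkpoint(blob):
    """Return the parsed checkpoint, or None if it is incomplete or corrupt."""
    body, separator, digest = blob.rpartition("\n")
    if not separator:
        return None
    if hashlib.sha256(body.encode("utf-8")).hexdigest() != digest:
        return None
    return json.loads(body)


def recover(checkpoint_blobs, log_files):
    """Rebuild state from checkpoints (oldest first) and a list of log files.

    Each log file is a list of (path, value) records.
    """
    state, next_log = {}, 0
    # Walk backwards so the newest complete checkpoint wins.
    for blob in reversed(checkpoint_blobs):
        checkpoint = decode_checkpoint(blob)
        if checkpoint is not None:
            state, next_log = checkpoint["state"], checkpoint["next_log"]
            break
    # Replay only the log files written after that checkpoint.
    for log in log_files[next_log:]:
        for path, value in log:
            state[path] = value
    return state
```

A checkpoint truncated by a crash fails the digest check and is skipped, so recovery falls back to the previous checkpoint and replays a longer stretch of the log.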

wangfans · posted on 2013-7-1 17:03

2.7 Consistency Model
GFS has a relaxed consistency model that supports our
highly distributed applications well but remains relatively
simple and efficient to implement. We now discuss GFS’s
guarantees and what they mean to applications. We also
highlight how GFS maintains these guarantees but leave the
details to other parts of the paper.

wangfans · posted on 2013-7-1 17:03

2.7.1 Guarantees by GFS
File namespace mutations (e.g., file creation) are atomic.
They are handled exclusively by the master: namespace
locking guarantees atomicity and correctness (Section 4.1);
the master’s operation log defines a global total order of
these operations (Section 2.6.3).