The Google File System

wangfans · posted on 2013-7-4 17:01

file data that is still incomplete from the application’s perspective.
In the other typical use, many writers concurrently append
to a file for merged results or as a producer-consumer
queue. Record append’s append-at-least-once semantics preserves
each writer’s output.

A reader can
identify and discard extra padding and record fragments
using the checksums. If it cannot tolerate the occasional
duplicates (e.g., if they would trigger non-idempotent operations),
it can filter them out using unique identifiers in
the records, which are often needed anyway to name corresponding
application entities such as web documents. These
functionalities for record I/O (except duplicate removal) are
in library code shared by our applications and applicable to
other file interface implementations at Google. With that,
the same sequence of records, plus rare duplicates, is always
delivered to the record reader.
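
The reader-side handling described above can be sketched as follows. This is a minimal illustration, not the record I/O library the text refers to: the record layout (a length and CRC32 header followed by a payload whose first 8 bytes are a unique ID), the byte-by-byte resynchronization, and the names `encode_record` and `read_records` are all assumptions made for the example.

```python
import struct
import zlib

HEADER = struct.Struct(">II")  # hypothetical layout: payload length, CRC32 of payload
ID_LEN = 8                     # hypothetical: first 8 payload bytes are a unique record ID


def encode_record(record_id: bytes, body: bytes) -> bytes:
    """Frame one record as header + payload (ID followed by body)."""
    assert len(record_id) == ID_LEN
    payload = record_id + body
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def read_records(data: bytes, dedupe: bool = True):
    """Yield (record_id, body) for every checksum-valid record in data.

    Bytes that do not form a valid record (padding, fragments) are
    skipped one byte at a time. Duplicates are dropped by ID when
    dedupe is True.
    """
    seen = set()
    pos = 0
    while pos + HEADER.size <= len(data):
        length, crc = HEADER.unpack_from(data, pos)
        start = pos + HEADER.size
        end = start + length
        payload = data[start:end]
        valid = (
            length >= ID_LEN
            and end <= len(data)
            and zlib.crc32(payload) == crc
        )
        if not valid:
            pos += 1  # not a record boundary: resynchronize
            continue
        record_id, body = payload[:ID_LEN], payload[ID_LEN:]
        pos = end
        if dedupe:
            if record_id in seen:
                continue  # duplicate left behind by a retried append
            seen.add(record_id)
        yield record_id, body
```

Passing `dedupe=False` mirrors the case where the application can tolerate occasional duplicates and only needs padding and fragments removed.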

3. SYSTEM INTERACTIONS
We designed the system to minimize the master’s involvement
in all operations. With that background, we now describe
how the client, master, and chunkservers interact to
implement data mutations, atomic record append, and snapshot.

3.1 Leases and Mutation Order
A mutation is an operation that changes the contents or
metadata of a chunk, such as a write or an append operation.
Each mutation is performed at all the chunk’s replicas.
We use leases to maintain a consistent mutation order across
replicas. The master grants a chunk lease to one of the replicas,
which we call the primary.

The primary picks a serial
order for all mutations to the chunk. All replicas follow this
order when applying mutations. Thus, the global mutation
order is defined first by the lease grant order chosen by the
master, and within a lease by the serial numbers assigned
by the primary.
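
The two-level ordering can be illustrated with a sort key. The sketch below assumes that lease grant order is represented by an integer `lease_epoch` that increases with each grant; the text does not say how the order is encoded, so that field and the class names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mutation:
    lease_epoch: int  # hypothetical: position of the lease in the master's grant order
    serial: int       # serial number assigned by the primary within that lease
    payload: str


def global_order(mutations):
    """Order by lease grant order first, then by the primary's serial number."""
    return sorted(mutations, key=lambda m: (m.lease_epoch, m.serial))


class Replica:
    """Applies mutations in the global order, whatever order they arrive in."""

    def __init__(self):
        self.log = []

    def apply_all(self, mutations):
        for m in global_order(mutations):
            self.log.append(m.payload)
```

Two replicas that receive the same set of mutations in different arrival orders end up with identical logs, which is the property the lease mechanism is meant to provide.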

The lease mechanism is designed to minimize management
overhead at the master. A lease has an initial timeout
of 60 seconds. However, as long as the chunk is being mutated,
the primary can request and typically receive extensions
from the master indefinitely. These extension requests
and grants are piggybacked on the HeartBeat messages regularly
exchanged between the master and all chunkservers.
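
A minimal sketch of lease bookkeeping on the master side, assuming time is passed in explicitly as `now` (in seconds) and that a heartbeat carries a list of chunk IDs whose leases the sender wants extended. The class and method names are hypothetical; only the 60-second timeout and the piggybacking on heartbeats come from the text.

```python
LEASE_TIMEOUT = 60.0  # seconds: the initial lease timeout stated in the text


class Lease:
    def __init__(self, chunk_id, primary, now):
        self.chunk_id = chunk_id
        self.primary = primary
        self.expires_at = now + LEASE_TIMEOUT

    def is_valid(self, now):
        return now < self.expires_at


class Master:
    def __init__(self):
        self.leases = {}  # chunk_id -> Lease

    def grant(self, chunk_id, replica, now):
        """Grant a new lease unless an unexpired one already exists."""
        current = self.leases.get(chunk_id)
        if current is not None and current.is_valid(now):
            return current
        lease = Lease(chunk_id, replica, now)
        self.leases[chunk_id] = lease
        return lease

    def on_heartbeat(self, sender, extend_requests, now):
        """Process extension requests piggybacked on a heartbeat.

        Returns the chunk IDs whose leases were extended.
        """
        extended = []
        for chunk_id in extend_requests:
            lease = self.leases.get(chunk_id)
            if lease and lease.primary == sender and lease.is_valid(now):
                lease.expires_at = now + LEASE_TIMEOUT
                extended.append(chunk_id)
        return extended
```

Because extensions ride on messages that are exchanged anyway, keeping a lease alive costs the master no extra round trips in this model.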

The master may sometimes try to revoke a lease before it
expires (e.g., when the master wants to disable mutations
on a file that is being renamed). Even if the master loses
communication with a primary, it can safely grant a new
lease to another replica after the old lease expires.
In Figure 2, we illustrate this process by following the
control flow of a write through these numbered steps.

1. The client asks the master which chunkserver holds
the current lease for the chunk and the locations of
the other replicas. If no one has a lease, the master
grants one to a replica it chooses (not shown).
2. The master replies with the identity of the primary and
the locations of the other (secondary) replicas. The
client caches this data for future mutations. It needs
to contact the master again only when the primary
becomes unreachable or replies that it no longer holds
a lease.

Figure 2: Write Control and Data Flow

3. The client pushes the data to all the replicas. A client
can do so in any order. Each chunkserver will store
the data in an internal LRU buffer cache until the
data is used or aged out. By decoupling the data flow
from the control flow, we can improve performance by
scheduling the expensive data flow based on the network
topology regardless of which chunkserver is the
primary. Section 3.2 discusses this further.