Distributed database Hypertable 0.9.7.6 released

jieforest · posted on 2013-5-31 12:30

ACCESS GROUPS

Access Groups provide a way to control the physical storage of column data to optimize disk I/O. Access Groups are defined in the table schema and instruct Hypertable to physically store all data for columns within the same access group together on disk. This feature allows you to optimize queries for frequently accessed columns by reducing the amount of data transferred from disk during query execution. Disk I/O is limited to just the data from the access groups of the columns specified in the query. For example, consider the following schema.

CREATE TABLE User (
name,
address,
photo,
profile,
ACCESS GROUP default (name, address, photo),
ACCESS GROUP profile (profile)
);

jieforest · posted on 2013-6-5 10:00

Hypertable will create two physical groupings of column data, one for the name, address, and photo columns, and another for the profile column. The following diagram illustrates this physical grouping.

jieforest · posted on 2013-6-5 10:00

Consider the following query for the profile column of the User table.

SELECT profile FROM User;

The execution of this query will be efficient because only the data for the profile column will be transferred from disk during query execution.
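
As a rough illustration of why this query is cheap, the following Python sketch models the User schema as a mapping from access group to columns and works out which groups a query has to read. The dictionary layout and the function name `groups_to_read` are hypothetical and only mirror the schema shown earlier; they are not Hypertable internals.

```python
# Toy model of the User schema: access group name -> columns stored together.
# This mirrors the CREATE TABLE statement in the post; it is not Hypertable code.
ACCESS_GROUPS = {
    "default": ["name", "address", "photo"],
    "profile": ["profile"],
}


def groups_to_read(query_columns, access_groups=ACCESS_GROUPS):
    """Return the sorted names of the access groups holding any queried column."""
    wanted = set(query_columns)
    return sorted(
        group
        for group, columns in access_groups.items()
        if wanted.intersection(columns)
    )


print(groups_to_read(["profile"]))          # only the 'profile' group is touched
print(groups_to_read(["name", "profile"]))  # both groups must be read
```

In this model, a query that names only `profile` never touches the group that holds the (presumably large) `photo` column, which is the saving the text describes.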

jieforest · posted on 2013-6-5 10:01

RANGESERVER INSERT HANDLING

The following diagram illustrates how inserts are handled inside the RangeServer.

jieforest · posted on 2013-6-5 10:01

Step 1: Commit Log - Inserts are appended to the Commit log, which resides in the distributed filesystem (DFS), followed by a sync operation that tells the filesystem to persist any buffered writes to disk. If multiple insert requests are pending, or a GROUP_COMMIT_INTERVAL is configured for the table, the sync operation is performed after multiple Commit log appends to improve throughput.

Step 2: Add to map - The inserts are added to the in-memory CellCache (equivalent to the Memtable in the Bigtable paper).

Step 3: Acknowledge - Acknowledgement is sent back to the application.

Background Maintenance Threads - Over time, as the CellCaches fill memory, background maintenance threads "spill" the in-memory CellCache data to on-disk CellStore files. This frees up memory inside the RangeServer so that it can accept more inserts.

This design makes Hypertable writes durable and consistent because inserts are not acknowledged until the Commit log has been successfully written to.
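
The three steps can be sketched in Python as follows. This is a minimal model under stated assumptions: the class `ToyRangeServer` and its method `insert_batch` are hypothetical, the commit log is a plain list standing in for a DFS file, and the sync is represented by a counter rather than a real filesystem call.

```python
class ToyRangeServer:
    """Illustrative model of the three insert steps; not Hypertable code."""

    def __init__(self):
        self.commit_log = []   # stands in for the commit log file in the DFS
        self.synced_upto = 0   # number of log entries treated as persisted
        self.sync_calls = 0    # how many sync operations were issued
        self.cell_cache = {}   # in-memory map (the CellCache)

    def insert_batch(self, requests):
        """Handle all pending insert requests with one group commit."""
        # Step 1: append every pending request to the commit log, then sync once.
        for cells in requests:
            self.commit_log.append(list(cells))
        self.synced_upto = len(self.commit_log)
        self.sync_calls += 1
        # Step 2: after the sync, add the cells to the in-memory CellCache.
        for cells in requests:
            for key, value in cells:
                self.cell_cache[key] = value
        # Step 3: acknowledge each request only now that the log is synced.
        return ["ack"] * len(requests)


server = ToyRangeServer()
acks = server.insert_batch([[("row1", "a")], [("row2", "b")]])
print(acks, server.sync_calls)  # two acknowledgements, one sync
```

Batching two requests behind a single sync is the group-commit idea from Step 1: the cost of the sync is shared, while no request is acknowledged before its log entry is covered by a sync.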

jieforest · posted on 2013-6-5 10:02

RANGESERVER QUERY HANDLING

The following diagram illustrates how queries are handled inside the RangeServer.

Data for a range can reside in the in-memory CellCache as well as in some number of on-disk CellStores (see following section). To evaluate a query over a table range, the RangeServer must create a unified view of the data. It does this with a MergeScanner object, which merges the sorted key/value pairs coming from the CellCache and CellStores. This unified stream of key/value pairs is then filtered to produce the desired results.
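
The merge-then-filter idea can be sketched with Python's standard `heapq.merge`, which lazily merges already-sorted inputs. The function name `merge_scan` and the simplified keys (plain row strings) are assumptions for illustration; real Hypertable keys carry more components, and this sketch does not model versions or deletes.

```python
import heapq


def merge_scan(cell_cache, cell_stores, key_predicate):
    """Merge sorted (key, value) streams and keep the pairs whose key matches.

    Every input must already be sorted by key, as the text states for the
    CellCache and the CellStores.
    """
    merged = heapq.merge(cell_cache, *cell_stores, key=lambda cell: cell[0])
    return [(key, value) for key, value in merged if key_predicate(key)]


cell_cache = [("row2", "b"), ("row5", "e")]                    # in memory
cell_stores = [[("row1", "a"), ("row4", "d")], [("row3", "c")]]  # on disk
print(merge_scan(cell_cache, cell_stores, lambda key: key >= "row2"))
```

Because each source is already sorted, the merge never has to sort the combined data; it only compares the current head of each stream.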

jieforest · posted on 2013-6-6 09:41

CELLSTORE FORMAT

Over time, the RangeServers will write in-memory CellCaches to on-disk files, called CellStores, whose format is shown in the illustration to the right. The following describes the sections of the CellStore file format.

Compressed blocks of cells (key/value pairs) - This section consists of a series of sorted blocks of compressed, sorted key/value pairs. By default, the compressed blocks are approximately 64KB in size. This size can be controlled by the Hypertable.RangeServer.CellStore.DefaultBlockSize property. These blocks are the minimum unit of data transfer from disk.

Bloom Filter - After the compressed blocks of key/value pairs comes the bloom filter. This is a probabilistic data structure that describes the keys that exist (with high likelihood) in the CellStore. It also signals if a key is definitely not present, which helps the RangeServer avoid unnecessary block transfer and decompression.

Block Index - After the bloom filter comes the block index.  This index lists, for each block, the last key in the block followed by the block offset.

Trailer - At the end of the CellStore is the trailer.  The trailer contains general statistics about the CellStore and includes the version number of the CellStore format so that the RangeServer can interpret it correctly.
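
To make the roles of the bloom filter and the block index concrete, here is a small Python sketch of a point lookup. It is a toy under stated assumptions, not the real file format: `ToyBloomFilter` and `ToyCellStore` are hypothetical names, blocks are JSON lists compressed with zlib, the index also stores each block's length for convenience, and the trailer is omitted.

```python
import bisect
import hashlib
import json
import zlib


class ToyBloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def may_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class ToyCellStore:
    """Compressed blocks, a Bloom filter and a block index (toy layout)."""

    def __init__(self, sorted_cells, cells_per_block=2):
        self.data = b""
        self.block_index = []  # (last key in block, offset, length)
        self.bloom = ToyBloomFilter()
        self.blocks_read = 0
        for start in range(0, len(sorted_cells), cells_per_block):
            block = sorted_cells[start:start + cells_per_block]
            payload = zlib.compress(json.dumps(block).encode())
            self.block_index.append((block[-1][0], len(self.data), len(payload)))
            self.data += payload
            for key, _ in block:
                self.bloom.add(key)

    def get(self, key):
        if not self.bloom.may_contain(key):
            return None  # definitely absent: no block is read or decompressed
        last_keys = [entry[0] for entry in self.block_index]
        i = bisect.bisect_left(last_keys, key)  # first block whose last key >= key
        if i == len(last_keys):
            return None  # key sorts after the last block
        _, offset, length = self.block_index[i]
        self.blocks_read += 1
        block = json.loads(zlib.decompress(self.data[offset:offset + length]))
        return dict(block).get(key)


store = ToyCellStore([("a", "1"), ("b", "2"), ("c", "3"), ("d", "4"), ("e", "5")])
print(store.get("c"), store.blocks_read)  # one block decompressed for the hit
```

Storing the last key of each block lets a binary search pick the single block that could hold a key, so at most one block is transferred and decompressed per lookup in this model.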

jieforest · posted on 2013-6-6 09:41

QUERY ROUTING

The following diagram illustrates the data structures that support the query routing algorithm, which determines how queries get sent to the relevant RangeServers.

jieforest · posted on 2013-6-6 09:41

METADATA Table

There exists a special table in Hypertable called the METADATA table that contains a row for each range in the system. There is a column, Location, that indicates which RangeServer is currently serving the range. Though the diagram shows IP addresses in the Location column, the system stores a proxy name for the RangeServer in that column so that the system can be run on public clouds such as Amazon's EC2 and operate correctly in the face of server restarts and IP address changes. A two-level hierarchy is overlaid on top of the METADATA table. The first range is the ROOT range, which contains pointers to the second-level ranges. These, in turn, contain pointers to the USER ranges, which are the ranges that make up regular user- or application-defined tables.
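
The two-level walk can be sketched in Python as two binary searches. Everything concrete here is made up for illustration: the `table:end_row` key layout is an assumption of this sketch, and the range names (`meta-A`, `meta-B`), proxy names (`rs1` to `rs3`) and function names are hypothetical.

```python
import bisect

# Each level is a sorted list of (end_row, payload); a range covers every key
# up to and including its end_row. All names and values below are invented.
ROOT = [("User:m", "meta-A"), ("User:\xff", "meta-B")]
SECOND_LEVEL = {
    "meta-A": [("User:f", "rs1"), ("User:m", "rs2")],
    "meta-B": [("User:t", "rs3"), ("User:\xff", "rs1")],
}


def find_entry(sorted_ranges, key):
    """Return the payload of the first range whose end_row >= key."""
    end_rows = [end_row for end_row, _ in sorted_ranges]
    return sorted_ranges[bisect.bisect_left(end_rows, key)][1]


def locate(table, row):
    """Walk ROOT -> second-level METADATA range -> RangeServer proxy name."""
    key = f"{table}:{row}"
    metadata_range = find_entry(ROOT, key)                # level 1: ROOT range
    return find_entry(SECOND_LEVEL[metadata_range], key)  # level 2: METADATA range


print(locate("User", "alice"))  # rs1
print(locate("User", "henry"))  # rs2
```

The lookup returns a proxy name rather than an IP address, matching the point in the text that the Location column stays valid across restarts and address changes.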

Client Library

The Client Library provides the application programming interface (API) that allows an application to talk to Hypertable. This library is linked into each Hypertable application and handles query routing. The client library includes a METADATA cache, which contains the range location information obtained by walking the METADATA hierarchy. Most application range location requests are served directly out of this cache. The ThriftBroker, which provides a high-level language interface to Hypertable, links against the client library and is a long-lived process, so its METADATA cache is usually fresh and populated. For this reason, we recommend that short-lived applications (e.g. CGI programs) use the Thrift interface to avoid having to walk the METADATA hierarchy for each request.
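
The benefit of a long-lived cache can be sketched in Python as below. This is only an illustration: `RangeLocationCache`, its `walk_metadata` callback and the `fake_walk` stand-in are hypothetical names, and the sketch does not model cache invalidation when ranges move or split.

```python
import bisect


class RangeLocationCache:
    """Caches (start_row, end_row] -> location so repeat lookups skip the walk."""

    def __init__(self, walk_metadata):
        # walk_metadata(table, row) -> (start_row, end_row, location); start_row
        # is exclusive and end_row inclusive. It stands in for the hierarchy walk.
        self.walk_metadata = walk_metadata
        self.ranges = {}  # table -> sorted list of (end_row, start_row, location)
        self.walks = 0    # number of times the hierarchy had to be walked

    def locate(self, table, row):
        cached = self.ranges.setdefault(table, [])
        i = bisect.bisect_left(cached, (row,))  # first range with end_row >= row
        if i < len(cached) and cached[i][1] < row:
            return cached[i][2]                 # cache hit
        self.walks += 1                         # cache miss: walk the hierarchy
        start_row, end_row, location = self.walk_metadata(table, row)
        bisect.insort(cached, (end_row, start_row, location))
        return location


def fake_walk(table, row):
    """Stand-in for the METADATA walk: two ranges split at row 'm'."""
    return ("", "m", "rs1") if row <= "m" else ("m", "\xff", "rs2")


cache = RangeLocationCache(fake_walk)
for row in ["alice", "bob", "zoe", "carol"]:
    cache.locate("User", row)
print(cache.walks)  # four lookups, but only two walks
```

A short-lived process starts with an empty cache every time and pays for the walk on each run, which is the cost the ThriftBroker recommendation is meant to avoid.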

jieforest · posted on 2013-6-6 09:42

ADAPTIVE MEMORY ALLOCATION

The following diagram illustrates how the RangeServer adapts its memory usage based on changes in workload.

Under a write-heavy workload, the RangeServer will give more memory to the CellCaches so that they can grow as large as possible, which minimizes the amount of spilling and merging work required. Under a read-heavy workload, the system gives most of the memory to the block cache, which significantly improves query throughput and latency.
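
The text does not give the actual allocation policy, so the following Python sketch shows only one plausible shape of such a policy. The function `split_memory`, the proportional rule and the `min_share` floor are all assumptions made for illustration, not Hypertable's algorithm.

```python
def split_memory(total_bytes, reads, writes, min_share=0.1):
    """Split a memory budget between the CellCaches and the block cache.

    Hypothetical policy: the CellCache share follows the write fraction of
    recent operations, clamped so neither side drops below min_share.
    """
    total_ops = reads + writes
    write_fraction = 0.5 if total_ops == 0 else writes / total_ops
    cell_cache_share = min(max(write_fraction, min_share), 1.0 - min_share)
    cell_cache_bytes = int(total_bytes * cell_cache_share)
    return {
        "cell_cache": cell_cache_bytes,
        "block_cache": total_bytes - cell_cache_bytes,
    }


print(split_memory(8 * 2**30, reads=100, writes=900))  # write-heavy workload
print(split_memory(8 * 2**30, reads=900, writes=100))  # read-heavy workload
```

The floor keeps a small block cache under pure write load and a small CellCache under pure read load, so the server can still respond when the workload shifts.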
