Handling Big Data with HBase

jieforest · posted 2013-12-17 15:34

Or do you? Maybe you are working on an environment monitoring project that will deploy a network of sensors around the world, and all these sensors will produce huge amounts of data.

Or maybe you are working on DNA sequencing. If you know or suspect that you will have massive data storage requirements, where the number of rows runs into the billions and the number of columns potentially into the millions, you should consider alternative databases such as HBase.

These new databases are designed from the ground up to scale horizontally across clusters of commodity servers, as opposed to vertical scaling, where you keep buying the next larger server until no bigger one is available.

Enter HBase

HBase is a database that provides real-time, random read and write access to tables meant to store billions of rows and millions of columns. It is designed to run on a cluster of commodity servers and to automatically scale as more servers are added, while retaining the same performance.

In addition, it is fault tolerant precisely because data is divided across servers in the cluster and stored in a redundant file system such as the Hadoop Distributed File System (HDFS).

When (not if) servers fail, your data is safe, and the data is automatically re-balanced over the remaining servers until replacements are online. HBase is a strongly consistent data store; changes you make are immediately visible to all other clients.
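To picture how data can be divided across servers and re-balanced when one fails, here is a minimal sketch of range partitioning by key. It is only a toy model, not how HBase is implemented: the class `RangePartitionSketch`, the server names, and the split points are all made up for illustration, and it ignores the redundant storage layer entirely.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model only: this is NOT HBase code, and all names here are hypothetical.
class RangePartitionSketch {
    // Start key of each key range -> server currently responsible for that range.
    private final NavigableMap<String, String> rangeStarts = new TreeMap<>();

    void assign(String startKey, String server) {
        rangeStarts.put(startKey, server);
    }

    // The owning range is the one with the greatest start key <= the given key.
    String serverFor(String key) {
        Map.Entry<String, String> entry = rangeStarts.floorEntry(key);
        if (entry == null) {
            throw new IllegalStateException("no range covers key: " + key);
        }
        return entry.getValue();
    }

    // Hand every range owned by a failed server to a surviving one.
    void reassign(String failedServer, String replacement) {
        rangeStarts.replaceAll((start, server) ->
                server.equals(failedServer) ? replacement : server);
    }

    public static void main(String[] args) {
        RangePartitionSketch cluster = new RangePartitionSketch();
        cluster.assign("", "server-a");   // keys from "" up to "g"
        cluster.assign("g", "server-b");  // keys from "g" up to "p"
        cluster.assign("p", "server-c");  // keys from "p" onward

        System.out.println(cluster.serverFor("kiwi"));  // prints "server-b"
        cluster.reassign("server-b", "server-a");
        System.out.println(cluster.serverFor("kiwi"));  // prints "server-a"
    }
}
```

Adding a server in this picture just means splitting a range and assigning one half to the newcomer, which is the intuition behind horizontal scaling.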

jieforest · posted 2013-12-17 15:35

HBase is modeled after Google's Bigtable, which was described in a paper written by Google in 2006 as a "sparse, distributed, persistent multi-dimensional sorted map."

If you are used to relational databases, HBase will seem foreign at first. It has the concept of tables, but they are not like relational tables, and HBase does not support typical RDBMS features such as joins, secondary indexes, or multi-row ACID transactions.

But even though you give those features up, you automatically and transparently gain scalability and fault-tolerance. HBase can be described as a key-value store with automatic data versioning.
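One way to picture "a key-value store with automatic data versioning" is a sorted map whose values are themselves maps from timestamp to value. The sketch below uses only JDK collections; it is not the HBase client API, and the class `VersionedStoreSketch` and its methods are hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model only: this is NOT the HBase client API.
class VersionedStoreSketch {
    // key -> (timestamp -> value), with the newest timestamp sorted first.
    private final NavigableMap<String, NavigableMap<Long, byte[]>> cells = new TreeMap<>();

    // A put adds a new version rather than overwriting the previous one.
    void put(String key, long timestamp, byte[] value) {
        cells.computeIfAbsent(key, k -> new TreeMap<Long, byte[]>(Comparator.reverseOrder()))
             .put(timestamp, value);
    }

    // Return the newest version, or null if the key has never been written.
    byte[] get(String key) {
        NavigableMap<Long, byte[]> versions = cells.get(key);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    // Return the newest version written at or before the given timestamp, or null.
    byte[] getAsOf(String key, long timestamp) {
        NavigableMap<Long, byte[]> versions = cells.get(key);
        if (versions == null) {
            return null;
        }
        // Timestamps are in descending order, so the "ceiling" of a timestamp
        // is the largest stored timestamp that is <= the requested one.
        Map.Entry<Long, byte[]> entry = versions.ceilingEntry(timestamp);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        VersionedStoreSketch store = new VersionedStoreSketch();
        store.put("sensor-1:temp", 100L, "20.5".getBytes(StandardCharsets.UTF_8));
        store.put("sensor-1:temp", 200L, "21.0".getBytes(StandardCharsets.UTF_8));

        // prints "21.0" (latest version)
        System.out.println(new String(store.get("sensor-1:temp"), StandardCharsets.UTF_8));
        // prints "20.5" (version as of timestamp 150)
        System.out.println(new String(store.getAsOf("sensor-1:temp", 150L), StandardCharsets.UTF_8));
    }
}
```

A real store would also limit how many versions it keeps; that detail is left out here.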

You can CRUD (create, read, update, and delete) data just as you would expect. You can also perform scans of HBase table rows, which are always stored in ascending order by row key.
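Because rows are kept sorted, a scan amounts to walking a contiguous slice of a sorted map. Here is a rough sketch using a JDK `TreeMap`; the class `SortedScanSketch` is hypothetical, the half-open range (start inclusive, stop exclusive) is a convention chosen for this sketch, and Java's `String` ordering is used as a stand-in for byte-wise ordering, which matches only for ASCII keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model only: this is NOT the HBase client API.
class SortedScanSketch {
    // Row key -> row contents; TreeMap keeps the keys in ascending order.
    private final NavigableMap<String, String> rows = new TreeMap<>();

    void put(String rowKey, String value) {
        rows.put(rowKey, value);
    }

    // Return the row keys in [startRow, stopRow), in ascending order.
    List<String> scan(String startRow, String stopRow) {
        return new ArrayList<>(rows.subMap(startRow, true, stopRow, false).keySet());
    }

    public static void main(String[] args) {
        SortedScanSketch table = new SortedScanSketch();
        for (String key : new String[] {"sensor-2", "sensor-10", "sensor-1", "sensor-02"}) {
            table.put(key, "reading");
        }
        // Insertion order does not matter; '~' sorts after the ASCII digits.
        // prints [sensor-02, sensor-1, sensor-10, sensor-2]
        System.out.println(table.scan("sensor-", "sensor-~"));
    }
}
```

Note that "sensor-10" sorts before "sensor-2" because the comparison is character by character, not numeric. If you want numeric order from keys like these, zero-pad the numbers to a fixed width.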

When you scan through HBase tables, rows are always returned in order by row key. Each row consists of a unique, sorted row key (think primary key in RDBMS terms) and an arbitrary number of columns, each column residing in a column family and having one or more versioned values. Values are simply byte arrays, and it's up to the application to transform these byte arrays as necessary to display and store them.
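Since values are just byte arrays, the application has to pick an encoding and stick to it. The HBase client library ships a `Bytes` utility class for this purpose; the sketch below deliberately uses only the JDK so it runs on its own, and the class `ByteCodecSketch` and its method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Minimal sketch of application-side encoding to and from byte arrays.
class ByteCodecSketch {
    // Strings are stored as UTF-8 bytes.
    static byte[] encodeString(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static String decodeString(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Longs are stored as 8 bytes, most significant byte first (big-endian).
    static byte[] encodeLong(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    static long decodeLong(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong();
    }

    public static void main(String[] args) {
        byte[] label = encodeString("temperature");
        byte[] millis = encodeLong(1387294440000L);

        System.out.println(decodeString(label));  // prints "temperature"
        System.out.println(decodeLong(millis));   // prints 1387294440000
        System.out.println(millis.length);        // prints 8
    }
}
```

The store has no idea whether a given array holds a string or a number, so reading a value with the wrong decoder silently produces garbage. Keeping the encode and decode pair together in one place helps avoid that.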

HBase does not attempt to hide this column-oriented data model from developers, and its Java APIs are decidedly lower-level than other persistence APIs you might have worked with. For example, JPA (Java Persistence API) and even JDBC are much more abstracted than the HBase APIs. You are working close to the bare metal when dealing with HBase.

jieforest · posted 2013-12-17 15:35

Conclusion to Part 1

In this introductory blog we've learned that HBase is a non-relational, strongly consistent, distributed key-value store with automatic data versioning. It scales horizontally by adding servers to a cluster, and it provides fault tolerance so data is not lost when (not if) servers fail.

We've also touched on how data is organized within HBase tables: each row has a unique row key, some number of column families, and an arbitrary number of columns within each family. In the next blog, we'll take our first steps with HBase by interacting with it through the HBase shell.

jieforest · posted 2013-12-17 15:35

References

HBase web site, http://hbase.apache.org/
HBase wiki, http://wiki.apache.org/hadoop/Hbase
HBase Reference Guide, http://hbase.apache.org/book/book.html
HBase: The Definitive Guide, http://bit.ly/hbase-definitive-guide
Google Bigtable Paper, http://labs.google.com/papers/bigtable.html
Hadoop web site, http://hadoop.apache.org/
Hadoop: The Definitive Guide, http://bit.ly/hadoop-definitive-guide
Fallacies of Distributed Computing, http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing
HBase lightning talk slides, http://www.slideshare.net/scottleber/hbase-lightningtalk
Sample code, https://github.com/sleberknight/basic-hbase-examples
