Evaluating Apache Cassandra as a Cloud Database

jieforest · posted 2012-7-23 00:46

Solving the Cloud Mixed-Workload Problem

A primary benefit that DataStax Enterprise offers enterprises needing smart big data management is its ability to service real-time, analytic, and enterprise search workloads in the same database cluster without any one workload impacting the others. The key to making this possible is the underlying architecture of Cassandra.

jieforest · posted 2012-7-23 00:48

Hadoop Analytics in the Cloud

Built into DataStax Enterprise is an enhanced Hadoop distribution that utilizes Cassandra for many of its core services. DataStax Enterprise provides integrated Hadoop MapReduce, Hive, Pig, Mahout, and job/task tracking capabilities, replacing Hadoop’s HDFS storage layer with Cassandra (CassandraFS).

The end product is a single integrated solution that offers greater reliability, simpler deployment, and lower total cost of ownership (TCO) than a traditional Hadoop solution. DataStax Enterprise is also fully compatible with existing HDFS, Hadoop, and Hive tools and utilities.

Another benefit of using Hadoop in DataStax Enterprise is that it eliminates the complexity and single points of failure of the typical Hadoop HDFS layer. From an operational standpoint, there is no need to set up a Hadoop name node, secondary name node, ZooKeeper, and so on.

Instead, DataStax Enterprise provides a single layer in which every node is a peer of the others and automatically knows its position in the cluster. On startup, all DataStax Enterprise nodes automatically start a Hadoop task tracker, and one of the nodes is elected to be the job tracker.

If the job tracker node fails, the job tracker is automatically restarted on a different node. DataStax Enterprise utilizes full data locality awareness for Hadoop task assignment.
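The failover and locality behavior described above can be illustrated with a small toy model. This is only a sketch of the idea, not DataStax Enterprise's actual election or scheduling logic: the Cluster class, the node names, and the "first live node in sorted order" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Cluster:
    """Toy model: every live node is a peer; exactly one hosts the job tracker."""
    nodes: List[str]
    down: Set[str] = field(default_factory=set)
    job_tracker: Optional[str] = None

    def live_nodes(self) -> List[str]:
        return [n for n in self.nodes if n not in self.down]

    def elect_job_tracker(self) -> str:
        # Hypothetical rule: choose the first live node in sorted order.
        live = self.live_nodes()
        if not live:
            raise RuntimeError("no live nodes left to host the job tracker")
        self.job_tracker = sorted(live)[0]
        return self.job_tracker

    def fail(self, node: str) -> None:
        self.down.add(node)
        # Restart the job tracker elsewhere only if the failed node hosted it.
        if node == self.job_tracker:
            self.elect_job_tracker()

    def assign_task(self, replica_holders: List[str]) -> str:
        # Data locality: prefer a live node that already stores the data,
        # and fall back to any live node otherwise.
        live = self.live_nodes()
        if not live:
            raise RuntimeError("no live nodes left to run the task")
        for node in replica_holders:
            if node in live:
                return node
        return live[0]


cluster = Cluster(nodes=["node-a", "node-b", "node-c"])
cluster.elect_job_tracker()   # node-a hosts the job tracker first
cluster.fail("node-a")        # the job tracker moves to another live node
print(cluster.job_tracker)
```

In this sketch, a task whose data lives on node-a and node-c would be assigned to node-c once node-a is down, which is the kind of locality-aware placement the text refers to.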

jieforest · posted 2012-7-23 00:50

Search With Solr in the Cloud

DataStax Enterprise includes strong enterprise search support via Lucene and Apache Solr. Coming from the Apache Lucene project, Solr is the most popular open source enterprise search platform in use today.

Solr’s primary features include robust full-text search, hit highlighting, faceted search, rich document (e.g., PDF, Microsoft Word) handling, and geospatial search.

By integrating Solr into the DataStax Enterprise big data platform, DataStax extends Solr’s capabilities and overcomes a number of shortcomings of native Solr, such as:

• Lack of data durability: community Solr has no write-ahead log, so data can be lost if a node crashes. No chance of data loss exists with Solr in DataStax Enterprise.

• Solr’s write bottleneck: all writes go through a single master. With DataStax Enterprise, users can read from and write to any Solr node in the cluster.

• Manual replication and sharding, which require careful planning for scaling and failover. DataStax Enterprise supplies automatic sharding with no single point of failure.

• Manual re-indexing of data. In DataStax Enterprise, indexes can be rebuilt automatically.

• Index writes that cannot span multiple data centers: community Solr has only a single master that replicates via rsync. In DataStax Enterprise, writes to search indexes in different data centers are merged together (i.e., writes can occur anywhere).

• Rigid index maintenance: in DataStax Enterprise, Solr indexes can be dropped, recreated, or rebuilt on the fly, unlike in native Solr.

jieforest · posted 2012-7-23 00:51

In essence, just as DataStax Enterprise takes Hadoop and delivers a fault-tolerant, dynamically scalable Hadoop/analytics system with no single point of failure, it automatically does the same for Solr and enterprise search operations.

Using Cassandra as the underlying foundation, DataStax Enterprise allows search data to be written to any participating search node in a DataStax Enterprise cluster. New search nodes can be added online to increase both fault tolerance and performance, with near-linear gains.

Those currently using Solr will feel at home with DataStax Enterprise, as the solution is 100 percent Solr compatible, with all Solr utilities, APIs, and so on, included.
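If the Solr HTTP API is available as described, an ordinary Solr select request should look familiar. The sketch below only builds the request URL and does not contact a server. The query parameters (q, wt, facet, facet.field, hl, hl.fl) are standard Solr parameters; the host, the port, the core name demo.articles, and the field names are hypothetical placeholders, and the exact URL layout on a given DataStax Enterprise installation is an assumption to verify against its documentation.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, core, text, facet_field, highlight_field):
    """Build a Solr /select URL with faceting and hit highlighting enabled."""
    params = {
        "q": text,                  # full-text query
        "wt": "json",               # response format
        "facet": "true",            # enable faceted search
        "facet.field": facet_field,
        "hl": "true",               # enable hit highlighting
        "hl.fl": highlight_field,
    }
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Hypothetical host, core, and field names, used for illustration only.
url = build_solr_query_url(
    "http://localhost:8983", "demo.articles", "body:cassandra", "author", "body"
)
print(url)
```

The resulting URL can be passed to any HTTP client; keeping URL construction separate from the network call makes the request easy to inspect before it is sent.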

jieforest · posted 2012-7-27 11:39

A Complete Big Data Platform for the Cloud

A key benefit of DataStax Enterprise is the tight feedback loop it creates between real-time applications and the analytics and search operations that naturally follow. Traditionally, users were forced either to move data between systems via complex ETL processes or to perform both functions on the same system, with the risk of one impacting the other. In big data environments, this process can be time-consuming and burdensome.

With DataStax Enterprise, real-time, analytic, and search operations take place in the same distributed system, but users can dedicate certain nodes solely to analytics or search so that those workloads don’t slow down real-time processing. Users simply define one or more replica groups and configure the role of each: Cassandra, Hadoop, HDFS-only (i.e., HDFS without a job/task tracker), or search/Solr. Writes are instantly replicated between all nodes.
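The idea of replica groups with dedicated roles can be sketched as a toy model. Nothing here is DataStax Enterprise configuration syntax: the group names, node names, and helper functions are hypothetical, and copying every write to every node is a deliberate simplification of real replication.

```python
from typing import Dict, List

# Hypothetical replica groups, each configured with a single role.
REPLICA_GROUPS = {
    "realtime": {"role": "cassandra", "nodes": ["rt-1", "rt-2"]},
    "analytics": {"role": "hadoop", "nodes": ["an-1"]},
    "search": {"role": "solr", "nodes": ["se-1"]},
}

# Which role serves which kind of workload (an assumption for this sketch).
WORKLOAD_TO_ROLE = {
    "realtime": "cassandra",
    "analytics": "hadoop",
    "search": "solr",
}


def nodes_for_workload(workload: str) -> List[str]:
    """Return only the nodes whose group role matches the workload."""
    role = WORKLOAD_TO_ROLE[workload]
    return [
        node
        for group in REPLICA_GROUPS.values()
        if group["role"] == role
        for node in group["nodes"]
    ]


def replicate_write(key: str, value: str, stores: Dict[str, Dict[str, str]]) -> None:
    """Copy a write to every node in every group (simplified replication)."""
    for group in REPLICA_GROUPS.values():
        for node in group["nodes"]:
            stores.setdefault(node, {})[key] = value


stores: Dict[str, Dict[str, str]] = {}
replicate_write("user:1", "alice", stores)

# Analytics jobs run only on analytics nodes, yet see the same data.
for node in nodes_for_workload("analytics"):
    print(node, stores[node]["user:1"])
```

The point of the sketch is the separation of concerns: reads for each workload are routed to their own group, while the data itself is shared, so no ETL step is needed between groups.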

With DataStax Enterprise, users truly have the best of all worlds for big data management. They have all the power of Cassandra serving their highest-volume, highest-velocity real-time applications; the power of Hadoop, Hive, and Pig working directly against the same data for analytics; and Solr for enterprise search in the same distributed database. The result is smart workload isolation for big data applications that is much simpler to manage and more reliable than any alternative.

jieforest · posted 2012-7-28 06:47

Figure 3: DataStax Enterprise – real-time, analytic, and search data in one cloud database

jieforest · posted 2012-7-28 06:47

Visual Database Management

DataStax Enterprise includes a visual, browser-based management solution named OpsCenter Enterprise to manage and monitor cloud database deployments. OpsCenter Enterprise allows a developer or administrator to manage and monitor the health of cloud databases from a centralized web console.

jieforest · posted 2012-7-28 06:48

Figure 4: OpsCenter Enterprise database cluster ring view

jieforest · posted 2012-7-28 06:49

OpsCenter Enterprise uses an agent-based architecture to monitor and carry out tasks on each node in a DataStax Enterprise cluster. Through an intuitive, graphical point-and-click interface, a user can see the state of a cluster, which nodes are up or down, and what level of performance users are experiencing. Key events are reported to a centralized dashboard and displayed along with other vital statistics.

jieforest · posted 2012-7-28 06:51

Figure 5: OpsCenter dashboard

Analytic operations can also be monitored and controlled from within OpsCenter Enterprise:

Figure 6: OpsCenter analytic operations monitoring