Application Failure Scenarios with Cassandra

jieforest · posted on 2014-3-10 16:49

My Take-Aways

Redesign your relational tables when moving to Cassandra -- ideally, aim for a single row per concept.

Our initial failure in access control table design is a great example of how tables need to be fundamentally redesigned when moved from a relational to a non-relational structure.

Networking failures that do not affect gossip will not be detected by Cassandra; handling them is your application's job.

Because our Cassandra nodes gossip only on PublicNet, the cluster will not detect Private Network failures. This leaves it to the application to recover from this failure scenario. It might be possible to alleviate this by setting up the Cassandra cluster to accept queries on PublicNet over SSL and having the application servers use this address only as a fallback (e.g. by defining our own query policy similar to the DCAwareRoundRobinPolicy).
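The fallback idea can be sketched at the application level. This is only a rough illustration, not the driver's query policy mechanism: it probes the private addresses first and falls back to the public ones when none respond. The names `tcp_probe` and `choose_contact_points` and the idea of probing with a plain TCP connect are assumptions made for this sketch.

```python
import socket
from typing import Callable, List, Sequence, Tuple

Endpoint = Tuple[str, int]  # (host, port)


def tcp_probe(endpoint: Endpoint, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to the endpoint can be opened."""
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:
        return False


def choose_contact_points(
    private: Sequence[Endpoint],
    public: Sequence[Endpoint],
    probe: Callable[[Endpoint], bool] = tcp_probe,
) -> List[Endpoint]:
    """Prefer reachable private endpoints; use public ones only as a fallback.

    Hypothetical helper: an empty result means nothing was reachable on
    either network, and the caller has to decide how to fail.
    """
    reachable = [ep for ep in private if probe(ep)]
    if reachable:
        return reachable
    return [ep for ep in public if probe(ep)]
```

The probe is injectable so the selection logic can be exercised without a network. A real query policy would also need to re-check the private network periodically so that traffic moves back once it recovers.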

We may also investigate whether expanding the Cassandra cluster to use the dc_local_address in a YAML Network Topology Strategy file rather than the default NetworkTopologyStrategy property file would similarly cause gossip to fail in a private network failure scenario.

If you have any differing state between requests, health checks are lies.

If there is any shared state between different requests to the same thread in a process-based web server (such as mod_wsgi), there is a potential for health checks to be inaccurate, meaning that unhealthy services are not properly marked as down. Ideally, individual WSGI processes would be health-checkable, or threads that were unhealthy would somehow kill themselves. In contrast, with a single-threaded web server, application state is consistent between every request, making health checks actually meaningful.
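To make the problem concrete, here is a minimal sketch of a WSGI health endpoint whose answer depends on state private to one worker. `WorkerState`, `session_ok`, `make_app`, and `call` are hypothetical names for this illustration; the two app instances stand in for two workers behind the same address.

```python
import os
from typing import Callable, Dict, Iterable, Tuple


class WorkerState:
    """State private to one worker (hypothetical), e.g. a wedged session."""

    def __init__(self) -> None:
        self.session_ok = True


def make_app(state: WorkerState) -> Callable:
    """Build a WSGI app whose health answer reflects only this worker."""

    def app(environ: Dict, start_response: Callable) -> Iterable[bytes]:
        ok = state.session_ok
        status = "200 OK" if ok else "503 Service Unavailable"
        body = f"pid={os.getpid()} session_ok={ok}\n".encode("utf-8")
        headers = [
            ("Content-Type", "text/plain"),
            ("Content-Length", str(len(body))),
        ]
        start_response(status, headers)
        return [body]

    return app


def call(app: Callable, path: str = "/health") -> Tuple[str, bytes]:
    """Invoke a WSGI app directly and return (status, body)."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status

    environ = {"REQUEST_METHOD": "GET", "PATH_INFO": path}
    body = b"".join(app(environ, start_response))
    return captured["status"], body
```

If a health check happens to be routed to the healthy worker, the service is reported as up even though requests landing on the broken worker will fail. Reporting the process ID in the body at least makes it visible which worker answered.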

Though I've talked here about the failures we've run into, we've had a great time with Cassandra in general -- there's been great help available in the community, and from the team in #datastax-drivers on Freenode in particular.

When I left academia four years ago, there was a lot of buzz around Cassandra as an exciting new technology; however, with its reliance on Thrift for querying, it seemed best suited to Java applications. With the introduction of CQL3 and drivers for general-purpose languages, it is a great tool for building high-availability applications across different datacenters.
