|
My Take-Aways
Redesign your relational tables when moving to Cassandra -- ideally, aim for a single row per concept.
Our initial failure in access control table design is a great example of how tables need to be fundamentally redesigned when moved from a relational to a non-relational structure.
Networking failures that do not affect gossip will not be detected by Cassandra, these are your application's job.
Because our Cassandra nodes gossip only on PublicNet, it will not detect Private Network failures. This leaves it to the application to recover from this failure scenario. It might be possible to alleviate this by setting up the Cassandra cluster to accept queries on PublicNet over SSL and having the application servers only use this address in a fallback case (e.g. defining our own query policy similar to theDCAwareRoundRobinPolicy).
We may also investigate whether expanding the Cassandra cluster to use the dc_local_address in a YAML Network Topology Strategy file rather than the default NetworkTopologyStrategy property file would similarly cause gossip to fail in a private network failure scenario.
If you have any differing state between requests, health checks are lies.
If there is any shared state between different requests to the same thread in a process-based web server (such as mod_wsgi), there is a potential for health checks to be inaccurate, meaning that unhealthy services are not be properly marked as down. Ideally individual WSGI processes would be health-checkable, or threads that were unhealthy would somehow kill themselves. In contrast, with a single-threaded web server, application state will be consistent between every request, making health checks actually meaningful.
|
|