Application Failure Scenarios with Cassandra

My application recently finished switching from a master-slave database configuration to Apache Cassandra (version 2.0.4). This took my team around four months and involved rewriting every part of our application that touched the database, as well as migrating all existing data (we managed to do all of this without downtime!).

mysql dead. hail cassandra

— Dave (@tildedave) February 21, 2014

The availability guarantees provided by Cassandra were the main motivation for this switch: the database has no single point of failure. The week after we finished our switch, I had to fill out a business continuity plan: our application failure scenario went from the loss of a single node (our MySQL master) to the simultaneous unrecoverable loss of three complete datacenters. Combined with a Global Traffic Manager setup across our three datacenters, Cassandra will allow our site to remain operational even if two of them fail, all without the loss of customer data.

Of course, even though the database has no single point of failure, that doesn't mean that your application stays up. It's no good having a highly available database if you only have one edge device handling all incoming traffic. During our database migration, we ran into a number of application-level failure scenarios -- some we induced ourselves through our test scenarios, while others happened to us in our lower environments (staging, preprod, test) during the slow march to production. Each of these failures was due not to a failure of the database, but to the failure of components that we added on top of it.


In this article, I'll go over some of the different failure scenarios we ran into, along with how we solved them (or in some cases, plan to solve them). Much of the work described here was done by my team and not by me personally. Though I eventually got into writing some of the code, I served mainly as an engineering manager on this project -- while I found some of these issues, the resolutions were mostly handled by others on the team.

Querying Cassandra with the Datastax Python Cassandra Driver

The parts of our application that talk to the database are written in Python: we have application servers written in both Django and Twisted -- however, given that we don't use SQL any more, the project has been quickly shedding its 'Djangoisms'.

There are a number of Python Cassandra drivers out there; however, many (for example, pycassa) didn't support CQL3, a SQL-like language for querying Apache Cassandra that was introduced in the Cassandra 1.2 release. From reading around online, CQL3 was clearly the "new way" for code to query Cassandra (as opposed to using Thrift directly). The Datastax Python cassandra-driver was an exciting new project with CQL3 support which handled a number of failure scenarios "out of the box". Though the driver was beta software when we started (it has since had a 1.0.0 release, which we run in production today), we determined that it was the best choice for Cassandra integration.

There are some Getting Started docs, and one of our team members wrote a guide for setting up SSL. The basic concept is that a thread creates a Session object which handles communication with Cassandra, sending queries and returning results.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy
from cassandra.query import SimpleStatement

CASSANDRA = {
    'NAME': 'mykeyspace',
    'HOSTS': ['192.168.3.4', '192.168.3.5', '192.168.3.2'],
    'DATACENTER': 'ORD',
    'PORT': 9042
}

c = Cluster(contact_points=CASSANDRA['HOSTS'],
            port=CASSANDRA['PORT'],
            load_balancing_policy=DCAwareRoundRobinPolicy(local_dc=CASSANDRA['DATACENTER']))

session = c.connect(keyspace=CASSANDRA['NAME'])

# CQL string literals take single quotes (double quotes denote identifiers)
rows = session.execute("SELECT name FROM users WHERE username='davehking'")

# Statements are better when you need to pass values in
address_query = SimpleStatement('SELECT address FROM users WHERE username=%(username)s',
                                consistency_level=ConsistencyLevel.LOCAL_QUORUM)

rows = session.execute(address_query,
                       parameters={'username': 'davehking'})

# Prepared Statements are even better when you execute a query a lot;
# they use ? placeholders and take positional parameters
prepared_address_query = session.prepare('SELECT address FROM users WHERE username=?')
prepared_address_query.consistency_level = ConsistencyLevel.LOCAL_QUORUM

rows = session.execute(prepared_address_query, ('davehking',))

Our Cassandra Topology

We run 15 Cassandra nodes in production: 5 per datacenter, 3 datacenters, with 3 replicas of all data per datacenter. We use the NetworkTopologyStrategy and have the Cassandra nodes gossip over their public network interfaces (secured with SSL).
# cassandra-topology.properties

# DFW CASSANDRA NODES:
106.82.141.183=DFW:RAC1
89.192.56.21=DFW:RAC1
73.138.104.236=DFW:RAC1
134.10.31.175=DFW:RAC1
134.84.228.145=DFW:RAC1

# ORD CASSANDRA NODES:
63.241.224.74=ORD:RAC1
74.254.172.14=ORD:RAC1
43.157.11.149=ORD:RAC1
238.152.78.73=ORD:RAC1
89.133.93.162=ORD:RAC1

# SYD CASSANDRA NODES:
17.196.144.150=SYD:RAC1
206.174.71.86=SYD:RAC1
3.44.230.66=SYD:RAC1
31.199.140.105=SYD:RAC1
199.57.119.242=SYD:RAC1

# cassandra.yaml

# The address to bind the Thrift RPC service and native transport
# server -- clients connect here.
#
rpc_address: 192.168.4.8

# Address to bind to and tell other Cassandra nodes to connect to. You
# _must_ change this if you want multiple nodes to be able to
# communicate!
#
listen_address: 43.157.11.149

We use Rackspace Cloud Networks for querying Cassandra within a datacenter. While a Cassandra node gossips on its public interface 17.196.144.150 (listen_address in cassandra.yaml), it receives queries on its private interface 192.168.4.3 (rpc_address in cassandra.yaml). Inside a datacenter, using isolated private networks rather than the Rackspace-provided ServiceNet is preferable for both security and speed.

Application Failure Scenarios

With the context around our setup (both from an application and cluster perspective) clear, I'll now describe the issues that we ran into. Though I consider our setup fairly standard (NetworkTopologyStrategy, mod_wsgi for Python execution), we ran into several deep issues that were a natural result of these choices.

Cassandra Doesn't Care If Your rpc_address is Down

This split between rpc_address and listen_address became a problem, as the two interfaces could fail independently. During our testing, we ran into situations (though infrequent) where Cloud Servers couldn't communicate over their private interface but were still able to communicate over their public interface. In this case, the Cassandra nodes were still receiving gossip events over their public interface (listen_address) and so were not marked as down -- however, the nodes were unqueryable over their private interface (rpc_address).

The current version of the Datastax Python cassandra-driver properly handles the case where a Cassandra node is down on its listen_address: the Cassandra cluster sends a STATUS_CHANGE message that the host is down, and the driver updates its list of 'good' hosts, removing that node's rpc_address from the list of queryable nodes. However, in the event that the rpc_address becomes unusable, no STATUS_CHANGE message is sent, and queries on that Session fail with an error from Cassandra: OperationTimedOut.

Our workaround was to re-establish the session when this timeout exception is thrown and retry the query. The newly created session identifies that the partially failed node is unreachable, so it is not included as a candidate host to query, and subsequent queries succeed.
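
A minimal sketch of this workaround might look like the following; build_session() and execute_with_retry() are illustrative names rather than our exact production code, and the CASSANDRA settings dict is the one from the earlier snippet:

from cassandra import OperationTimedOut
from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy

def build_session():
    # Same connection settings as before; a fresh Cluster/Session forces the
    # driver to re-discover which hosts are actually queryable right now.
    cluster = Cluster(contact_points=CASSANDRA['HOSTS'],
                      port=CASSANDRA['PORT'],
                      load_balancing_policy=DCAwareRoundRobinPolicy(
                          local_dc=CASSANDRA['DATACENTER']))
    return cluster.connect(keyspace=CASSANDRA['NAME'])

session = build_session()

def execute_with_retry(query, parameters=None):
    global session
    try:
        return session.execute(query, parameters)
    except OperationTimedOut:
        # The node is still gossiping on its listen_address, so no
        # STATUS_CHANGE arrives, but its rpc_address is dead. A new session
        # notices the unreachable node and won't route queries to it.
        session.cluster.shutdown()
        session = build_session()
        return session.execute(query, parameters)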

Customer-Controlled Tombstones Became a Denial-of-Service

The first table we converted from MySQL was one that handled 'locking out' an account after too many failed logins. Our MySQL solution was a fairly straightforward relational one: on a failed attempt, insert a record into the database. On a login attempt, query the database to count the failed-attempt rows -- if the count exceeded a specific threshold, the login would be denied, and that attempt would itself be recorded as a failure and inserted into the database.

A direct translation of this into Cassandra, with a TTL to auto-expire failed login attempts after a 'cooling down' period, ran into an issue where a single customer (me, using ab in our test environment) was able to cause a denial of service for all logins by creating a large number of garbage records. Once the records for these failed logins had expired, all queries against this table started timing out. The Cassandra logs pointed us to exactly why:
WARN [ReadStage:99] 2013-12-06 19:42:52,406 SliceQueryFilter.java (line 209) Read 10001 live and 99693 tombstoned cells (see tombstone_warn_threshold)
ERROR [ReadStage:155] 2013-12-06 20:04:49,876 SliceQueryFilter.java (line 200) Scanned over 100000 tombstones; query aborted (see tombstone_fail_threshold)
By converting our relational table directly into a Cassandra column family, we inadvertently introduced a Cassandra anti-pattern, which broke all logins with a TombstoneOverwhelmingException. As a result, we rewrote our login throttling logic so that it no longer generates a row per failed login. I think this points to a general principle for column family design: if you're dealing with TTLed data, different customer actions should not generate different rows.
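
To make the contrast concrete, here is a rough sketch of the two shapes. The table names, columns, and 15-minute TTL are made up for illustration and are not our actual schema; the second version is just one way of following the 'no row per customer action' principle:

# Hypothetical schemas for illustration -- not our actual column families.

# The direct translation: one row per failed attempt, expired by TTL.
# Every expired attempt leaves a tombstone that the per-user count query
# has to scan past, which is what tripped tombstone_fail_threshold.
session.execute("""
    CREATE TABLE IF NOT EXISTS failed_logins (
        username     text,
        attempt_time timeuuid,
        PRIMARY KEY (username, attempt_time)
    )
""")

def record_failure_per_row(username):
    session.execute(
        "INSERT INTO failed_logins (username, attempt_time) "
        "VALUES (%(username)s, now()) USING TTL 900",   # made-up 15-minute window
        parameters={'username': username})

# A bounded alternative in the spirit of our rewrite: a single TTLed row per
# user holding the failure count, so expiry produces one tombstone per user
# rather than one per attempt.
session.execute("""
    CREATE TABLE IF NOT EXISTS login_failures (
        username      text PRIMARY KEY,
        failure_count int
    )
""")

def record_failure_single_row(username):
    rows = session.execute(
        "SELECT failure_count FROM login_failures WHERE username = %(username)s",
        parameters={'username': username})
    count = rows[0].failure_count if rows else 0
    # Read-modify-write is racy under concurrency; good enough for a sketch.
    # Each write refreshes the cooling-down window for that user.
    session.execute(
        "INSERT INTO login_failures (username, failure_count) "
        "VALUES (%(username)s, %(count)s) USING TTL 900",
        parameters={'username': username, 'count': count + 1})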

Partial Failures are worse than Total Failures

In contrast to bad column family design (something we honestly expected to fail at, given our newness with the technology), the last set of issues we ran into was due to improper management of the Session object that handles the connection to Cassandra.

Creating a Session object requires the application node to communicate with the Cassandra cluster to determine which nodes are up and which are down. On average, creating this session took about 1 second, sometimes spiking up to 4 seconds; in contrast, once a session has been created, our most common database queries finish in less than 10ms. That's a factor-of-100 difference.

Because of this time disparity, reusing the Session across different requests is key to a performant application. Our Django servers run on Apache through mod_wsgi: each WSGI process reuses one Cassandra Session. We use the maximum-requests setting, which kills each WSGI process after it has served a certain number of requests and spawns a new one. The new process creates its own Session object and uses it for its lifetime. (In contrast, our Twisted servers are single-threaded and create one Session that is used throughout the lifetime of the process -- generally, until it is restarted as part of our code deploy process.)
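
As a sketch of what that per-process reuse looks like under mod_wsgi (module and function names are illustrative, and the real code also handles configuration and SSL), the module below is imported once per WSGI process, so every request served by that process shares the single Session it lazily creates, again reusing the CASSANDRA settings dict from earlier:

from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy

_session = None

def get_session():
    global _session
    if _session is None:
        # Pay the ~1 second connection cost once per process lifetime,
        # not once per request.
        cluster = Cluster(contact_points=CASSANDRA['HOSTS'],
                          port=CASSANDRA['PORT'],
                          load_balancing_policy=DCAwareRoundRobinPolicy(
                              local_dc=CASSANDRA['DATACENTER']))
        _session = cluster.connect(keyspace=CASSANDRA['NAME'])
    return _session

# e.g. in a Django view:
# rows = get_session().execute("SELECT name FROM users WHERE username=%(u)s",
#                              parameters={'u': 'davehking'})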

Just as our Cassandra nodes sometimes lost private network connectivity (bringing down the rpc_address while keeping the listen_address up), our application nodes also sometimes experienced failures in private network connectivity. In this case, every host in a mod_wsgi process's Session object would be marked down, eventually terminating the event loop for the control connection. This meant that every subsequent query with this Session would fail with a NoHostAvailable exception; the Session object never recovers and needs to be recreated.
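
A partial mitigation -- again a sketch, building on the hypothetical get_session() helper above, and not something that fixes the health-check problem described next -- is to throw the poisoned Session away whenever NoHostAvailable escapes, so the process can rebuild a working one once its private network comes back:

from cassandra.cluster import NoHostAvailable

def execute(query, parameters=None):
    global _session
    try:
        return get_session().execute(query, parameters)
    except NoHostAvailable:
        # Every host is marked down and the control connection's event loop
        # has terminated; this Session will never recover, so discard it and
        # let the next request build a fresh one.
        if _session is not None:
            _session.cluster.shutdown()
            _session = None
        raise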

What makes this failure scenario worse is that, for our Django application servers, it is only a partial failure of the service. Some WSGI processes are healthy, meaning that they can query Cassandra and get results (notably, load balancer health checks will still sometimes succeed), while others are sick, throwing Internal Server Errors on every attempt to query the database.

Though we use mod_wsgi with Apache, this failure scenario applies to any other approach where different processes respond to requests on the same network interface.


When our Twisted servers have suffered private network failure, their health checks have failed completely, since there is only one thread of operation and so only one Session object. With a failing health check, the service is taken out of rotation at the load balancer.

It's inconvenient that we have to manually restart them (though we could certainly automate that); however, they aren't serving any errors. I prefer this total failure without customer impact to a half-failure where application servers continue to serve 500s.

We haven't found the best solution for this issue yet. We're thinking of investigating history-aware health checks to get a better sense of whether or not a whole service is "sick". My preferred solution, rewriting our Django servers in single-threaded Twisted Web, is probably too much work to justify the result.
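
As an illustration of the history-aware idea (purely a sketch of what we might investigate, not something we run today), a health-check endpoint could consult a rolling window of recent request outcomes and only report a process as sick once the error rate crosses a threshold, rather than reacting to a single failed probe:

import collections
import time

class RequestHistory(object):
    def __init__(self, window_seconds=60, max_error_rate=0.5):
        # Thresholds are arbitrary placeholders for illustration.
        self.window_seconds = window_seconds
        self.max_error_rate = max_error_rate
        self.events = collections.deque()  # (timestamp, succeeded) pairs

    def record(self, succeeded):
        now = time.time()
        self.events.append((now, succeeded))
        # Drop outcomes that have fallen out of the rolling window.
        while self.events and self.events[0][0] < now - self.window_seconds:
            self.events.popleft()

    def healthy(self):
        if not self.events:
            return True
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / float(len(self.events)) <= self.max_error_rate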
