Architecture Summit 2008 - Write up

Sky-Tiger · posted on 2009-4-27 22:42

A couple of interesting points came up during this session:

Apple MacBooks outnumbered PCs almost 3 to 1.
Flickr deploys new code about 10 times per day. Cal: "Our last deployment was 8 minutes ago."

Sky-Tiger · posted on 2009-4-27 22:42

Data Management discussion

Day two of the meeting started with a discussion on data management patterns. We began by asking what brings people to look into distributed data management in the first place, and laid out three concrete options:

-Performance
-Latency
-Scalability

Our first thought was that all three points are equally important, but we quickly agreed that in reality the cost of addressing all of them at the same time may not be a good fit for everyone. For example, Memcached was designed to address performance and scalability while completely ignoring reliability and consistency. On the other hand, all the vendors in the room (GigaSpaces, Oracle, Terracotta) spent a lot of effort ensuring the reliability and consistency of their memory-based solutions. The adoption of both types of solutions indicates that people are willing to trade some functionality for cost or for other reasons, and that all-or-nothing reasoning doesn't seem to apply here.

John Davies raised another driver for moving to distributed data management: availability/reliability. When your data is not stored in one centralized location, the chances of total failure are smaller. This sparked a small debate on whether reliability is really one of the driving forces or just a feature. During this debate it became clear that most people have a slightly different definition in mind when they use the term reliability. John Purdy suggested a definition for reliability that was quickly accepted by everyone in the room:

Availability/Reliability can be broken down into the following properties:

* Durability
* Consistency
* Availability
* MTBF (mean time between failures)

After we agreed on the baseline of what brings people to look into distributed data management in the first place and a basic definition for availability, the discussion moved to cover other aspects of distributed data management.

What is the impact of distributed data management on latency?

* Affinity – in many cases the number of hops required to access the data has a strong impact on latency. Ensuring that the business logic is close to where the data is can reduce that overhead significantly. That statement holds true for any form of data management, distributed or centralized. It becomes more interesting when dealing with distributed management scenarios, since the data can be spread over the network and therefore you may have different latency for different sets of data. A common option to avoid this overhead is to execute the business logic where the data is. A service-oriented architecture can provide a good example of co-locating logic and data, where a service is responsible for both its data and the operations performed on it. An algorithmic trading application is another example – each node is responsible for very fast processing on a subset of the overall data.

Sky-Tiger · posted on 2009-4-27 22:42

* Latency under load – contention on the same lock, or a large number of concurrent users, increases the time it takes the data server to serve requests and hurts overall latency. Distributed data management can smooth out the impact of these two factors by spreading the load and the contention across multiple data partitions.

* Latency vs. consistency tradeoff – guaranteeing consistent, ordered operations requires serialization, which increases latency. There is an explicit tradeoff to be made here – one can improve latency, but at the cost of relaxing consistency.

* Latency and multiple replicas – for read operations, reading data in parallel from multiple replicas can improve latency, because you take advantage of the faster responders. (If you need to read from only k replicas in order to be sure you are reading the correct value, the overall latency is the latency of the k'th fastest replica, not of the slowest one.)
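
The last point can be made concrete with a small sketch. It assumes that all replicas are queried at the same moment and that the read completes once k replies have arrived; the function name and the response times are made up for illustration.

```python
def quorum_read_latency(replica_latencies_ms, k):
    """Latency of a parallel read that completes once k replicas have replied."""
    if not 1 <= k <= len(replica_latencies_ms):
        raise ValueError("k must be between 1 and the number of replicas")
    # All requests go out at the same time, so the read finishes when the
    # k-th fastest reply arrives.
    return sorted(replica_latencies_ms)[k - 1]


# Hypothetical per-replica response times in milliseconds.
latencies = [12.0, 3.0, 45.0, 7.0, 9.0]

print(quorum_read_latency(latencies, 1))  # 3.0  (fastest replica only)
print(quorum_read_latency(latencies, 3))  # 9.0  (third fastest)
print(quorum_read_latency(latencies, 5))  # 45.0 (must wait for the slowest)
```

With k = 3 the slow 45 ms replica never affects the read, which is the effect described in the parenthesis above.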

Is data distribution a leaky abstraction?

There have been many attempts to make the transition from a centralized data model to a distributed data model seamless through abstraction. There is a good chance that things won't work as expected at this level of abstraction – for example, join queries behave very differently depending on whether all the data sits in one centralized location or is spread across distributed data partitions. The same applies to any aggregate function such as SUM, AND, or any blocking operation.
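
As a rough illustration of why aggregates leak through the abstraction, the sketch below partitions a handful of made-up values and computes a total and a mean. Everything here (the partition layout, the function names, the use of a mean as the second aggregate) is an assumption for illustration, not something presented at the session. The point is that each partition can only return a partial result, and the partial results must be combined in a way that is correct for that particular aggregate.

```python
# Hypothetical data set spread unevenly over three partitions.
partitions = [
    [10, 20, 30],  # partition 0
    [5],           # partition 1
    [100, 200],    # partition 2
]


def partial_aggregate(rows):
    """Runs on one partition and returns a (sum, count) pair for its rows."""
    return sum(rows), len(rows)


def distributed_sum_and_mean(parts):
    """Scatter the aggregate to every partition, then combine the partials."""
    partials = [partial_aggregate(rows) for rows in parts]
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total, total / count


def naive_mean_of_means(parts):
    """Wrong for uneven partitions: every partition gets the same weight."""
    means = [sum(rows) / len(rows) for rows in parts]
    return sum(means) / len(means)


total, mean = distributed_sum_and_mean(partitions)
print(total)                            # 365
print(round(mean, 2))                   # 60.83
print(round(naive_mean_of_means(partitions), 2))  # 58.33
```

A total can simply be summed across partitions, but a mean cannot be averaged the same way. A developer who is unaware that the data is partitioned has no reason to suspect the difference.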

Hiding these details from the application can lead to bad design, which in many cases will only be discovered at later stages of application development. On the other hand, forcing explicit change on the application can also be a painful process. So the question is at which level an abstraction is useful and at what point it becomes "leaky". I suggested that the best measure is the chance of failure. If in 80% of the cases users are likely to choose the wrong option, that is a clear indication of a leaky abstraction. There are various ways to deal with that:

1. Avoid any abstraction and force explicit change in the application code.
2. Provide an abstraction, but throw a warning when there is a chance for bad use.
3. Provide an abstraction, but also explicit semantics (through annotations or special query semantics) for dealing with distributed data management, such as affinity semantics, semantics for parallel execution and map/reduce semantics.
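
To give a feel for the third option, here is a minimal in-process sketch of a partitioned store that keeps a simple put interface but also exposes explicit affinity and map/reduce calls. The class and method names are hypothetical and do not correspond to any vendor's API; the "partitions" are plain dictionaries standing in for remote nodes.

```python
import zlib


class PartitionedStore:
    """Toy stand-in for a partitioned data store (hypothetical API)."""

    def __init__(self, partition_count):
        self._partitions = [dict() for _ in range(partition_count)]

    def partition_for(self, routing_key):
        # crc32 gives the same result on every run, unlike hash() on strings.
        return zlib.crc32(routing_key.encode("utf-8")) % len(self._partitions)

    def put(self, routing_key, value):
        self._partitions[self.partition_for(routing_key)][routing_key] = value

    def execute_with_affinity(self, routing_key, task):
        """Run task against the one partition that owns routing_key."""
        return task(self._partitions[self.partition_for(routing_key)])

    def map_reduce(self, map_task, reduce_task):
        """Run map_task on every partition, then combine the results."""
        return reduce_task([map_task(p) for p in self._partitions])


store = PartitionedStore(partition_count=4)
store.put("acct-1", 100)
store.put("acct-2", 250)

# Affinity: the logic runs where the data for "acct-1" lives.
balance = store.execute_with_affinity("acct-1", lambda data: data["acct-1"])

# Map/reduce: each partition sums its own values, then the sums are combined.
total = store.map_reduce(lambda data: sum(data.values()), sum)

print(balance)  # 100
print(total)    # 350
```

The caller has to say which kind of operation it wants, so the cost of touching one partition versus all of them is visible in the code rather than hidden behind a query.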

Sky-Tiger · posted on 2009-4-27 22:42

In a follow-up discussion with John P. and Shay B., it seemed that adding such semantics (affinity, parallel, map/reduce) to JPA could serve as a good starting point for this kind of abstraction for Java developers.

What will be the database's role in future architecture?

Databases are largely valuable in providing logical-physical abstractions, and are probably not going to disappear from our world any time soon. At the same time, it is clear that databases are not going to serve as a general-purpose solution for all data requirements. Moving to distributed data management is becoming more common as the value of data grows on the one hand, and the need for scaling and faster processing of the data emerges on the other. Currently there are three main approaches to this challenge:

Distributed database (similar to Oracle RAC) – with this option the database is broken into partitions and exposes a single driver for interacting with all of them.

Distributed caching – in this case the database stays centralized, and a large part of the read load is offloaded by keeping the data in a read-mostly cache.

In-memory data grids – in this case a data grid is used as the system of record instead of the database. The database runs as a background service that keeps the data in durable storage and serves as a read-only access point.

All options require a change – a common fallacy is that if you're already using a database, choosing database partitioning is the most natural next step because it is "seamless". If you move a database from a centralized model to a distributed model, you will most likely need to redesign your schema and domain model, and then change all applications to conform to the new schema. Since in most cases the database is a central piece that serves lots of applications (online, reporting, legacy), this change is going to have a large impact.

Sky-Tiger · posted on 2009-4-27 22:42

On the other hand, caching seems to be a simpler tactical change that helps to optimize the existing system and allows you to apply the change only to the application that needs it most. In this case, the change is significantly smaller, because you only need to change the read queries that hit the database most.
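
The caching option can be sketched as follows. This is a generic read-mostly cache placed in front of a loader function, written under the assumption that only the hot read path is changed; the class name, the loader and the sample data are all hypothetical and not tied to Memcached or any other product.

```python
class ReadMostlyCache:
    """Serves repeated reads from memory; falls back to the database on a miss."""

    def __init__(self, load_from_database):
        self._load = load_from_database
        self._entries = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._entries:
            self.hits += 1
            return self._entries[key]
        # Cache miss: go to the database once and remember the answer.
        self.misses += 1
        value = self._load(key)
        self._entries[key] = value
        return value

    def invalidate(self, key):
        """Drop a cached entry after the underlying row has changed."""
        self._entries.pop(key, None)


# A dictionary stands in for the centralized database.
database = {"user:1": "alice", "user:2": "bob"}
cache = ReadMostlyCache(lambda key: database[key])

for _ in range(3):
    cache.get("user:1")

print(cache.hits, cache.misses)  # 2 1
```

Only the first of the three reads reaches the database. Writes still go to the database directly, which is why this works best for read-mostly data.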

A data grid goes a step further and makes it possible to completely decouple the application from the database, while at the same time keeping the database in sync with all changes made to the data grid. This solution is good for read/write scenarios and for cases where most of the frequently used data can reside in memory.
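
One common way to keep the database in sync in the background is write-behind: writes are applied in memory first and pushed to the database later. The session notes do not say which mechanism each vendor uses, so the sketch below is only an assumption about how such a setup might look, with hypothetical names and a dictionary standing in for the database.

```python
from collections import deque


class WriteBehindGrid:
    """In-memory system of record that syncs changes to a database later."""

    def __init__(self, database):
        self._memory = {}
        self._pending = deque()    # changes not yet written to the database
        self._database = database  # any dict-like durable store

    def put(self, key, value):
        # The write completes as soon as memory is updated.
        self._memory[key] = value
        self._pending.append((key, value))

    def get(self, key):
        return self._memory[key]

    def flush(self):
        """Apply queued changes to the database in order; returns the count."""
        written = 0
        while self._pending:
            key, value = self._pending.popleft()
            self._database[key] = value
            written += 1
        return written


database = {}
grid = WriteBehindGrid(database)
grid.put("order:1", "NEW")
grid.put("order:1", "FILLED")

print(grid.get("order:1"))  # FILLED (served from memory)
print(database)             # {} (nothing written yet)
print(grid.flush())         # 2
print(database)             # {'order:1': 'FILLED'}
```

In a real deployment the flush would run on a background thread or timer, and the pending queue would need to survive a node failure, which is where the reliability work mentioned earlier comes in.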

Every choice among these options has trade-offs and costs associated with it. In reality you'll rarely go through the process of examining all the options and the cost of each one. For example, choosing MySQL over Oracle will not always be cheaper if you already have Oracle in place, trained DBAs, etc. The license cost is only part of the TCO and not always the most significant part. The same applies to Memcached or any other open source solution. Unfortunately, people rarely apply these cost measures in their decision-making process (especially when dealing with free-license products). Applying them makes it much more likely that you will discover, sooner rather than later, that what seems to be the natural solution might not actually be the best suited to your needs compared to the alternatives.

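freedom2k · posted on 2009-4-27 23:43

Posting this much every night, how many people actually read it?

freedom2k · posted on 2009-4-27 23:44

Lucky red envelope. 29 PUB coins!

justforregister · posted on 2009-4-28 23:49

My head is spinning.

justforregister · posted on 2009-4-28 23:51

Originally posted by freedom2k on 2009-4-27 23:43:
Posting this much every night, how many people actually read it?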