Archive for October, 2011
I’ve spent the last couple of weeks trying to figure out how to design a fairly large system that needs to deal with hundreds of millions of objects and tens of thousands of transactions per second (both reads and writes). That kind of throughput is hard to do with a traditional RDBMS, although there are apparently some people that manage. The problem is those tricks seem to be very much about getting amazing performance out of single database nodes, and of course, if you run out of tricks when the load increases, you’re screwed. What I’m looking for is something that is more or less guaranteed to scale. In this particular case, availability and partition-tolerance are not very important as such, but consistency and scalable throughput are, as is the ability to recover reasonably quickly from disasters. Scalability and throughput is what you get from NoSQL systems, so that would seem to be the way to go.
The problem with NoSQL is of course the relaxation of consistency. Objects are replicated to different nodes, and this may lead to updates that create conflicting versions. These conflicts need to be detected and resolved. And that is hard to do. Detection is usually a bit easier than resolution, but there is a lot of variation in when it happens. Here are a few relatively common options for conflict resolution (that is, how to figure out the correct value when a conflict has been detected):
- Automatically using some version of ‘last writer wins’, perhaps augmented using vector clocks or something similar to increase the likelihood of being correct. This is only a likelihood, though, and at least for our current case, this is insufficient.
- Let readers resolve conflicts by giving them multiple versions of the data, if there have been conflicts (this is how Amazon’s Dynamo works, by the way).
- Make it possible for writers to resolve conflicts by presenting them with the current versions in case there are conflicts, or using something like a conditional put.
All of those are hard to do in a way that is correct, algorithmically efficient and doesn’t waste space by storing lots of transaction history. But there seems to be some hope: two of the most interesting papers I’ve read are (co-)authored by the same guy, Pat Helland. If I understand him right, he argues for a type of solution that is subtly different from all the NoSQL solutions I’ve come across so far. The salient points, in my opinion, are:
- Separate the system into two layers, a scale-agnostic business layer that knows how to process messages, but is completely and utterly unaware of the scale at which the system is running, and a scale-aware layer that has no idea what business logic is executed but that knows how many nodes are running and how data is distributed across the different nodes.
- Ensure that all business-level events triggered by messages are Associative, Commutative and Idempotent (adding Distributed into the mix, he calls this ACID2.0, I’ll use ACI for the first three letters from now on). This means that if two nodes have seen the same messages, they will have the same view of the world, irrespective of the order in which they arrive, and whether they received one or more copies of a particular message.
- Ensure that the scale-aware layer guarantees at-least once delivery of messages. That is, a sender can rely on the fact that a message will always arrive at the right destination, even in the face of failures, repartitioning of data, etc. The only way to do that is to occasionally have messages arrive more than once.
With a system that follows those principles, conflicts don’t happen, so you don’t need to resolve them. Think about that for a second – associativity and commutativity means that message processing is order-independent. Idempotence means that if you receive the same message twice, the second time has no effect. This means that any differences in world-view (that is, data stored locally) between two different nodes will always be resolved as soon as they have seen the same set of messages, which is essentially the definition of eventually consistent. At-least-once message delivery guarantees that all nodes that should receive a certain message will do so, sooner or later.
The key difference between Pat Helland’s architecture and today’s NoSQL solutions is the level at which world-view coordination is done. As far as I can understand, all current NoSQL systems coordinate at the data level by making sure that bits and bytes are propagated to the right receivers in the clusters. The problem is that once those bits and bytes differ, it is very hard to understand why they differ and what to do about the conflict. There’s no generic way, at the data level, to make messages ACI, but it can be done at the business logic level. The data-replicating systems don’t and can’t realistically preserve the business-level events that led to the changes in data.
It is hard to design a system whose operations are all ACI, but it is also hard to design a system that is guaranteed to resolve conflicts correctly. In the case I’m currently working on, it’s been far easier to figure out how to make business-level events associative, commutative and idempotent rather than deal with the conflicts – I feel my thinking about these concepts is still very naive, so I have no knowledge of whether that translates to most other problems or not. In fact, I should probably say I’m not sure that it is actually the case for our system either as it’s not been implemented yet. Ask me again in six months’ time. :)
It certainly does feel like replicating at the data level makes things harder, and instead replicating business-level messages that trigger ACI operations would make for a more natural architecture. The closest I’ve seen to a system that does that is SQLFire, which is in fact where I first found the reference to the “Life Beyond Distributed Transactions” paper. SQLFire uses the entity concept from “Life Beyond Distributed Transactions”, but as far as I can tell (their documentation isn’t great right now, and renders very poorly on Chrome), SQLFire, too, appears to do replication on the data level as opposed to the level of business messages. I’d be very interested in seeing how a NoSQL solution would pan out that just provided the scale-aware distribution layer that Pat Helland mentions. You wouldn’t even necessarily have to include the data stores in such a solution – that could be done in RDBMS:s for each node if you want those semantics, or using in-memory caches, or whatever. Maybe we’ll develop such a system in this project – highly unlikely as I don’t think that would be money well spent. That it would be fun is of course not a good-enough reason…