Posts Tagged nosql
I’m currently working on an interesting project that has some, from my perspective, pretty large-scale storage requirements – hundreds of millions of entities, around 1TB of storage volume, and needing to concurrently support thousands of writes and tens of thousands of reads per second. Reads are fine, but the number of writes is beyond what a conventional RDBMS can comfortably deal with. Some people are apparently able to do ~1000-1500 writes/second or even more, but in our current ecosystem, the biggest number we’ve been able to get to is about 400/second. Plus writes tend to get slower as table sizes grow, and we’re talking hundreds of millions of entries. So in the beginning of the project, we were looking at alternatives.
Our first stop, of course, was NoSQL databases. We checked out a fair few of them over a period of a month or two: HBase, MongoDB, Riak, etc. We spent some time getting comfortable with solutions that would permit us the guarantees we wanted in terms of not losing data or transactions in a world of eventual consistency. We came up with algorithms that we believed would work, but they made us unhappy because they were so complicated. If we could stick to ACID transactions our algorithms would be simpler and our lives would be so much easier. So we started to look for technologies that would help us do that, and found options like Gemfire and various sharding technologies for MySQL. We considered rolling our own sharding solution. And eventually, via a detour at Clustrix, we came across VoltDB, which is the solution we’re currently in production with (for those not familiar with VoltDB and too lazy to click on the link: it’s a clusterable in-memory datastore that optionally provides ACID transactions). This post describes what we found during our evaluation and what we’ve found in the months since when we’ve been using it in development and production.
I’ll admit it; I quickly fell in love with VoltDB. The first thing that caught my attention was an aspect of the architecture. Based on research involving the founder Michael Stonebraker, they concluded that traditional RDBMSes spend (or waste) most of their time on things that are due to disk IO and concurrency. So they removed disk IO and concurrency from the critical path. There’s obviously still disk IO in VoltDB, and they deal with a lot of aspects of concurrency. But the IO that gives durability happens before the transactions and is done in a way that is append-only rather than random-seek, which makes it way faster than mid-transaction IO (this is what Martin Fowler calls a memory image). And since the transactions themselves require no disk IO, they execute quickly, meaning that serialising them simplifies the execution logic without incurring a significant penalty due to lack of concurrency. It’s a solution that feels fundamentally right to me.
The second really good thing is the management interface/s. By that I mean that any administrative task I have done has felt simple and intuitive, whether it was done through their management tool (the one that comes with the Enterprise Edition) or through the command line. So installing, upgrading, setting up the domain model, monitoring the data store, everything has been possible to do with very little need to refer to documentation. That promises a low cost of operation and shows good understanding of how to engineer a data store that works in practical use.
Third, another thing the people at VoltDB have got right is developer feedback. The best example of how not to do error messages is Oracle, which excels in giving cryptic error messages without sufficient detail to understand where you went wrong. VoltDB, in contrast, will tell you as early as possible what you’ve done wrong. So, for instance, if your stored procedure tries to access a table that doesn’t exist, you get a compile-time error with all the information you need:
[java] 2011-11-25 13:01:24,124 INFO [VoltLog4jLogger.java:92] [UpsertAtom.class]: Compiling Statement: UPDATE atom SET label = ? WHERE atom_id = ? [java] 2011-11-25 13:01:24,199 INFO [VoltLog4jLogger.java:92] [UpsertAtom.class]: Compiling Statement: INSERT INTO atom VALUES (?, ?) [java] 2011-11-25 13:01:24,219 INFO [VoltLog4jLogger.java:92] [UpsertAtom.class]: Compiling Statement: SELECT * FROM atoms WHERE atom_id = ? [java] 2011-11-25 13:01:24,219 ERROR [VoltLog4jLogger.java:92] [UpsertAtom.class]: Failed to plan for statement type(SELECT) SELECT * FROM atoms WHERE atom_id = ? Error: "user lacks privilege or object not found: ATOMS" [java] 2011-11-25 13:01:24,219 ERROR [VoltLog4jLogger.java:83] Failed to compile XML [java] 2011-11-25 13:01:24,219 ERROR [VoltLog4jLogger.java:92] Failed to plan for statement type(SELECT) SELECT * FROM atoms WHERE atom_id = ? Error: "user lacks privilege or object not found: ATOMS" [java] 2011-11-25 13:01:24,219 ERROR [VoltLog4jLogger.java:92] Catalog compilation failed.
Fourth, the performance is amazing. A single-node VoltDB installation on my machine does about 85k writes/second, and the write speed doesn’t decrease noticeably with increasing data sizes. And that’s while the same machine is generating the load. I’m using an iMac with 8GB RAM and an i7 processor (quad core), so it’s not exactly low-end but certainly nowhere near top of the line. On our 3-node evaluation cluster, with a configuration that does synchronous command logging (needed for full durability, which we require) and storing one replica of each object (k-safety = 1), we were doing about 60k writes/second most of the time. Those machines are Dual Quad-Core AMD Opteron 2.1GHz with 32GB of RAM each. The bottleneck is probably the fsyncing to disks not optimised for fast fsync, or possibly the network configuration which is far from ideal – the machines should all be on the same switch, but they are in fact on two different ones, and I’m not even sure how many hops there are between the switches. One thing that slows the average throughput down to much less than 60k/second is the fact that we’re using the same physical disk for taking snapshots as for writing command logs and this means that when snapshots are taken, write performance goes down to about 2k/second:
Finally, in all our interactions with VoltDB, they’ve been both very pleasant, professional and exceptionally fast. Most of our issues were resolved very quickly – usually within hours, always within days, and I can’t think of a single one taking more than a week. This includes a couple of issues that were weird (in the sense of “hard to troubleshoot”) mistakes entirely on our side.
The first thing I noticed that I didn’t like about VoltDB was that they used an abstract class with some core methods being flagged as final as a crucial client class, VoltTableRow. This meant I couldn’t write this code to unit test something I wrote myself:
// mock and when are static imports from Mockito VoltTableRow row = mock(VoltTableRow.class); when(row.getLong(0)).thenReturn(1L); when(row.getString(1)).thenReturn("the atom"); assertThat(mapper.mapToEntity(row), equalTo(new Atom(new Id<Atom>(1), "the atom")));
In fact, I haven’t yet found a reasonable way of mocking or subclassing VoltTableRow to be able to use it in unit tests. Also, when I looked closer at the source code, I felt it coupled clients a little too closely to implementation details in the VoltDB client code. When I questioned this on the forums, I quickly got a response from VoltDB suggesting that maybe interfaces would have been a better choice, and since I agreed, I forgot about this issue. But it did leave a little aftertaste.
However, I got back to work, and once we had concluded our first bigger test where we inserted the 230M entries, I was back in love with VoltDB. In fact, that successful experiment made me feel we were about done with the evaluation, and that we didn’t need to continue looking at the other technologies we had slotted for a deeper evaluation. And of course, that’s when we hit our first serious problem. The cluster went down about 15 minutes after concluding the test (while I was on a conference call bragging about the results), and on trying to bring it back up, the recovery of the database failed. It turned out that the cluster went down due to incorrect operating system configuration on our part (we had failed to do everything mentioned in their release notes). But the recovery failure was more serious and happened due to a bug (which was fixed within a day or two). In our use case, a data store that goes down occasionally (rarely) is something we can deal with, as long as it doesn’t stay down for long, as long as performance in the general case is good enough, and above all, as long as we don’t lose data or updates. Luckily, after some hard work, VoltDB were able to help us restore the data – it was only test data, so the data wasn’t important for us. But it was important for us to prove that when things go wrong, we don’t lose data.
We also had one earlier incident when the system went down due to a bug that had been fixed in version 2.1.1 (we started out with 2.1), something I didn’t pay much attention to at the time. However, these incidents made me take a deeper look inside some of the core classes in VoltDB. I found that very much of the code looks like it’s been written by expert programmers – only experts in C/C++ as opposed to experts in idiomatic Java. Large classes (2500+ lines in some cases – and of course, the large classes are the key ones), large methods with huge cyclomatic complexity, exceptions being swallowed, examples of inappropriate use of inheritance, and a fair amount of oddity in terms of dealing with concurrency. There was nothing that actually struck me as a bug, but where there’s smoke, there’s usually a fire. And we had indications in the form of a few noticeable bugs even within an evaluation period that had only lasted a couple of weeks.
It’s an interesting aspect of VoltDB’s choice to open-source their product that we know the state of its insides as opposed to just its outsides in the form of APIs, tools and documentation. I guess open-sourcing your code is a double-edged sword: it’s great because it gives people a way to learn your product, find answers and suggest solutions and improvements. But it also means you can’t polish your product surface and pretend to the world that the inside too is shiny. Another thing I hadn’t thought of before is that if you open-source your product, your code becomes part of your advertising. You should make it shine, and this is something that VoltDB can improve. I think improvements there could lead to both better code quality and a product that is an easier sell.
I know I am picky about code and it’s very easy to find faults in any code you read – including your own. What’s more, some of that is a matter of taste rather than an objective truth, and it’s hard to know what you don’t like about some code is down to your taste, and what is actually objectively good or bad. I know for a fact that a lot of the code we have internally looks no prettier than the VoltDB code, and though I don’t like that fact or that code, the code basically works. And as Mik, a colleague reflected, “better the devil you know”. We have no idea what, for instance, the code inside Oracle looks like.
Despite the stability issue we ran into, and despite not quite liking the state of the Java code, we decided to go ahead with VoltDB, for the following reasons:
- The great support we’ve been getting from them.
- The fact that we can live with the database going down and the system being unavailable for short and rare periods, so if the ‘smoke’ of the Java code leads to fires, we should be able to deal with that.
- The fact that we can design our system in such a way that even if we would lose data, we can recreate it based on inputs and data stored in slower, but more reliable stores. This type of recovery is enabled by the blazing speed of VoltDB, which even on suboptimal hardware allows us to insert volumes corresponding roughly to our production requirements in a couple of hours.
- The fact that we don’t believe that the other options out there are better. All the technologies that we’ve been looking at as potential candidates are about as immature as products. Either, the internet abounds with reports of them losing data or big companies abandoning them as technologies, or they (like VoltDB) are young and lack significant numbers of public, high-volume clients. Also, our in-house experience of somewhat similar technologies such as HBase and Coherence has been that they far less pleasant working with in terms of ease of use, etc, than VoltDB.
- We’ve not yet lost data, and although it is easy to bring a VoltDB cluster or at least node down through what amounts to human error (using a non-deterministic ad-hoc SQL query has been our favourite way of doing that), we’ve not seen any issue that hasn’t been possible to recover from. The scare we had in the early stages actually turned out to be a positive thing in the end. It’s better to have seen evidence of an actual failure that led to a successful recovery than to have no experience of whether a recovery might work or not.
We’ve been in production – for a fraction of our customers – for a couple of months now, and things are working very smoothly. An interesting observation is that the sheer speed of VoltDB has enabled us to simplify our architecture pretty drastically in some places. Where we had anticipated having to do some kind of complicated caching or pre-loading of data, we can simply just read it straight out of the underlying store. I don’t think we’ve seen the end of that sort of simplification yet. Some of the reflexes you get about protecting your database/s are no longer valid when the data store is capable of doing hundreds of thousands of transactions per second.
Overall, we’re very pleased with our choice of VoltDB for the high transaction volume parts of our storage needs. The product is very well-engineered and easy to work with, and the support we’ve been getting has been fantastic.
I’ve spent the last couple of weeks trying to figure out how to design a fairly large system that needs to deal with hundreds of millions of objects and tens of thousands of transactions per second (both reads and writes). That kind of throughput is hard to do with a traditional RDBMS, although there are apparently some people that manage. The problem is those tricks seem to be very much about getting amazing performance out of single database nodes, and of course, if you run out of tricks when the load increases, you’re screwed. What I’m looking for is something that is more or less guaranteed to scale. In this particular case, availability and partition-tolerance are not very important as such, but consistency and scalable throughput are, as is the ability to recover reasonably quickly from disasters. Scalability and throughput is what you get from NoSQL systems, so that would seem to be the way to go.
The problem with NoSQL is of course the relaxation of consistency. Objects are replicated to different nodes, and this may lead to updates that create conflicting versions. These conflicts need to be detected and resolved. And that is hard to do. Detection is usually a bit easier than resolution, but there is a lot of variation in when it happens. Here are a few relatively common options for conflict resolution (that is, how to figure out the correct value when a conflict has been detected):
- Automatically using some version of ‘last writer wins’, perhaps augmented using vector clocks or something similar to increase the likelihood of being correct. This is only a likelihood, though, and at least for our current case, this is insufficient.
- Let readers resolve conflicts by giving them multiple versions of the data, if there have been conflicts (this is how Amazon’s Dynamo works, by the way).
- Make it possible for writers to resolve conflicts by presenting them with the current versions in case there are conflicts, or using something like a conditional put.
All of those are hard to do in a way that is correct, algorithmically efficient and doesn’t waste space by storing lots of transaction history. But there seems to be some hope: two of the most interesting papers I’ve read are (co-)authored by the same guy, Pat Helland. If I understand him right, he argues for a type of solution that is subtly different from all the NoSQL solutions I’ve come across so far. The salient points, in my opinion, are:
- Separate the system into two layers, a scale-agnostic business layer that knows how to process messages, but is completely and utterly unaware of the scale at which the system is running, and a scale-aware layer that has no idea what business logic is executed but that knows how many nodes are running and how data is distributed across the different nodes.
- Ensure that all business-level events triggered by messages are Associative, Commutative and Idempotent (adding Distributed into the mix, he calls this ACID2.0, I’ll use ACI for the first three letters from now on). This means that if two nodes have seen the same messages, they will have the same view of the world, irrespective of the order in which they arrive, and whether they received one or more copies of a particular message.
- Ensure that the scale-aware layer guarantees at-least once delivery of messages. That is, a sender can rely on the fact that a message will always arrive at the right destination, even in the face of failures, repartitioning of data, etc. The only way to do that is to occasionally have messages arrive more than once.
With a system that follows those principles, conflicts don’t happen, so you don’t need to resolve them. Think about that for a second – associativity and commutativity means that message processing is order-independent. Idempotence means that if you receive the same message twice, the second time has no effect. This means that any differences in world-view (that is, data stored locally) between two different nodes will always be resolved as soon as they have seen the same set of messages, which is essentially the definition of eventually consistent. At-least-once message delivery guarantees that all nodes that should receive a certain message will do so, sooner or later.
The key difference between Pat Helland’s architecture and today’s NoSQL solutions is the level at which world-view coordination is done. As far as I can understand, all current NoSQL systems coordinate at the data level by making sure that bits and bytes are propagated to the right receivers in the clusters. The problem is that once those bits and bytes differ, it is very hard to understand why they differ and what to do about the conflict. There’s no generic way, at the data level, to make messages ACI, but it can be done at the business logic level. The data-replicating systems don’t and can’t realistically preserve the business-level events that led to the changes in data.
It is hard to design a system whose operations are all ACI, but it is also hard to design a system that is guaranteed to resolve conflicts correctly. In the case I’m currently working on, it’s been far easier to figure out how to make business-level events associative, commutative and idempotent rather than deal with the conflicts – I feel my thinking about these concepts is still very naive, so I have no knowledge of whether that translates to most other problems or not. In fact, I should probably say I’m not sure that it is actually the case for our system either as it’s not been implemented yet. Ask me again in six months’ time. :)
It certainly does feel like replicating at the data level makes things harder, and instead replicating business-level messages that trigger ACI operations would make for a more natural architecture. The closest I’ve seen to a system that does that is SQLFire, which is in fact where I first found the reference to the “Life Beyond Distributed Transactions” paper. SQLFire uses the entity concept from “Life Beyond Distributed Transactions”, but as far as I can tell (their documentation isn’t great right now, and renders very poorly on Chrome), SQLFire, too, appears to do replication on the data level as opposed to the level of business messages. I’d be very interested in seeing how a NoSQL solution would pan out that just provided the scale-aware distribution layer that Pat Helland mentions. You wouldn’t even necessarily have to include the data stores in such a solution – that could be done in RDBMS:s for each node if you want those semantics, or using in-memory caches, or whatever. Maybe we’ll develop such a system in this project – highly unlikely as I don’t think that would be money well spent. That it would be fun is of course not a good-enough reason…