Reddit started out as a place for people to go and post links, find interesting things on the Internet, and share information with other people. Over the years it has evolved and now we like to say it’s the “front page of the internet”; most of what is happening on the web you’ll find on Reddit. Users will submit links and other users will vote those links up and down, curating the content themselves. In addition to that, you have a huge number of sub-communities called SubReddits that have their own rules and traditions; they are generally focused on specific content.
One of the nice things about Cassandra, when comparing it to other offerings, is its self-healing read-repair feature; that was pretty enticing to us when we were first looking at it. If we do a look-up and something’s behind, then we have a way to get that addressed instead of having to go manually dig it out.
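As a rough sketch of the idea (a toy model, not Cassandra’s actual implementation): each replica stores a value with a timestamp, a read compares what the replicas return, hands back the newest version, and writes that version back to any replica that is behind.

```python
# Toy read-repair sketch: replicas are plain dicts mapping
# key -> (value, timestamp). A read finds the freshest version
# and pushes it back to any replica that is lagging.

def read_with_repair(replicas, key):
    """Return the freshest value for `key` and repair stale replicas."""
    versions = [(r[key][1], r[key][0]) for r in replicas if key in r]
    if not versions:
        return None
    latest_ts, latest_val = max(versions)
    for r in replicas:  # push the winning version to lagging replicas
        if key not in r or r[key][1] < latest_ts:
            r[key] = (latest_val, latest_ts)
    return latest_val

# Example: replica r2 missed a later write and is behind.
r1 = {"link:42": ("upvoted", 2)}
r2 = {"link:42": ("new", 1)}
print(read_with_repair([r1, r2], "link:42"))  # prints: upvoted
print(r2["link:42"])  # r2 repaired to ('upvoted', 2)
```

This is the property described above: a stale replica gets fixed as a side effect of a normal read, rather than through a manual dig-it-out process.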
With our PostgreSQL database we had a single master, and that could only let us scale so far. Cassandra’s horizontal scaling – being able to add nodes and scale out our writes – was the biggest factor in the initial decision process.
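The reason adding nodes scales writes is consistent hashing: each key maps to a position on a token ring, so bringing a new node in takes over only a slice of the keys. A minimal illustrative sketch (not Cassandra’s actual partitioner):

```python
import hashlib
from bisect import bisect_right

# Minimal consistent-hash ring: a key belongs to the first node
# clockwise from its hash, so adding a node relocates only the keys
# in the token range that the new node takes over.

def token(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.ring = sorted((token(n), n) for n in nodes)

    def node_for(self, key):
        tokens = [t for t, _ in self.ring]
        i = bisect_right(tokens, token(key)) % len(self.ring)
        return self.ring[i][1]

keys = ("link:1", "link:2", "link:3", "user:alice")
ring = Ring(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring = Ring(["node-a", "node-b", "node-c", "node-d"])  # scale out
after = {k: ring.node_for(k) for k in keys}

# Every key that moved now lives on the new node; nothing else shuffles.
moved = [k for k in keys if before[k] != after[k]]
```

Contrast this with a single-master setup, where every write funnels through one box no matter how many replicas you add.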
When we started using Cassandra, we started on version 0.6, which was quite a while ago of course; we were kind of seen as the early adopters. There was a lot of fear on the internet about “why are you guys keeping all this important data in Cassandra” or “where is all this data getting stored.” But we felt pretty strongly about it; there were some hiccups along the way, but the flexibility that we got from Cassandra is probably the biggest reason why we kept on using it.
In times where there were errors, especially in the pre-1.0 versions, we were generally able to recover the data because of Cassandra’s self-healing nature. If something bad ever did occur in Cassandra, maybe on the software side or even on the hardware side, we were able to recover pretty reasonably, which is rarely possible with other database offerings. If you lose a large chunk of data, or something bad happens to some of your data, then you might be down for quite a while during a manual repair.
We started with 0.6 and have gone through each version since then: 0.7, 0.8 and then finally 1.0. In each one, we ran into our own little problems because we were using it so heavily compared to a lot of other people at the time. Because we were using it in so many different ways, we ran into a lot of use cases that other people probably didn’t hit so often; that was something pretty unique about us.
Since we hit the 1.0 version, I would say things have been pretty stable; we’re happy with how things have turned out and we’re looking forward to some of the stuff that’s coming up. We’re currently on version 1.1 right now and we’re looking at upgrading to 1.2 and some of the stuff that’s on the horizon as well.
The big thing that we’re interested in next with Cassandra is virtual nodes, or vnodes, which are in version 1.2, but we’re not quite there yet. It looks really interesting to see how we can work with some of our data models and how we can utilize vnodes. We don’t really use the CQL stuff on the application side, but the improvement of the command-line utilities is something that, as an administrator, I am really looking forward to.
We are using Cassandra for quite a few different things, actually. Most of the listings, and all of the links, on Reddit come out of Cassandra in one way or another. The biggest thing we’re using Cassandra for is denormalized lookups; this is data that we need to pull out in denormalized form, so we don’t have to hit our backend PostgreSQL database constantly. Also, any time you vote on something (making sure that little arrow is going to be colored in the next time you view the page), that’s going to be pulled from Cassandra as well; most of the voting data is stored in Cassandra, along with your liked pages and SubReddit pages. We store our votes in multiple areas, and users make around 17 million votes a day, so it’s a lot of data getting tossed in there every day.
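The denormalized-lookup pattern can be sketched in a few lines (names like `cast_vote` and `vote_of` are illustrative, not Reddit’s actual code): the write path stores each vote twice – once in a canonical record and once in a lookup table keyed directly by what the page render needs – so coloring the arrow is a single key lookup rather than a relational query.

```python
# Denormalized lookup sketch: write each vote to the canonical store
# AND to a lookup map keyed by (user, link), trading extra writes for
# O(1) reads on the hot page-render path.

canonical_votes = []   # append-only record of every vote cast
votes_by_user = {}     # denormalized copy: (user, link) -> direction

def cast_vote(user, link, direction):
    canonical_votes.append((user, link, direction))
    votes_by_user[(user, link)] = direction   # keep both copies in sync

def vote_of(user, link):
    """Direct lookup used when rendering the arrow color."""
    return votes_by_user.get((user, link))

cast_vote("alice", "link:42", "up")
print(vote_of("alice", "link:42"))  # prints: up
print(vote_of("bob", "link:42"))    # prints: None
```

At 17 million votes a day, the extra write per vote is cheap compared to avoiding a PostgreSQL round trip on every page view.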
We have 10 nodes right now running on Amazon EC2 SSD instances, which are super hefty boxes, and there’s roughly 2TB of data that we’re storing within Cassandra across those nodes; it’s all on version 1.1.7.
As you may know, Amazon has multiple availability zones, and our future plans are to spread things across those availability zones, making use of features like global quorum, to scale more horizontally and across multiple regions. This will help us create better availability for our users.
I would say the biggest thing to consider is what type of data model you want to lay out and how you can utilize some of Cassandra’s features to implement that data model. If you don’t really have a clear concept of what you want to use your primary data store for, then it might be difficult to figure out how you’re going to be using Cassandra, what type of data model you’re going to use or how many nodes you might need.
Make sure to get an idea of what you’re going to be storing in Cassandra and the model you’d like to store it in. With Cassandra, you’re not locked into as rigid a schema as you would normally see in SQL, so you can be much more flexible on that end and play to those features.
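The schema flexibility described above can be pictured like this (an illustrative model of Cassandra’s column families at the time, not the actual Thrift or CQL API): each row is just a map of column names to values, so two rows in the same column family can carry completely different columns with no migration.

```python
# Sketch of a schemaless column family: rows are maps of
# column name -> value, so each row can have its own set of columns.

column_family = {}

def insert(row_key, columns):
    """Upsert columns into a row, creating the row if needed."""
    column_family.setdefault(row_key, {}).update(columns)

# A link row and a user row live side by side with unrelated columns.
insert("link:42", {"title": "Cassandra at Reddit", "score": 100})
insert("user:alice", {"liked:link:42": 1})

print(sorted(column_family["link:42"]))  # prints: ['score', 'title']
print(column_family["user:alice"])       # prints: {'liked:link:42': 1}
```

In a SQL table, adding the `liked:link:42` column would mean an ALTER TABLE affecting every row; here it’s just another key in one row’s map.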
Cassandra has been a good piece of software; we’ve had our tribulations with it but in the end it’s been more positive than negative, so we’re pretty happy and looking forward to new features that are on their way.