Matt Pfeil Co-Founder of DataStax
Matt Conway CTO of Backupify
Matt Pfeil Welcome to this Cassandra Community Podcast. Today, I have Matt Conway, the VP of engineering for Backupify with me.
Matt Conway I’m actually the CTO now. We hired a VP of engineering so I could focus on the technology side, and he could focus on the people side.
Matt Pfeil Awesome. Are you enjoying that role, out of curiosity?
Matt Conway Oh yeah, it’s great. I get to do what I’m good at and like doing, while he gets to do what he’s good at and likes doing.
Matt Pfeil Matt, why don’t you tell our listeners a little bit about yourself and also what Backupify does?
Matt Conway Okay, I’m obviously the CTO of Backupify. I’ve been in software engineering since the early nineties. I’ve done a fair number of different things. I’ve worked in C, C++, Java. Now I’m mostly into Ruby. I’ve worked at a number of different companies; the last three have been all start-ups. I certainly prefer working with smaller companies, working on interesting new technologies like Cassandra. The start-up environment appeals to me.
Backupify is a start-up. Our primary business is backing up SaaS application data. If you are a company and you store all your important business data in things like Google Docs or Salesforce, we’ll do a backup for you and make sure you have a second copy of that data. While the data is typically safe in an application like Google Apps or Salesforce, those services don’t necessarily protect you from yourself or from other employees in your company. It’s still pretty easy to lose data, and backing it up is still a very important strategy in the cloud as well as in the enterprise.
Matt Pfeil Can you tell us a little bit about how you utilize Cassandra?
Matt Conway Think of SaaS data as just a collection of records. For every single record that you back up, we post an entry into Cassandra, which acts as a system of record as well as an index into blob storage. Stuff like Twitter, which doesn’t have large binary attachments, obviously doesn’t need the blob storage side. For stuff like Google Docs, where you have all these documents floating around, you’ll have a Cassandra record, and it’ll act as a pointer to S3, where we store these large objects. It is very write intensive. We do a fair amount of reading as well, mostly to calculate aggregate data, but obviously backup is a write-intensive operation.
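As a rough illustration of the record-plus-pointer layout described above, here is a minimal Python sketch. It is not Backupify's code: the two dictionaries are in-memory stand-ins for Cassandra and S3, and the function names, key layout, and content-hash blob key are all assumptions made for the example.

```python
import hashlib

# Hypothetical in-memory stand-ins; a real system would talk to
# Cassandra and S3 through their client libraries instead.
index_store = {}  # plays the role of Cassandra: record key -> metadata
blob_store = {}   # plays the role of S3: blob key -> bytes


def backup_record(account, service, record_id, metadata, payload=None):
    """Write one index entry and, if there is a payload, one blob."""
    entry = dict(metadata)
    entry["blob_key"] = None
    if payload is not None:
        # Naming the blob by its content hash is an illustrative choice,
        # not something stated in the interview.
        digest = hashlib.sha256(payload).hexdigest()
        blob_key = f"{account}/{service}/{digest}"
        blob_store[blob_key] = payload
        entry["blob_key"] = blob_key
    index_store[(account, service, record_id)] = entry
    return entry


def restore_record(account, service, record_id):
    """Read the index entry, then follow its pointer if one exists."""
    entry = index_store[(account, service, record_id)]
    payload = blob_store[entry["blob_key"]] if entry["blob_key"] else None
    return entry, payload


# A tweet-like record needs no blob; a document-like record does.
backup_record("acme", "twitter", "t1", {"text": "hello"})
backup_record("acme", "gdocs", "d1", {"title": "Plan"}, b"doc bytes")
```

Every backup costs at least one small index write, which is one way to see why the workload is dominated by writes.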
Matt Pfeil What made you decide to switch to Cassandra over traditional technologies?
Matt Conway Having tried to shard databases in the past, I wasn’t looking forward to doing it again. Our use cases seemed like a natural fit for Cassandra; we didn’t need to do large multi-table joins or anything like that, we just needed to do a lot of write operations as reliably as possible. The nice thing about Cassandra is you can have a number of nodes go down, and the cluster keeps accepting writes without any interruption. Its reliability, its redundancy, it just worked really well and appealed to me, rather than trying to build a reliable relational database. Plus, when we started using it, its claim to fame was really good write performance. Since what we’re doing is writing more than we are reading, that also made it seem like a natural fit. Also, we run in Amazon EC2. EC2 instances do fail; they seem to fail more these days than they did in the early days, so it’s just a matter of rebuilding the node and we’re ready to go again.
And, even though we do a lot of writing, we do a fair amount of reads as well, mostly when we’re calculating aggregate data like counts and whatnot. Obviously Cassandra’s counters help, but there are some things that you can’t really count on the fly, so we do a lot of writes and we do a lot of reads. In terms of numbers, we probably do about 20 to 25 thousand writes a minute. That’s the metric from within Cassandra, the StorageProxy write operations. We do about two to three thousand reads a minute. That’s over a 21-node cluster of Amazon X1 Extra Larges. It works pretty well, it’s keeping up. I think we’re actually over-provisioned. The hardware we have can handle a lot more, but we might as well be over, right?
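To put those figures on a per-second footing, the short Python calculation below converts the quoted per-minute ranges. The per-node number assumes requests are spread evenly over the 21 nodes and ignores replication, since the interview does not give a replication factor.

```python
NODES = 21
SECONDS_PER_MINUTE = 60


def per_second(per_minute):
    """Convert a per-minute rate to a per-second rate."""
    return per_minute / SECONDS_PER_MINUTE


# Low and high ends of the ranges quoted in the interview.
writes_per_sec = [per_second(20_000), per_second(25_000)]
reads_per_sec = [per_second(2_000), per_second(3_000)]

# Assumption: even spread across nodes, replication not counted.
writes_per_node = [w / NODES for w in writes_per_sec]

print(f"writes/s cluster-wide: {writes_per_sec[0]:.0f} to {writes_per_sec[1]:.0f}")
print(f"reads/s cluster-wide:  {reads_per_sec[0]:.0f} to {reads_per_sec[1]:.0f}")
print(f"writes/s per node:     {writes_per_node[0]:.1f} to {writes_per_node[1]:.1f}")
```

That works out to roughly 333 to 417 writes per second across the cluster, or under 20 per node before replication, which fits the remark that the cluster is over-provisioned.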
Matt Pfeil Better to be safe than sorry, I guess is the old saying.
Matt Conway Definitely.
Matt Pfeil Great, and you’ve used Cassandra for several years. Can you maybe share a little bit about the original experience in terms of how you got started with it, and if you have any recommendations for someone who’s getting started for the first time?
Matt Conway We started on Version 0.6, and then fairly soon moved to 0.7. It was a little bit rocky at first, because it wasn’t quite mature enough then. We had some issues, but we just kept updating and eventually reached a state where we were completely hands-off and were happy. The other problem we had was treating it too much like a relational database: not modeling our data to fit the normal Cassandra pattern of wide rows. That had a major performance impact, and once we switched to using wide rows for our data, it made a huge difference.
My advice to new people starting out: make sure you really understand how to model data in Cassandra. Don’t just think you’re a special case and that it’s okay to have a lot of rows. It will eventually bite you. Having wide rows definitely makes a big difference. Read up as much as you can about how other people are doing that, and try to follow their pattern.
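The difference between many narrow rows and a few wide rows can be sketched in plain Python. This is a toy model, not Cassandra itself and not Backupify's schema: the dictionaries stand in for column families, and the choice of (account, service) as the row key is an assumption made for the example.

```python
# Narrow layout: one row per backed-up record, keyed by record id,
# much like a table in a relational database.
narrow = {}

# Wide layout: one row per (account, service); every record is a
# column inside that row, named by its record id.
wide = {}


def write_narrow(account, service, record_id, value):
    narrow[record_id] = {"account": account, "service": service, "value": value}


def write_wide(account, service, record_id, value):
    wide.setdefault((account, service), {})[record_id] = value


def list_narrow(account, service):
    # Every row in the store has to be examined to find the matches.
    return sorted(
        rid for rid, row in narrow.items()
        if row["account"] == account and row["service"] == service
    )


def list_wide(account, service):
    # A single row lookup returns all of the account's records.
    # Cassandra keeps columns within a row sorted; sorted() mimics that.
    return sorted(wide.get((account, service), {}))


for i in range(5):
    for account in ("acme", "globex"):
        record_id = f"{account}-doc-{i}"
        write_narrow(account, "gdocs", record_id, b"")
        write_wide(account, "gdocs", record_id, b"")
```

Both layouts return the same answer for `list_narrow("acme", "gdocs")` and `list_wide("acme", "gdocs")`, but the wide layout gets it from one row instead of a scan over every record in the store.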
Matt Pfeil I’ll second that and just say that data modeling is extremely important. For everyone who’s reading or listening to this podcast, check out the documentation on this website, planetcassandra.org. The last thing I’ll say then, is I really want to thank Matt for his time today. I’ve actually been a customer and user of yours for a couple of years now, and I will second that it is really important to protect yourself from yourself, because you never know when you’re going to delete something that you really wish you hadn’t. You guys offer a great service, and I wish you the best of success moving forward.
Matt Conway Thanks. Pay attention in the upcoming months: we’re launching a developer program, a Backupify platform, so we can go much wider on the applications we back up, and allow end users and developers to actually integrate with our platform and tell us how to back up any given SaaS application. We’ll do all the heavy lifting of doing the backup and storing it, and you just have to tell us which data is important so that we can pull it. It should be pretty interesting times; we’ll be using Cassandra for that as well. It’ll be our first foray into Cassandra 1.2, and even CQL. We are working with the latest version of CQL, and so far it’s going pretty well.
Matt Pfeil Cool. I’m glad to hear that, and I want to thank all of our listeners for their time today and check Planet Cassandra for other 5-minute interviews with users in the Cassandra community.