Outbrain helps people discover the most interesting, relevant and trusted content out there. We are the leading company for content discovery. Folks like CNN, Fox News, and lots of other premium publishers and companies trust us on thier sites. The company was founded in 2006 and we do somewhere around 90 billion recommendations on over 10 billion page views per month. Right now we reach ~86% of the online US population.
A number of years ago we were looking at what would be the best way to deploy our data across multiple data centers. A number of other databases we tried just wouldn’t work, and were operationally immature. Our primary reason for using Cassandra over something like HBase is the ease with which we can span multiple data centers; it’s very simple. We’re talking very little if any operational overhead in doing what we do today with Cassandra.
By contrast, try doing something like that with MySQL and have network partition or other failure can be expensive in terms of engineering time to repair in many cases.
During Hurricane Sandy, we lost an entire data center. Completely. Lost. It.
Our application fail-over resulted in us losing just a few moments of serving requests for a particular region of the country, but our data in Cassandra never went offline.
Then, when that data center came back online, we spent a lot of time manually rebuilding other databases that we use, shipping drives across country and what not. But for Cassandra, all we did was turn it back on in that data center and ran a single command, and it brought itself back up to speed with no other manual work.
We’ve been using Cassandra for quite a while. We started way back with version 0.4. As an ops guy, I like Cassandra because, after you do your data modeling, you can turn Cassandra on and pretty much leave it alone except when you need to add new nodes online to increase capacity.
We’ve got a number of Cassandra clusters. We have a general use cluster, an ‘on line’ cluster that is directly in the path of how fast we serve up recommendations (which is critical), a few clusters tied to specific algorithms or processes, and then a number of other production and development clusters.
We like standing up different Cassandra clusters that are devoted to a specific application domain vs. having one big multi-tenant cluster that services many different apps. That seems to give us the best performance and makes finding problems easier, plus it’s also simpler to manage from an automation perspective.
We also run these clusters across multiple data centers; we have three data centers now and more on the way.
Top to bottom we use open source technologies on the front and back end. On the back end we employ a menagerie of technologies and data stores. Cassandra is one of the central pieces of our back end and serves as one of our primary data stores We also use Hadoop and Hive heavily and rely on Solr, MogileFS, MySQL, Storm, Kafka for various products and algorithms. We even have a few MongoDB small instances for various projects.
We look at how the data will be queried, its size, and how it needs to be distributed. We might use things like MySQL for historical reasons and MongoDB for smaller tasks, and then Cassandra for situations where data doesn’t all fit into memory or where it spans multiple machines and possibly data centers.
Cassandra is something we have a pretty good handle on now and know what to expect from it. When we architect and design properly we trust it to not fold under pressure.
The primary thing is to do your data modeling in a way that fits with Cassandra. Don’t treat it like MySQL, Mongo, or a pure KV store and you’ll get the full benefits of it’s replication and caching models. Get that right and your management and maintenance isn’t hard. Any real problem we’ve had with Cassandra has occurred because we tried to treat the Cassandra data model like some other data store and you can’t do that.