Bazaarvoice collects user-generated content, such as reviews, questions and answers, and stories, for various retailers and brands, and then we analyze that information and serve it back to our clients. We’ve been doing this for about seven years now, and our customers include very well-known companies such as those shown on our website.
We started out by using MySQL in the classic master/slave, horizontal scale-out way. We found it just impossible to scale and grow write capacity with MySQL. So we started looking for something that was cloud friendly, which was very important to us. We needed something where, if any one machine goes down, it’s not a big deal, meaning our systems aren’t affected and the database can recover without human intervention.
Next, we needed a database that allowed for easy capacity expansion (especially write capacity) by simply adding new machines online. Multi-data-center support was also a very big deal, especially being able to write to multiple data centers at the same time. We didn’t want master/slave data centers but peer-to-peer data centers.
These things were key to why we chose Cassandra.
We experimented with MongoDB, and we do use Hadoop for our analytics, but when we did our architecture comparisons with HBase and Mongo and wrote a number of development prototypes, we became convinced that Cassandra was the right way to go.
The multi-data-center support, the masterless architecture, the ease of administration, the lack of a single point of failure, and constant availability in the cloud were the things that were key for us.
Cassandra is what we use for our primary datastore, but we’re big enough that other databases are also used for other things.
We use Cassandra for two main classes of data. One use case is that we take lots of product feeds from our customers and we maintain a big master catalog of all our customers’ products, names, categories and brands. So all of our customer metadata is maintained in Cassandra.
The second use case for Cassandra is one where we store all of the user-generated content from all our customers’ sites. Whenever users submit something on a customer’s website, that’s fed into Cassandra, with feeds coming into different data centers. That’s all analyzed and then returned to our customers.
We do have a data center in Dallas, but nearly all of our new development is being carried out on Amazon, spanning multiple cloud availability zones.
The biggest thing to get your head around is the data model differences. You need to think about how you’re going to read the data. We pretty much do all of our writing in Cassandra and then replicate that data over to ElasticSearch for various search and read operations. So for us, getting the schema right in Cassandra was very important.
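To make "think about how you’re going to read the data" a bit more concrete, here is a minimal Python sketch of query-first modeling. It is not our actual schema and it does not talk to Cassandra at all: plain dictionaries stand in for tables, and the names (`Review`, `ReviewStore`, `by_product`, `by_user`) are hypothetical, chosen only for illustration. The idea it shows is that each read pattern gets its own table keyed by what that read looks up, and a single logical write is copied into every one of them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    # All field names are hypothetical, for illustration only.
    review_id: int
    product_id: str
    user_id: str
    submitted_at: int  # e.g. seconds since the epoch
    text: str


class ReviewStore:
    """In-memory stand-in for two query-specific tables."""

    def __init__(self):
        # One "table" per read pattern; the dict key plays the
        # role of the partition key.
        self.by_product = defaultdict(list)
        self.by_user = defaultdict(list)

    def write(self, review):
        # One logical write fans out to every table that serves a read.
        self.by_product[review.product_id].append(review)
        self.by_user[review.user_id].append(review)

    def latest_for_product(self, product_id, limit=10):
        # Sorting newest-first mimics a descending clustering order.
        rows = self.by_product.get(product_id, [])
        return sorted(rows, key=lambda r: r.submitted_at, reverse=True)[:limit]

    def latest_for_user(self, user_id, limit=10):
        rows = self.by_user.get(user_id, [])
        return sorted(rows, key=lambda r: r.submitted_at, reverse=True)[:limit]


if __name__ == "__main__":
    store = ReviewStore()
    store.write(Review(1, "p1", "u1", 100, "Works well"))
    store.write(Review(2, "p1", "u2", 200, "Great value"))
    for review in store.latest_for_product("p1"):
        print(review.review_id, review.text)
```

The trade-off in this style is that writes do more work and data is duplicated, in exchange for each read touching exactly one key. That fits a database where write capacity is easy to grow.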