Viktor Trako, Senior Developer at Holiday Extras
Holiday Extras provides airport car parking, hotels, lounges, and a range of other add-on products. Our goal is “Making Travel Easy”. I am a Software Engineer here at Holiday Extras.
Making data capture easy
We are using Cassandra 2.0 and CQL3. We are changing the way we collect web data and data generated from our services/microservices. From a high-level view, our architecture looks like this:
So services use a wrapper to send data to our collector API, which simply logs it. A Flume agent then sinks the data to Cassandra, where it is ready to be picked up and loaded into our data warehouse.
We do this so that services can capture any data (plus the required fields), while we keep tight control over what goes into our data warehouse.
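A table along these lines could support that split: required fields as explicit columns, with arbitrary service-specific data in a map. This is a hypothetical sketch, not Holiday Extras' actual schema; every name here is an assumption.

```sql
-- Hypothetical collector table (illustrative names only).
-- Required fields are explicit columns; anything else a service
-- captures goes into a free-form text map.
CREATE TABLE events (
    event_id    timeuuid,
    service     text,
    received_at timestamp,
    payload     map<text, text>,   -- arbitrary extra data from services
    PRIMARY KEY (event_id)
);
```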
We are also only keeping 8 days' worth of data in Cassandra, so for each write we set an 8-day TTL on that row. We're currently looking into using the wide row functionality with composite keys so that we can access our time-series data much more easily.
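In CQL3 terms, the two ideas look roughly like this; table and column names are assumptions for illustration, not the real schema:

```sql
-- Each write carries an 8-day TTL (8 * 86400 = 691200 seconds),
-- so rows expire from Cassandra automatically.
INSERT INTO events (event_id, service, received_at, payload)
VALUES (now(), 'web', '2014-01-15 10:00:00', {'page': '/checkout'})
USING TTL 691200;

-- A wide-row layout via a composite partition key: one partition per
-- service per day, clustered by time, so a day's events can be read
-- as a single ordered slice.
CREATE TABLE events_by_day (
    service    text,
    day        text,              -- e.g. '2014-01-15', bounds partition size
    event_time timeuuid,
    payload    map<text, text>,
    PRIMARY KEY ((service, day), event_time)
);
```

Bucketing partitions by day keeps any one wide row from growing without limit while still keeping a time slice contiguous on disk.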
At present, the architecture I explained above is for our back-end data collection and reporting. So at the moment Cassandra is used to store the data prior to it getting to our data warehouse.
Prior to this, the data was pumped into our data warehouse from MySQL, and our web data is still fed from MySQL as we speak. I plan to change this going forward so that we make more extensive use of Cassandra.
We did consider other technologies prior to deciding on Cassandra: Elasticsearch, DynamoDB, Riak, and Redshift.
We went with Cassandra for the following reasons:
- It fitted our use case in that our operations are write heavy.
- We preferred the masterless architecture and the redundancy gained from it.
- DynamoDB was a close contender, but we discarded it due to its 64 KB per-item size limit.
- Simply put, on writes Cassandra outperformed the rest.
Tombstones are set frequently (every write carries a TTL), which hurts read performance by inflating the number of rows each node must scan and return to the coordinator. To purge tombstones sooner, gc_grace_seconds is set to 24 hours, and repair is run daily so deletes propagate before tombstones are collected.
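The setting itself is a one-line table property; the table name here is a hypothetical stand-in:

```sql
-- Lowering gc_grace_seconds to 24 hours (86400 seconds) lets compaction
-- purge tombstones sooner. This is only safe because repair runs within
-- that window (daily); otherwise deleted data could reappear.
ALTER TABLE events WITH gc_grace_seconds = 86400;
```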
We have staging and production environments. Both are in one data center hosted on AWS, so in one region, across three availability zones: eu-west-1a, 1b, and 1c.
In staging we have 4 nodes and in production we have 6 nodes spread evenly across each availability zone.
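With an EC2 snitch, this topology maps naturally onto Cassandra's replica placement: the region becomes the data center name and each availability zone becomes a rack. A keyspace defined like the sketch below would place one replica per AZ; the keyspace name and replication factor are assumptions, not quoted from the source.

```sql
-- With the Ec2Snitch, 'eu-west' is the data center name and the
-- availability zones (1a, 1b, 1c) are racks, so NetworkTopologyStrategy
-- with a replication factor of 3 spreads one replica into each AZ.
CREATE KEYSPACE analytics
WITH replication = {'class': 'NetworkTopologyStrategy', 'eu-west': 3};
```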
We are using the DataStax AMI and have used Chef to manage the infrastructure, so adding new nodes in the future as our requirements change is seamless.
Our volume of data is not huge at present, but it is growing linearly. Our web data alone is reaching 3 TB, growing by 3 to 5 GB daily.
Thoughts on getting started
There is a lot to take into account in order to tune the cluster to suit your use case well, so I would advise anyone to go through the documentation provided by DataStax to get a solid understanding of the Cassandra architecture. I recommend automating deployment via Chef or Puppet.
There are some really good presentations available on data modeling, as well as on setting up your architecture; watch them all prior to making any design decisions.
Some I have watched that have helped are: C* Summit EU 2013: Apache Cassandra 2.0 — Data Model on Fire by DataStax’s Chief Evangelist Patrick McFadin and C* Summit 2013: How Not to Use Cassandra by Spotify’s Axel Liljencrantz.
The community is great and I have found it very useful; I have had quick responses to questions I have asked on Twitter.