Tomas Salfischberger Managing Director, Co-founder at Relay42
Relay42 is a Tag & Data Management Platform. With Tag Management we provide marketers with a platform to integrate third-party tags such as Google Analytics, along with the flexibility to collect data for our data management offering. The data management platform collects visitor-level interaction data both from the website and from all marketing channels and other external sources, resulting in 2.5 to 3 billion events stored each month.
What makes our platform unique is that we don’t just create reports from this data; we actually link it to personalized actions across different marketing channels such as email, banner advertisements, videos, and even the client’s website.
Natural need for NoSQL
The large volume of events required us to look for alternatives to a traditional RDBMS from the start. We tested several NoSQL-type solutions, mainly for scalability and raw performance, though resilience and crash recovery were also important factors for us. Cassandra (version 0.6 at the time) scored very well in scalability, ease of deployment, and the solidity of its architecture. In later versions we’ve seen this improve further and further, allowing us to store more data per node with each new release.
Our platform is quite widely distributed, currently using 13 datacenters across all continents, because latency to end users is very important to us. The bulk of the data consists of raw events, which we store in our main cluster on private hardware; this cluster is approaching 50 billion events stored and is growing rapidly.
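Storing tens of billions of events in Cassandra hinges on a partitioning scheme that spreads writes evenly and keeps any single partition from growing without bound. As a minimal sketch (the key layout and names here are hypothetical, not Relay42’s actual schema), a common approach is to bucket a visitor’s events by time period:

```python
from datetime import datetime, timezone

def partition_key(visitor_id: str, ts: datetime) -> str:
    """Bucket events per visitor per month (hypothetical scheme).

    Bounding each partition to one visitor-month keeps partitions small
    enough for Cassandra to handle efficiently, while still letting a
    reader fetch one visitor's recent history with a handful of reads.
    """
    return f"{visitor_id}:{ts.strftime('%Y%m')}"

# Two events in the same month land in the same partition...
k1 = partition_key("visitor-42", datetime(2014, 5, 3, tzinfo=timezone.utc))
k2 = partition_key("visitor-42", datetime(2014, 5, 28, tzinfo=timezone.utc))
# ...while the next month starts a fresh partition.
k3 = partition_key("visitor-42", datetime(2014, 6, 1, tzinfo=timezone.utc))
```

Within each partition, events would then be clustered by timestamp so a visitor’s interactions read back in order.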
The reason for private hardware is that our workloads are largely IO-bound, and bare-metal hardware handles that much better than cloud instances. Another reason is the nature of the data: it is non-personally identifiable information, but we still consider it privacy sensitive and thus don’t want to store it on shared systems or in the public cloud.
Words of wisdom
Don’t be afraid to store data multiple times. It might feel counterintuitive to keep large amounts of data around “just in case”, but we do exactly that. We store every raw event we have ever received and process the raw data into more meaningful information in later steps. By never deleting the raw data, we can always go back and process it again. So if we develop a new feature that requires data we didn’t extract before, or if we decide that some new format might help performance, we run a large Hadoop job that goes back through the historic data and processes it all again.
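The pattern described above, keeping immutable raw events and deriving new views by replaying them, can be sketched in miniature. This is a pure-Python stand-in for the map and reduce phases of such a Hadoop job; the event shape and aggregation are illustrative assumptions, not Relay42’s actual job:

```python
from collections import defaultdict

# Raw, immutable events as they might have been stored (hypothetical shape).
# Because these are never deleted, any new derived view can be built later.
raw_events = [
    {"visitor": "a", "type": "pageview", "url": "/home"},
    {"visitor": "a", "type": "click",    "url": "/offer"},
    {"visitor": "b", "type": "pageview", "url": "/home"},
]

def map_event(event):
    # Map phase: emit one (key, value) pair per event,
    # keyed on whatever the new feature needs, e.g. (visitor, event type).
    yield (event["visitor"], event["type"]), 1

def reduce_counts(pairs):
    # Reduce phase: sum the values for each key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Replaying the full history builds the derived view from scratch.
derived = reduce_counts(p for e in raw_events for p in map_event(e))
```

If a new feature needs a different aggregation, only `map_event` and `reduce_counts` change; the raw events are replayed unchanged, which is exactly why never deleting them pays off.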
The Cassandra community
We started back in the Cassandra 0.6 days and, at that time, of course ran into some bugs here and there. Cassandra wasn’t as widespread then as it is now, so it was a bit harder to find solutions to common problems. The Cassandra developers have always been very responsive to problems and suggestions, which was great. The current, much larger community is very open, helpful, and still growing, which is a very good indicator of Cassandra’s maturity as a product.