Bryan Valentini: Senior Analytics Engineer at appssavvy
TL;DR: appssavvy is an advertising company built around delivering the right kinds of ads at the right points in a user’s activity flow, such as while the user is playing a game or reading news content. Rather than jamming ads in front of users all the time, the company waits for the right break in the activity to deliver them.
Every hour appssavvy receives millions of events that track which ads people see and whether they close an ad, share it on Facebook, or ignore it, so the company needs a database that can handle the load. When Bryan joined appssavvy, the company was facing scaling issues with its MySQL database and opted to replace it with Cassandra.
appssavvy now uses Apache Cassandra to store all the incoming events in a large batch processing system. Its Cassandra deployment is spread across two datacenters, with clusters of about 10 nodes and 5 nodes.
Hello Planet Cassandra. Today we have Bryan of appssavvy here. Bryan, could you tell us a little bit about what appssavvy does and your role at appssavvy?
Sure. appssavvy is an advertising company based around delivering the right kinds of ads during the right points in a user’s activity flow. They might be playing a game or reading news content. Rather than try to jam the ads in front of their face all the time, we wait for the right moment, the right break in the activity, to deliver. It actually ends up being better for the user and often performs better than normal display, so what we do at appssavvy is find the best times to deliver ads.
How are you using Apache Cassandra at appssavvy?
We use Apache Cassandra to store all the incoming events in a large batch processing system. Every hour we receive millions of events. We keep track of what ads people see, whether they close the ad, share it on Facebook or ignore an ad. We figure out which campaigns work with the best publishers and deliver good value for the brand. What Cassandra allows us to do is handle all these events in a robust fashion, and then export portions of the data to Hadoop.
Excellent. What was your motivation for choosing Apache Cassandra? Were there any other technologies that you evaluated it against?
Yes, indirectly. When I came to appssavvy we were writing all these events to MySQL. To deal with the scalability issues we decided to go with Cassandra. We tried MongoDB in the process, but it had trouble keeping up with the write load. What Cassandra was really great at was standing up on several nodes and naturally handling the replicated load. We felt pretty comfortable with the technical merits of Cassandra and wanted to make sure that engineering-wise it worked well for us.
That’s great. It sounds like you found the right solution at the right time for you. Could you share some insights into what your deployment looks like?
We initially deployed two datacenters. The first datacenter had about ten nodes and the other one about five. The primary, larger cluster asynchronously replicated everything to the smaller datacenter. We did as much of the analytics as possible on the smaller datacenter while leaving the primary datacenter mostly unencumbered by the analytics process.
Absolutely. That’s interesting. For future versions of Apache Cassandra are there things that you plan on doing with your application that you would like to see Cassandra provide support for? Different features or anything like that?
There are quite a few features that I would like to try that exist already, primarily virtual nodes and the ability to do incremental maintenance steps. In the future I’d like to see the performance of the indices increase. We thought we might benefit from several indices aside from the primary. It turned out that every time we tried to compact those column families it took a really, really long time and we had to abandon that approach. The other thing that I’d like to see is a little bit more help on the administration side. We ran into a lot of strange bugs in the 1.1.x series with nodes going down or failing to leave the ring, and so on and so forth. I know it’s a complicated problem. With the initial deployment there were a lot of issues that we ran into. It was really concerning to see those bugs come up on the mailing list.
So more automation you’re thinking in regards to the administration side?
I think there could be. Compare it to the CouchDB interface: managing a CouchDB cluster is quite nice, and it simplifies the tasks that you normally end up doing with nodetool and whatever version of CQL or CLI you are working with. I think where Cassandra could definitely go is toward built-in automated management.
What has your experience been with the community, whether on the virtual side of things with the mailing list or the physical side of things?
I’d really love to go to the DataStax Cassandra event and learn from people who work with Cassandra more often, day-to-day. My schedule doesn’t always allow me to learn everything I can about Cassandra. You can probably guess how it is. On the virtual side, it’s been pretty good. Usually people are very responsive on the mailing list, within a couple of days. There were a couple of times where things got a little hairy on our end and people gave it their best shot on the mailing list, which is more than you can ask for from a free service. We’re a small startup, so we can’t always afford to go to these big master classes. It would be interesting to see a wider range of learning opportunities, if that makes sense.