Alan Coleman: Vice President of Engineering at VigLink
Brady Gentile: Community Manager at DataStax
TL;DR: VigLink helps publishers make money from blogs, forums and other kinds of sites by using affiliate marketing.
Their first use of Cassandra is at impression time: a background process performs natural language processing, identifies commercial mentions, and matches those mentions up with particular products available from particular merchants. They store the results of that analysis in Cassandra, keyed by a page identifier.
Their second use of Cassandra is for an analytics dashboard: they aggregate performance data through a Hadoop cluster and store the aggregated dashboard data in a set of Cassandra column families.
VigLink moved off of an overstressed MySQL database and onto a Cassandra data store. They are completely in the cloud, running everything on AWS (Amazon Web Services) in two data centers; one data center is in Europe and the other in the US.
Today we have Alan Coleman, VP of Engineering at VigLink, here to talk to us about how they’re using Apache Cassandra. Alan, thanks for joining us. To get things started, could you tell us a little bit about what VigLink does?
At VigLink, we help publishers make money from blogs, forums and other kinds of sites by using affiliate marketing. Affiliate marketing is a mechanism where if you have a link to a merchant on your site and someone clicks on that link and buys something, that merchant will pay you a commission.
What VigLink does is completely automate the affiliation process, so the publisher doesn’t have to put special parameters in their links. We also (and probably more interestingly) analyze the content of pages, posts, and other data to identify product references and other commercial terms. We then automatically convert those into affiliated links, which go directly to the appropriate product on the appropriate merchant’s site.
We use a fair amount of natural language processing to do that. We also have thousands of publishers and serve billions of impressions a week; we have fairly stringent uptime requirements for our service.
And how does Apache Cassandra fit into the mix?
We use Cassandra for a few things. The first is at impression time; what we have is a background process that performs natural language processing, identifies commercial mentions, and matches those mentions up with particular products available from particular merchants.
We store the results of that analysis in Cassandra using a page identifier. Whenever someone views one of those pages, that page calls back into our service. We look up the set of references for that page and the corresponding product destinations, both stored in Cassandra, and return them to the viewer. That’s a high-volume, low-latency, mainstream use case for Cassandra. We have a couple of column families which are highly indexed and perform those lookups for us.
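Editor's note: the lookup described above might be sketched roughly as follows. This is only an illustration under stated assumptions, not VigLink's code: a plain Python dict stands in for a Cassandra column family, and every name here (ProductReference, PageReferenceStore, the page identifier, the merchant and URL) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProductReference:
    phrase: str       # commercial mention found in the page text
    merchant: str     # merchant that sells the matched product
    product_url: str  # destination for the affiliated link


class PageReferenceStore:
    """In-memory stand-in for a column family keyed by page identifier."""

    def __init__(self):
        self._rows = {}

    def save_analysis(self, page_id, references):
        # Background job: written once, after the page has been analyzed.
        self._rows[page_id] = list(references)

    def lookup(self, page_id):
        # Impression time: a single key lookup, with no analysis on the hot path.
        return self._rows.get(page_id, [])


store = PageReferenceStore()
store.save_analysis(
    "page-123",
    [ProductReference("Acme X100 camera", "example-merchant",
                      "https://merchant.example/x100")],
)
refs = store.lookup("page-123")
```

The point of the sketch is the split between the two paths: the expensive analysis happens in the background, so serving a page view is only a read by key.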
Our other use of Cassandra is for an analytics dashboard. Our customers (the actual publishers) will log into our site to understand the performance of their site: how many clicks they are getting on affiliated links and how much money they are making, which they tend to care a lot about. What we do there is aggregate a lot of the performance data through a Hadoop cluster and then store the aggregated dashboard data in a set of Cassandra column families. When a user logs in and goes to look at their dashboard, that performance data is ready and quickly served for them instead of having to go off and run a bunch of gnarly queries.
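Editor's note: the pre-aggregation idea above can be illustrated with a small sketch. It assumes a made-up event shape (publisher_id, day, commission_cents) and uses an in-memory dict where the interview describes a Hadoop job writing to Cassandra column families; none of these names come from VigLink.

```python
from collections import defaultdict


def aggregate_clicks(events):
    """Batch step: roll raw click events up into per-publisher, per-day totals."""
    totals = defaultdict(lambda: {"clicks": 0, "revenue_cents": 0})
    for event in events:
        key = (event["publisher_id"], event["day"])
        totals[key]["clicks"] += 1
        totals[key]["revenue_cents"] += event["commission_cents"]
    return dict(totals)


def dashboard_rows(aggregates, publisher_id):
    """Read step: return one publisher's precomputed rows, ordered by day."""
    rows = [(day, row) for (pub, day), row in aggregates.items()
            if pub == publisher_id]
    return sorted(rows, key=lambda item: item[0])


events = [
    {"publisher_id": "pub-1", "day": "2013-05-01", "commission_cents": 100},
    {"publisher_id": "pub-1", "day": "2013-05-01", "commission_cents": 50},
    {"publisher_id": "pub-1", "day": "2013-05-02", "commission_cents": 0},
    {"publisher_id": "pub-2", "day": "2013-05-01", "commission_cents": 25},
]
aggregates = aggregate_clicks(events)
rows = dashboard_rows(aggregates, "pub-1")
```

Because the totals are computed ahead of time, loading the dashboard only has to read a handful of small rows rather than scan raw events.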
What was your motivation for using Cassandra? Did you look at any other technologies before deciding on it?
The choice of Cassandra actually predates my arrival at the company. I’ve been here about seven months now and we were already pretty heavy users of Cassandra when I arrived. I had also worked with Cassandra at my previous company. The choice of Cassandra here is, I think, a fairly mainstream use of the product. What we have are some of the things that Cassandra was really designed for: a high-volume key-value problem, but one that’s a little more structurally complex, which other data stores can’t handle. What I do know is that before I started, VigLink moved off of an overstressed MySQL database and onto the Cassandra data store.
Can you share with us some insights on what your deployment looks like? Are you hosting in the cloud or in your own datacenter, and how many servers are you running?
We are all in the cloud, running everything on AWS (Amazon Web Services). We run two data centers right now, one in Europe and one in the US. Our runtime services are essentially structured as little clusters, each with several Tomcat servers that sit in front of a Cassandra server and a MySQL server with Memcache. The data in them is replicated from the main master stores. We have two or three of these little clusters in Europe and two or three in the US.
Because Cassandra has really nice replication, we’re able to scale out the service pretty easily. As far as dashboard and main site use, we maintain all of that in the cloud as well. All of that is stored in the US, where we have our master Cassandra store and a separate cluster for computing aggregates.
Are there any features that you’d like to see in future versions of Apache Cassandra that would make it easier to work with?
One thing that has been improving (and I think will help as it continues to improve) is automation of memory management. In early versions of Cassandra, you really had to tinker with the memory settings quite a bit; that’s gotten a ton better. As Cassandra gets more intelligent about how it manages memory overall across column families, etc., I think that will help us a lot. That’s sort of the main thing we find we still have to tinker with from time to time. The replication is quite reliable and we haven’t had any issues with that. We haven’t really had much of an issue with management and monitoring either. I think improvements to some of those core server memory management and performance issues will help.
What’s your experience with the Apache Cassandra community?
Some of our developers have been involved and I know they’ve gotten really good and helpful responses from the community. I believe they’ve also been active and responded to others as well. It seems like a really nice vibrant community.
Alan, thank you so much for joining us today. We really appreciate hearing about how you guys are using Cassandra over at VigLink. Is there anything else that you’d like to add before we close out here?
Yeah, I’d like to give a little shout out to the Leveled Compaction Strategy. We just recently enabled that on our clusters and it’s really made a significant difference by lowering our average latency and making our latency much more consistent. I know that’s gotten pretty good reviews all over the place. It has certainly also worked well for us.
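Editor's note: for readers who want to try leveled compaction, here is a minimal sketch that builds the CQL statement for switching a table's compaction strategy. The interview does not say how VigLink enabled it, so this is an assumption about one possible route; the keyspace and table names are hypothetical, and the helper only produces the statement text rather than running it against a cluster.

```python
def leveled_compaction_cql(keyspace: str, table: str) -> str:
    """Build a CQL statement that switches one table to leveled compaction."""
    # Identifiers are hypothetical; check them so the sketch cannot be used
    # to splice arbitrary text into the statement.
    for name in (keyspace, table):
        if not name.isidentifier():
            raise ValueError(f"not a plain identifier: {name!r}")
    return (
        f"ALTER TABLE {keyspace}.{table} "
        "WITH compaction = {'class': 'LeveledCompactionStrategy'};"
    )


statement = leveled_compaction_cql("demo_keyspace", "page_references")
print(statement)
```

The resulting statement could then be run through cqlsh or a driver session. Changing the strategy causes existing data to be recompacted, so it is worth trying on a test cluster first.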