Mick Wever, Senior Developer at Finn.no
Hi, everyone. This is Matt Pfeil, Co-Founder of DataStax. I’m here with Mick Wever from Finn today. Mick, thank you for your time today; why don’t we kick things off by telling everyone what Finn does.
Hello! Yes, I work for a company called Finn, which is based in Norway. Finn is a classifieds website, Norway’s largest website in fact. We get 3.5 million visitors per week in a country of five million people. We deal with real estate, jobs, trading post (like eBay), vehicles, travel, and services, and we have a monopoly in most of those areas.
Sounds like you’ve been quite successful. How old is the company again? And how big is it in terms of employees?
The company started in early 2000. Today, we have 320 employees; about 120 of them are developers. Finn is a rather interesting company in that it comes from a newspaper corporation, Norway’s biggest media corporation; and in the late ’90s and around 2000, that corporation was willing to cannibalize its own newspaper profits by starting this endeavor.
That seems like it was very forward-thinking, to cannibalize themselves rather than have somebody else do it. I applaud that effort.
It’s quite a refreshing story, when you have a look at newspaper corporations around the world.
Absolutely. So, how are you guys using Cassandra today?
Today, we have four projects which use Cassandra: Vetting, My Last Searches, My Inbox, and Event Statistics. With Vetting, we analyze each listing as it comes in, on a number of different dimensions, trying to detect the probability of it being a scam, irrelevant, or offensive. That’s also backed by our customer center, which vets manually. We look at those ads which have been flagged, and we look at what scores they got in the process, and we can help speed up that process.
That sounds a little bit like fraud detection.
It is, yes. And because for each ad we have different dimensions of analysis, it’s a very two-dimensional schema. It’s fitted into Cassandra very well.
That makes a lot of sense. What were the other use cases in a little more detail?
The next project was My Last Searches. It’s very much like when you go to your profile page in Google and you have a look at what your searches were over the last few weeks or months; so we let users keep track of their searches. Again, this is very much a two-dimensional schema. Each row is the user, and we add a new column for every search that comes in.
That’s a great use case for time series right there.
Yes, and again, the TTL (Time To Live) feature of Cassandra works really well. We can say we’re only interested in collecting three or six months’ worth of searches, and let the data disappear after that.
The next project is My Inbox. This project stores all of the emails that we send and receive. We actually keep a copy; we have our own inbox for users today. This is pretty much exactly the use case that Cassandra was originally built for at Facebook.
Interesting. So, you’re using Cassandra as a message store.
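The layout described here, one row per user with a new column for every search and a TTL on each column, can be illustrated with a small in-memory sketch. This is not Finn’s code and it does not use any Cassandra client API; the class name `LastSearches`, its methods, and the 182-day retention constant are all assumptions made for illustration.

```python
import time
from collections import defaultdict

# Assumed retention window of roughly six months (hypothetical value).
SIX_MONTHS_SECONDS = 182 * 24 * 60 * 60


class LastSearches:
    """In-memory stand-in for a wide row: one row per user, one column per search."""

    def __init__(self, ttl_seconds=SIX_MONTHS_SECONDS, clock=time.time):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        # row key (user id) -> {column name (timestamp): (query, expiry time)}
        self.rows = defaultdict(dict)

    def record(self, user_id, query):
        """Add a new column to the user's row, stamped with its own expiry time."""
        now = self.clock()
        self.rows[user_id][now] = (query, now + self.ttl_seconds)

    def recent(self, user_id):
        """Return the user's unexpired searches, newest first."""
        now = self.clock()
        row = self.rows.get(user_id, {})
        live = [(ts, query) for ts, (query, expires) in row.items() if expires > now]
        return [query for ts, query in sorted(live, reverse=True)]
```

In this sketch expired columns are simply filtered out at read time; nothing is deleted explicitly, which mirrors the idea of letting the data disappear on its own.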
Very cool. Can you share any information about how big that is or how many messages you guys handle?
I don’t have those numbers on me, but, at the moment, it’s still in beta; you have to go into your profile and enable yourself as a beta user. But we’ve many GBs of data already, and millions of users using it, so it’s getting there. And it’s fast; we’re still keeping all the data in Cassandra; we’re not actually keeping the blobs in a separate store, and it hasn’t been a problem.
That’s good to know. Is it all text? Or can users actually submit pictures or files or something like that?
We don’t support attachments yet.
Okay. That’ll be an interesting thing to watch just because we’ve seen several other companies, such as Openwave, utilize Cassandra for email or message processing; so, that is a somewhat common use case. So that’s great to hear.
Yep. The last project we have, which was our original reason to go into Cassandra, is Event Statistics. All of the statistics that we give back to users on their ad listings come from a tracking process. And so, we have an in-house tracking system now, which collects events. Those events go into a raw store in Cassandra. Then, every minute, we have Hadoop jobs running on that Cassandra store and aggregating it into the statistical views that we want.
That’s a great use case, too: to store the raw data in Cassandra and utilize the analytical capabilities of Hadoop for fast results to be served back out of Cassandra. So that’s great to hear.
Out of curiosity, what was your initial motivation in looking at Cassandra, and what technology stack were you either using before or evaluating against it?
Well, this tracking solution originally came from us doing writes to Sybase and finding that it was taking up 50% of the database’s time. We have about 50 million ad pages viewed every day, and we were finding in peak traffic that Sybase just couldn’t keep up… and that operations were, in fact, disabling that write operation during peak traffic. When users’ ads were getting the most traffic, we weren’t actually collecting statistics for them anymore. So, that gave us the need to rewrite that part of the application. At that point, we were looking for something which was scalable, asynchronous, and durable.
So to get events out of the system, we chose Scribe. Messages go along Scribe and then they go into Cassandra. There, we have one row for every minute, and each event is just a new column added to the row for the minute. And then we have Hadoop processing those rows.
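The one-row-per-minute layout, with a batch job aggregating each row afterwards, can be sketched in a few lines. This is an in-memory illustration only, not Finn’s pipeline: it uses neither Scribe, Cassandra, nor Hadoop, and the names `minute_key`, `RawEventStore`, and `aggregate_minute`, as well as the row-key format and event fields, are assumptions.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone


def minute_key(ts):
    """Row key for an event: its UTC timestamp truncated to the minute."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M")


class RawEventStore:
    """In-memory stand-in for the raw store: one row per minute, one column per event."""

    def __init__(self):
        self.rows = defaultdict(list)

    def append(self, ts, ad_id, event_type):
        """Add the event as a new column on the row for its minute."""
        self.rows[minute_key(ts)].append((ts, ad_id, event_type))


def aggregate_minute(store, key):
    """Stand-in for the batch job: count events per (ad_id, event_type) in one minute row."""
    return Counter(
        (ad_id, event_type) for _, ad_id, event_type in store.rows.get(key, [])
    )
```

Because writes only ever append to the current minute’s row and aggregation only reads finished rows, the two sides never contend. If every one of the roughly 50 million daily ad page views produced exactly one event (an assumption), a minute row would hold about 35,000 columns on average.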
So, this changed a number of things. It gave us command query separation: we had one product for writing the events and a different solution for aggregating them, which has made operations a lot easier. Also, the “push on change” model, where we collect the raw data and later process it into aggregated data, has come with a lot of advantages. We’ve been able to run a lot of analytical jobs on this data; in hindsight, we can also generate statistics on how often people repeat-visited your ad, which referring domains your ad has, or which devices are most popular for your ad listing. So it was all of these possibilities that pulled us toward Cassandra the most.
We looked at other databases, but they weren’t really in the same league. When it comes to distributed technologies, it seems to be only, really, Cassandra and Hadoop which know what they’re doing; they seem to have a shining brilliance and a patience to get things right. So, we saw quite early on that MongoDB and these other products… they just weren’t there.
And being Norway’s most popular website, that scalability issue was a critical one for us at the beginning; we wanted fault tolerance, we wanted asynchronicity, and we wanted linear scalability.
Glad that Cassandra’s been performing well for you. Out of curiosity, what does the environment you’re deployed on look like in terms of the hardware?
Today, we have six nodes. They’re Xeon machines with 24 CPUs and 50 GB of RAM. Each has a 5.5 TB data disk in a RAID 50 setup. They also have a 100 GB SSD for commit logs and the Hadoop HDFS, and we put the journal for the data disk on the SSD as well. We have also changed the kernel to use the NOOP I/O scheduler, and those machines also have Hadoop task trackers running on them.
So you’re running Hadoop on the same machines as the Cassandra machines?
Interesting. That’s a great use case. In fact, I was just talking about that this morning with someone else, so that’s good to hear.
Yeah, we figured that out quite early. Also, it was Jonathan Ellis from DataStax that was quite definite that data-locality is a must. Even if there’s some penalty there with sequential disk I/O lost (from multiple applications running on the one machine), the benefit of data-locality is far more important.
These machines, they’re quite powerful; they’re not commodity hardware. One of the problems we had when we started two years ago was that we couldn’t get our hands on commodity machines. We went to operations, and the people that run our data centers… they just weren’t interested in them. In addition to that, the cost of labor in Norway is so high. It’s cheaper for them to buy one of these expensive servers than it is to buy a commodity machine and have to reboot it, or change a disk, or just simply walk into a data center.
That’s something I think the world is going to see change as it embraces the cloud more and more. Ops guys today generally work under the assumption that when a machine fails, they have to fix it immediately, as opposed to having many commodity machines where, when one dies, you just don’t care because everything’s automated. That seems like an upcoming revolution.
Yes, there’s certainly a number of things happening there, changing operations. In the beginning, we couldn’t even get local disks; they were insisting that we use a network-attached solution, and we were kind of jumping up and down screaming, saying, “No, Cassandra won’t work like this. It’s got to have local disks.” And they’re like, “Oh, but we have to keep a backup of your data.” It’s like, “No, Cassandra has replicas. We don’t want your backup.” And we had to actually do benchmarking with Cassandra to show that network-attached disks were just not appropriate.
Just to wrap up here, my last question for you is what do you believe is the most important thing about Apache Cassandra? And, based on your earlier statement, I’m curious if you could comment on the importance of uptime and performance.
Apache Cassandra is a paradigm shift for our industry. It’s been a huge paradigm shift for my own thinking and programming, but I can see that it’s also one for the industry. Our Sybase database… it still gets turned off for a number of hours during big releases or when it gets upgraded. Cassandra completely changes that, because you can do rolling restarts, and you can choose between latency or uptime for an application, even for specific use cases within a database. It’s a game changer; it completely changes the way we think about how we work with databases. It is Java-based and open source, and that’s even better.
That’s great. Mick, I really want to thank you for your time today. And if there’s anything else you’d like to add, the floor is yours.
No, that’s it for me. But I would like to say thank you very much to all the developers at Apache Cassandra. You’ve always helped us quickly. And I really appreciate reading through all of the issues, seeing that you guys are really doing it properly. Thank you for the time.
If you’d like to learn more about the development of Finn, check out their tech blog here: http://tech.finn.no/