Rustam Aliyev: Author and Project Maintainer at ElasticInbox
Matt Pfeil: Co-Founder at DataStax
TL;DR: ElasticInbox is an open source project that aims to change how email storage is done. It was designed from the ground up to be distributed and easily scalable. The goal was an email storage system that is reliable and scales linearly on commodity hardware, simply by adding new nodes.
ElasticInbox relies on Cassandra for storing all message metadata and uses cloud blob storage such as Amazon S3 for storing original message sources. There are many scenarios where Cassandra excels. For ElasticInbox, the schemaless data model of Cassandra was a great match for email structure.
ElasticInbox’s largest deployment has been running in production for almost two years. It started with Cassandra 0.8 and currently runs on version 1.2. It is a cluster of seven nodes and holds about 200GB of data.
For today’s Apache Cassandra Use Case interview, I’m joined by Rustam Aliyev from ElasticInbox, based in London, UK. Welcome, Rustam. First question for you: Can you tell us a little bit about what ElasticInbox does and also your role there?
Sure, thanks for having me. ElasticInbox is an open source project which aims to change how email storage is done. We designed it from the ground up to be distributed and easily scalable. I designed and implemented the initial version and now I maintain the open source project.
Okay, great. So now, what problem are you trying to solve with ElasticInbox? There are obviously a lot of email storage systems; what are the current problems you see with them?
Traditionally, an email storage system consists of some software which stores messages on locally mounted disks and scales through sharding of mailboxes. That’s not convenient when you have millions of users, because you run into problems such as rebalancing of shards and expensive hardware. We wanted to design an email storage system that would be reliable and scale linearly on commodity hardware, simply by adding new nodes.
You designed this system from the ground up; is that correct?
So if we wanted to use ElasticInbox, how would we go about doing that?
ElasticInbox basically relies on Cassandra for storing all message metadata and uses cloud blob storage such as Amazon S3 for storing original message sources. In the latest version, which we released a week ago, you can now store message sources in Cassandra as well, thanks to the recent improvements in Cassandra for nodes with large amounts of data.
So if you want to install ElasticInbox, you will need a minimal Cassandra setup, which could be a single node, and you will need ElasticInbox; that’s it. For message sources you can use your local file system or a cloud blob store such as Amazon S3.
Great. Let’s talk a little bit about why you chose Cassandra for this project. Did you look at other technologies? Were you familiar with Cassandra before and just said “okay, I’m going to use that”? If you could, please talk a little bit about your journey to Cassandra for ElasticInbox.
We started designing ElasticInbox when Cassandra version 0.7 had just been released. We reviewed most of the similar open source NoSQL storage systems that were available, everything from MongoDB and CouchDB to HBase and Riak. One of the major goals was to have a system without a single point of failure, and most of those solutions unfortunately do have some kind of single point of failure.
Another important factor was high availability out of the box. That left us with just two options: Cassandra and Riak. At that time, Cassandra 0.8 wasn’t released yet, but it had counters among its new features. Counters were really important for us. We rely on counters for many things, such as the number of unread messages in the inbox. Another example is the total number of messages and bytes in your mailbox. For all of these we use counters. Riak didn’t support counters back then, and Cassandra had some other advantages over Riak, so we chose Cassandra.
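The per-mailbox bookkeeping described here can be pictured with a small in-memory sketch. It is only an illustration of what the counters track, not ElasticInbox's code and not Cassandra's counter API: the class name, method names, and counter keys are all hypothetical, and a plain Python `Counter` stands in for distributed counter columns.

```python
from collections import Counter


class MailboxCounters:
    """Toy model of per-mailbox counters (hypothetical names throughout)."""

    def __init__(self) -> None:
        # A local Counter stands in for Cassandra counter columns.
        self.counts = Counter()

    def add_message(self, size_bytes: int, unread: bool = True) -> None:
        # Every stored message bumps the message and byte totals.
        self.counts["total_messages"] += 1
        self.counts["total_bytes"] += size_bytes
        if unread:
            self.counts["unread_messages"] += 1

    def mark_read(self) -> None:
        # Reading a message only changes the unread count.
        self.counts["unread_messages"] -= 1


counters = MailboxCounters()
counters.add_message(1000)
counters.add_message(500, unread=False)
```

The point of keeping these as counters is that an inbox view can show totals without scanning every message.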
Okay, and thank you for explaining that. You touched on one of the things that we like to do in these interviews, which is pass along advice to other community members who are looking to get started with Cassandra. What advice would you pass to someone who is looking to get up and running?
There are many scenarios where Cassandra excels. For us, the schemaless data model of Cassandra was a great match for email structure. For example, some email messages have “subject” or “cc” headers, whereas other emails do not. You don’t need to design a schema, as you would with a traditional RDBMS. So for those who are dealing with dynamic data, where the schema varies between records or may change over time, I would advise you to explore Cassandra’s capabilities.
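To make the header example concrete, here is a minimal Python sketch of sparse, per-message metadata. It assumes nothing about ElasticInbox's real schema: the function name `extract_metadata` is hypothetical, and a dictionary stands in for a row whose columns differ from message to message.

```python
from email import message_from_string


def extract_metadata(raw: str) -> dict:
    """Return only the headers a message actually has (hypothetical helper)."""
    msg = message_from_string(raw)
    # Each header present becomes a "column"; absent headers are simply
    # missing rather than stored as empty placeholders.
    # Note: a repeated header keeps only its last value in this sketch.
    return {name.lower(): value for name, value in msg.items()}


with_cc = "From: a@example.com\nSubject: Hello\nCc: b@example.com\n\nbody"
without_subject = "From: a@example.com\n\nbody"

row_one = extract_metadata(with_cc)          # has from, subject, cc
row_two = extract_metadata(without_subject)  # has only from
```

Both results can be stored side by side even though they carry different sets of keys, which is the property the answer above relies on.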
Will you talk a little bit about the environment that you use for ElasticInbox and maybe the version of Cassandra that you are running as well?
ElasticInbox’s largest deployment that I’m aware of has been running in production for almost two years. We started it with Cassandra version 0.8 and currently run on version 1.2. It is a cluster of seven nodes and holds about 200GB of data. That includes the metadata and the sources of small email messages. Larger messages are stored in cloud blob storage. It’s a single-datacenter deployment and uses only mechanical disks. We typically see response times of around a millisecond for writes and around 10ms for reads. Right now there are over one hundred million emails stored in the system, and at least one million emails are added every day.
Now, when you say one hundred million emails stored: if I attach files, do they get stored in Cassandra or do they get stored in blob storage? How big can this data get?
Each email is split into two parts: metadata and the message itself. We extract the metadata, index it and store it in Cassandra, and then we look at the source of the message. If the original message is less than, let’s say, 20KB, we store it in Cassandra. Otherwise we store it in cloud blob storage. Cassandra handles small data really well. In our case, more than 90% of all messages are small and stored in Cassandra. With this model it is possible to scale to petabytes and keep costs low.
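The size-based routing described in this answer can be sketched roughly as follows. This is a toy model under stated assumptions, not ElasticInbox's implementation: the class, its fields, and the exact threshold constant are hypothetical (the interview only says "let's say, 20KB"), and plain dictionaries stand in for Cassandra and the blob store.

```python
from dataclasses import dataclass, field

# Assumed threshold; the interview gives "let's say, 20KB" as an example.
SMALL_MESSAGE_LIMIT = 20 * 1024


@dataclass
class MessageStore:
    """Toy stand-in for the two backends (hypothetical names)."""
    metadata: dict = field(default_factory=dict)       # stands in for Cassandra metadata
    small_sources: dict = field(default_factory=dict)  # stands in for sources kept in Cassandra
    blob_sources: dict = field(default_factory=dict)   # stands in for S3 or a similar blob store

    def store(self, message_id: str, headers: dict, source: bytes) -> str:
        # Metadata is always recorded, whatever the message size.
        self.metadata[message_id] = {**headers, "size": len(source)}
        # Small sources stay next to the metadata; large ones go to blob storage.
        if len(source) < SMALL_MESSAGE_LIMIT:
            self.small_sources[message_id] = source
            return "cassandra"
        self.blob_sources[message_id] = source
        return "blob"


store = MessageStore()
store.store("m1", {"subject": "hi"}, b"x" * 100)      # small: kept with metadata
store.store("m2", {"subject": "report"}, b"x" * 50_000)  # large: sent to blob storage
```

Because most messages fall under the threshold, the common case is served entirely from one store, while the rare large source lives in cheaper bulk storage.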
Last question for you: You had mentioned early on that you presented at a meetup in London about your use of Cassandra. How do you find the Cassandra community over there in London? What about the IRC, the Apache Cassandra mailing list, things like that?
The community is amazing. I think it is one of the most active communities around, at least among the open source projects that I work with. It is very responsive and helpful. You can get a response from the committers or from experienced Cassandra users. I remember the very first Cassandra London meetup, with just a few guys in the pub. Today, there are typically more than 150 people trying to sign up.
That’s great to hear. And is there anything else that you would like to add, Rustam?
I am looking forward to a more stable version of 2.0. Also, there’s a proposal for counter improvements, which I hope we will see soon.
Great. In regards to stability, I think 2.0.1 is now out; what’s your timeline for taking a look at it?
In the next versions we plan to migrate to the new Java driver and binary protocol. We will also try to benefit from the new features that were added in 2.0. Triggers in particular look interesting.