Rohit Rai CEO and Founder at Tuplejump
We started Tuplejump, Inc. with the idea of democratizing (big)data science. Our goal is to build a platform that makes working with data (and specially big data) so easy that people who understand the domain and the data, should also be able to build data pipelines with almost minimal understanding of programming or infrastructure. My co-founder Satya (CTO) and I started Tuplejump over a year ago bringing together our experience in developing consumer and enterprise big data applications. In my current role as the CEO, I oversee marketing, sales, open source community outreach and try to write some code.
The Tuplejump Platform provides the infrastructure components, tools and blueprints to build big data powered applications. It can be thought of as a big data application server that provides infrastructural services for collecting, transforming, storing, analyzing and visualizing data.
Jumping to Cassandra
We first picked HDFS and HBase, as we had a lot of experience working with Hadoop. After a lot of efforts spent, we realized that the Ops were expensive and painful.
During the same time, based on clients’ future roadmaps and requirements we decided to switch from Hadoop to Spark. We then debated whether Hadoop clusters were really necessary to store data. We chose MongoDB but had problems with Ops again. We realized that we would not able to scale out our deployment efficiently.
We then analyzed various NoSQL solutions and brought down our choices to 2, Cassandra and Riak. We decided to go with Cassandra, as it was an Apache project and we felt that it had a more active community and was widely adopted at companies like Twitter, Linkedin, etc.
Serving customers across continents with a datacenter to match
The major reason we moved to Cassandra was linear scalability and simpler ops. When I say scalability I don’t mean just in terms of capacity, but also in performance. Being a horizontal platform we deal with a wide variety of use cases and work on both the read intensive and write intensive solutions.
For example in an IoT cloud service, we need very high write throughput for data coming from various devices. In the same service, when a user is browsing his device data we need low latency on reads. Cassandra comes in really handy with its high write throughput. We make sure that reads are done the “Cassandra way” and are hence very fast. We also make extensive use of counters and collections.
Since Cassandra is our only storage solution, we use it for structured data, un/semi-structured time-series and log data storage and also as a chunked file store sometimes. We love the fact that we don’t have to do anything hacky to support any of these different models.
Multi-datacenter deployment is crucial to us. When we are serving customers across continents, it is handy to have a datastore which can do the same.
Also, the way we can configure settings like replication available per keyspace or compression on column family, gives us the possibility of doing what is the best in a given scenario.
Cassandra, Akka, Spark, & Shark
We have been running with Cassandra in production for different clients with varied use cases for almost a year now, and we like what we are seeing. Scaling is linear and performance very good, but the most important thing is they are deterministic, no surprises!
We use Apache Cassandra, along with some extensions(SnackFS – a HDFS compatible file system on top of Cassandra and Stargate – A search server built into Cassandra) we have developed, as our data store for structured/unstructured data . Cassandra is used in the various steps in the flow.
1. Hydra is our high velocity message processing and ingestion framework. It uses Cassandra backed cache to store historical data required in various processing tasks. It also runs an Akka cluster over the Cassandra gossip cluster, eliminating the need of having multiple clusters. Cassandra can also be used by the application developers to provide durable mailboxes in case message delivery assurances are required.
2. The Calliope project started as a tool to provide easy to use APIs to consume data for batch processing in Spark. With Calliope and SnackFS we can store and consume any type of data (structures, unstructured and files) in Cassandra.
3. Cash (which is a part of Calliope) provides Shark integration with Cassandra, so you can use Shark to explore the data in Cassandra through ad-hoc hive queries. We will soon be releasing a Hive metastore implementation over Cassandra, so we won’t need a RDBMS to run Hive/Shark when using Tuplejump Platform.
4. Stargate is the search component in our platform and is used to index and search data in Cassandra. It is built into Cassandra and hence avoids the need for running a different Solr/Elasticsearch cluster.
5. When you have dimensional data, UberCube provides a distributed OLAP cube. The cubes are stored into Cassandra and can be queried in stored form.
6. Cassandra along with Spark also allows us to implement complex machine learning algorithms. Using Cassandra for intermediate and shared data storage along with Spark, we can implement systems like artificial neural networks, evolutionary algorithms and deep belief nets, which are very complex to implement over vanilla map/reduce systems.
6. Cassandra also serves as the store for our interactive data visualizations and dashboards, which can be updated in real-time from Hydra or as a result of batch processing from Spark.
7. Lastly we use Cassandra to store all our platform and application logs and metrics.
In terms of versions, our current client clusters are running Apache Cassandra 1.2.11 while we will be rolling out our next release on Apache Cassandra 2.0.x/2.1.x
Since our production clusters are hosted by us for clients, we can’t divulge exact information on them. But a typical setup runs 30+ Cassandra nodes as part of a 50+ Tuplejump Cluster. Most of the clusters are located within multiple DCs within USA, but a couple of them have geographically distributed setup using DCs in US, Europe, Asia and Australia.
We recommend running high end nodes, as in our setups we generally deploy Spark/Hydra server components on the same systems that are running Cassandra. So things start at 4 core, 16GB RAM with 2 disks and the largest nodes we have are at 16 cores/128 GB RAM using SSDs.
One of the first advice I give people who are just starting with Cassandra is to realize that it is not your father’s database. It is a very different creature. So spend time learning it, understanding how data modeling is done in Cassandra, its strengths and limitations as well the APIs. There are many people who have been burnt with a wrong data model.
I definitely recommend the Cassandra Conf videos and other videos at Planet Casssandra. Some must see videos are Patrick McFadin on Data Modeling (especially the Top Data Model ones), Ed Anuff presentations on UserGrid and indexing and Aaron Morton on Cassandra Internals.
We have found Cassandra community to be very active and helpful. Since the community sees many new adopters coming in, it is great that the attitude in the community is helping and answering queries with solutions rather than just saying RTFM. Though there are contradicting goals in any community, especially one this diverse, I have seen most of the issues/differences are resolved more or less amicably and in open forums, which is a good sign for a healthy community.
Tuplejump’s Cassandra projects
For everyone using Cassandra and specially if you use it with Hadoop, I would like to recommend give Spark/Shark + Cassandra a try, you will never go back to use Hadoop. I would like to invite the Cassandra community to see our presentation on Tuplejump Platform given in Dec 2013 at the Spark Summit and try out our open source libraries Calliope, Stargate and Cash and give us feedback to make them better.