Jim Zamata: Principal Software Engineer at Digital Reasoning
Christian Hasker: Editor at Planet Cassandra, A DataStax Community Service
TL;DR: Digital Reasoning has a machine learning platform called Synthesys that reads and analyzes human communication. This human communication data is typically unstructured and can come from any source. Synthesys can work on huge amounts of this data; they’ve scaled up to about 100 million input documents so far.
Initially, Digital Reasoning chose Cassandra for its ability to ingest data at fairly high speeds. Moving forward, they have customers who are very interested in its multi-datacenter capabilities.
The customers running Cassandra operate fairly small clusters, starting at around 20 nodes on spinning disks, which can grow as they add more data. On average, they initially process 1 million documents or fewer, then add more incrementally.
Hi, everyone. Today we have Jim Zamata, Principal Software Engineer at Digital Reasoning, discussing his use of Apache Cassandra; thanks for joining us today, Jim. Why don’t we start off with a little bit about your background and also what Digital Reasoning does.
I have a background of more than 20 years working with applications that were mainly server-side with a SQL backend. I started out with C++ and later moved to Java. I’ve been here at Digital Reasoning for about four years now. Around the time I first started at Digital Reasoning, we began exploring the idea of a distributed file system that could handle huge amounts of data.
When we surveyed the landscape at that time, Cassandra (which was in its 0.4 stage) seemed like the most promising of the technologies available. It had much higher write speeds than HBase; those were really the two that we had been researching.
Early on, we were primarily focused on getting data into the system. The query speeds were okay, and Cassandra seemed to have a good and responsive community behind it; it was very promising.
At Digital Reasoning, we have a machine learning platform called Synthesys that reads and analyzes unstructured data, or more specifically human communication. This data is typically in the form of documents, or unstructured data from any source. Synthesys can work on huge amounts of this data; we’ve scaled up to about 100 million input documents so far. It analyzes that data, extracts entities, and forms relationships between those entities. Those entities include people, organizations, events, and temporal and geospatial information. Then, it provides a way for users to further analyze that data or to query it out of the backend.
Is this a hosted solution, a SaaS solution? As a new user, how do I get my documents into Synthesys and start having Synthesys form these entity relationships?
All of our current customer deployments have their own clusters set up, so they’re running on their own data. That’s because many of our first customers were in the intelligence field, so they really wanted to have control over the data and the complete environment; we basically gave them the system to install and gave them support for running it.
Great. You talked a little bit about how you looked at Cassandra and adopted it at v0.4; you looked to HBase at that time as well, but Cassandra seemed to be best suited for what you needed. What were the sweet spots of Cassandra that attracted you to it?
At that time, one of the important things was the ability to ingest data at fairly high speeds. I’d say that moving forward, we have customers who are very interested in the multi-datacenter capabilities. That has actually become more important, especially since other ways to load data have become available.
You talked about how you package up Synthesys and your customers run it on-site, on their own hardware. What does a typical installation look like for you at a customer site?
The customers we have running Cassandra are running fairly small clusters, although those could grow as they add more data. On average, they initially process 1 million documents or fewer, then add more incrementally. The clusters might start at around 20 nodes. In terms of the disks, I think most of those are running spinning disks rather than SSDs.
You’ve been with Cassandra since the early days and you’ve seen it evolve very rapidly. What are some of the things you would like to see it doing in the future?
I’ve been working with 1.2 recently, and I’m very happy with it; the stability and performance have been very good. There have been a couple of things, some of which they’ve begun to address: for example, it was always difficult to add nodes to a cluster or remove them, and vnodes have helped with that.
Another big issue for us is related to bulk loading. The typical installation will ingest a large archive (which could potentially be tens of millions of documents or more) and then do incremental ingestions, typically using a bulk loader. So, the bulk loading support has improved a great deal over the years and increased in speed, but there are still some issues with it.
One of the problems we ran into recently is that secondary indexes are rebuilt automatically while you are streaming data in, and that turned out to be a point where it would significantly slow the process down. We had to basically drop the secondary indexes, turn off automatic compaction, bulk-load the data and then re-add the secondary indexes and turn compaction back on.
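The workaround described above can be sketched as a sequence of operational steps. This is illustrative only: the keyspace, table, index, and host names are hypothetical, the commands must run against a live cluster, and the exact compaction-threshold values to restore depend on your table settings.

```
# Sketch of the bulk-load workaround (hypothetical names: my_ks, docs, docs_author_idx)

# 1. Drop the secondary index so streaming isn't slowed by index rebuilds
cqlsh -e "DROP INDEX my_ks.docs_author_idx;"

# 2. Pause automatic compaction on the table (setting the thresholds to 0 0
#    disables it on many versions; verify the behavior on yours)
nodetool setcompactionthreshold my_ks docs 0 0

# 3. Bulk-load the pre-built SSTables with sstableloader
sstableloader -d node1.example.com /path/to/sstables/my_ks/docs

# 4. Re-create the index and restore the default compaction thresholds
cqlsh -e "CREATE INDEX docs_author_idx ON my_ks.docs (author);"
nodetool setcompactionthreshold my_ks docs 4 32
```

The re-created index is rebuilt once at the end, rather than incrementally during streaming, which is what avoids the slowdown described above.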
I did go ahead and create a JIRA ticket to provide a way to disable the automatic secondary indexing during a bulk load so you could defer it until after you were done.
Thank you very much for doing that. It sounds like you are involved in the Cassandra community, certainly through submitting a JIRA ticket, which we really appreciate. Jim, I have one last question for you. Do you have any pieces of advice for a newbie getting started with Cassandra?
I would say that early on, we spent a lot of time experimenting with different data models. My advice is that even though a NoSQL database is going to be very flexible, it still pays to really think about your data modeling up front. I would especially design with your queries in mind; that’s very important. There’s a tendency, especially if you’re coming from a SQL background, to design things in a somewhat abstract manner and think about normalizing the data, but in Cassandra it generally pays to denormalize around the queries you need to serve.
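The query-first modeling advice above can be illustrated with a small CQL sketch. The schema and column names here are hypothetical (not from Digital Reasoning's actual data model): the table is shaped around one specific query, "all mentions of an entity, newest first," rather than a normalized design.

```
-- Hypothetical table, designed around the query it must serve.
-- entity_id is the partition key; mentions cluster newest-first within it.
CREATE TABLE entity_mentions (
    entity_id    text,
    mentioned_at timestamp,
    document_id  text,
    snippet      text,
    PRIMARY KEY (entity_id, mentioned_at)
) WITH CLUSTERING ORDER BY (mentioned_at DESC);

-- The target query reads a single partition in clustering order,
-- with no joins and no secondary index needed:
SELECT document_id, snippet
FROM entity_mentions
WHERE entity_id = 'acme-corp'
LIMIT 20;
```

If a second access pattern is needed (say, mentions by document), the usual Cassandra approach is a second denormalized table keyed for that query, rather than normalizing and joining.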