I’m an Associate Professor in the Department of Computer Science here at CU Boulder, and my research areas are software engineering, web engineering, and software architecture. I’m the co-director of Project EPIC; EPIC stands for Empowering the Public with Information in Crisis (http://epic.cs.colorado.edu). Project EPIC holds one of the largest NSF grants (actually, the largest) in the area of what we call crisis informatics. We look at how people use social media during times of mass emergency.
In particular, what we do is collect lots and lots of data from Twitter, and we do some automated analysis of that Twitter data. We also have ethnographers and the like who go in and look at the conversations in the tweet stream, tweet by tweet, over thousands of tweets, to understand what people are doing online during and after those events.
For instance, we collected data during the 2010 Haiti earthquake, and one of the things we pulled out of our Twitter dataset was the birth of a group called Humanity Road. This was a purely digital volunteer group, consisting of people who came to the event remotely and wanted to help.
Through monitoring the Twitter stream, they found that there were people on the ground in Haiti who were reporting useful information, and the group of volunteers who eventually founded Humanity Road would go in and make sure that those people had minutes on their cell phones (so they could keep tweeting). They’d organize donation drives to purchase those minutes. They would grab information from the Red Cross and retweet it. They would also vet information; so if they saw something go by and it wasn’t confirmed, they’d try to confirm it, and then they’d post it far and wide.
What we do at Project EPIC is study people like that, to understand how we could make them more effective in future emergency response if we were to design and implement improved software services for collaborating and coordinating during mass emergency events.
At the start of an event (like the September 2013 Colorado floods), some of our team members will make use of services like TweetDeck to get a feel for the hashtags and keywords that people are using to discuss the event. We then make use of our Twitter data collection infrastructure to submit those keywords to the Twitter Streaming API. We take those tweets and put them into an in-memory queue which is processed by a set of filters to extract various pieces of information about the tweets to help us keep track of the collection. The last filter in the chain has the job of making the tweet persistent by storing it in Cassandra. We have a four-node Cassandra cluster running 24/7 waiting to store those tweets.
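The queue-and-filter structure described above can be sketched in a few lines of Python. This is a minimal illustration, not Project EPIC's actual code: the filter names, the single-threaded drain loop, and the list standing in for Cassandra are all assumptions made for the example.

```python
import json
import queue

class KeywordTagFilter:
    """Illustrative filter: tags each tweet with the collection keywords it matched."""
    def __init__(self, keywords):
        self.keywords = keywords

    def process(self, tweet):
        text = tweet.get("text", "").lower()
        tweet["matched_keywords"] = [k for k in self.keywords if k in text]
        return tweet

class PersistFilter:
    """Last filter in the chain: makes the tweet persistent.
    Here it appends JSON to an in-memory list; the real system writes to Cassandra."""
    def __init__(self, store):
        self.store = store

    def process(self, tweet):
        self.store.append(json.dumps(tweet))
        return tweet

def run_pipeline(incoming, filters):
    # Drain the in-memory queue, passing each tweet through every filter in order.
    # (A production consumer would block on the queue rather than poll empty().)
    while not incoming.empty():
        tweet = incoming.get()
        for f in filters:
            tweet = f.process(tweet)

store = []
q = queue.Queue()
q.put({"id": 1, "text": "Flooding reported in Boulder #coflood"})
run_pipeline(q, [KeywordTagFilter(["#coflood", "flood"]), PersistFilter(store)])
```

The key property the sketch preserves is that persistence is just the final filter in the chain, so new analysis filters can be added upstream without touching the storage step.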
We can create a new event whenever it’s needed, and as the tweets come streaming out of the streaming API they go straight into Cassandra, which then guarantees that we’re never going to lose a tweet. We’ll always be able to retrieve it later on. Depending on the event, we see everything from one tweet a second (for small events) up to 100 tweets per second for events like the 2011 Japan Earthquake or the 2012 London Olympics. 100 tweets per second works out to 8.64M tweets per day and with Cassandra we never miss getting those tweets persisted for later analysis.
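The daily figure quoted above is just the per-second rate multiplied out over a day:

```python
tweets_per_second = 100
seconds_per_day = 60 * 60 * 24          # 86,400 seconds in a day
tweets_per_day = tweets_per_second * seconds_per_day
print(tweets_per_day)                    # 8640000, i.e. the 8.64M tweets per day quoted above
```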
Well, when we started on our Twitter data collection infrastructure back in January 2010, we made use of a traditional n-layer web application architecture, with separate layers for apps, services, persistence, and the like. We started with MySQL as the initial data store and used Spring’s integration with Hibernate to store and access data in it. It was familiar, easy to use, and with SQL you can, of course, do lots of arbitrary queries on the data long after storing it, but during the 2011 Japan earthquake, our infrastructure based on MySQL finally fell over.
In that event, we really did see hundreds of tweets per second for sustained periods, and MySQL could not clear the in-memory queue fast enough. We sat there watching the in-memory queue grow to several tens of thousands of tweets, worried that if we had a power outage, or if any other problem kicked in, we’d lose all that data.
At that time, my grad student, Aaron Schram, was doing a deep dive into the NoSQL technology space and found, roughly in mid-2011, that there was a lot of energy around Cassandra and its community, so we decided to give it a try. We looked in particular at the characteristics that we needed. We needed something that was always available: at any given moment, there had to be a way to write a tweet. We also wanted to make sure we’d never lose that tweet, so we needed replication once a tweet was successfully stored. And we needed something that was scalable: large-scale events generate tens of millions to hundreds of millions of tweets, and we never want to delete that data since there’s no way to access it again via the Twitter search API. The sizes of our data sets are going to do nothing but grow, and so Cassandra’s horizontal scalability was exactly what we needed.
In addition, we needed flexibility to store our data without worrying about its structure. Twitter changes the metadata that is associated with a tweet frequently and we didn’t want to be in a situation where we had to deal with schema migration each time the structure of a tweet changed. With Cassandra, we can just store the whole tweet object and deal with the structural changes of those objects over time during the analysis stage.
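One way to picture that flexibility is to store each tweet as the raw JSON string Twitter sent, and only interpret its structure at analysis time by reading fields defensively. The sketch below is a simplified stand-in (a plain dictionary rather than a Cassandra table, and hypothetical event names), but the pattern is the same: no schema migration is needed when Twitter adds or drops metadata.

```python
import json

# Collection time: persist the raw tweet JSON exactly as received, so a
# metadata change on Twitter's side never forces a schema migration.
raw_store = {}

def store_tweet(event, raw_json):
    tweet_id = json.loads(raw_json)["id"]
    raw_store[(event, tweet_id)] = raw_json

# Analysis time: parse and read fields defensively, since tweets collected
# in different years may carry different metadata fields.
def geo_of(event, tweet_id):
    tweet = json.loads(raw_store[(event, tweet_id)])
    return tweet.get("coordinates")  # may simply be absent in older payloads

store_tweet("haiti_2010", '{"id": 42, "text": "Need water in Port-au-Prince"}')
store_tweet("co_floods_2013",
            '{"id": 43, "text": "Boulder Creek rising", "coordinates": [-105.27, 40.01]}')
```

Structural drift between the two stored tweets is absorbed entirely at read time; the storage layer never had to know that `coordinates` exists.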
After looking at all the different technologies out there, Cassandra bubbled to the top in terms of having exactly the characteristics we could depend on. Now, we knew that it wasn’t as strong a fit for arbitrary queries after the fact, and so we have been looking at various ways to do analysis with other tools, although we do want to look at using, for instance, DataStax Enterprise with its support for Solr and Hadoop as a way to apply arbitrary queries over the data we have stored in Cassandra. However, as the place to store all of our tweets and guarantee we would never lose anything, Cassandra is the best technology out there.
For us, it really was matching those characteristics of availability, replication, scalability, and flexibility, and finding the right tool for the job. We have hundreds of events stored, consuming terabytes of data, and we know that we’re never going to lose that data; if we ever need more space, we can just add additional nodes to our cluster.
We also like that we have the ability to play with the structure of our cluster. Right now, our four nodes are physical machines sitting in a server rack here at CU. But, in the future, we may want to add additional servers by spinning up virtualized instances on Amazon’s EC2 and then use Cassandra’s data center capabilities to share and integrate those nodes into our existing cluster.
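In Cassandra, that kind of multi-data-center layout is expressed at the keyspace level via `NetworkTopologyStrategy`, which lets you set a replica count per data center. The snippet below is a hypothetical example; the keyspace name, data center names, and replica counts are illustrative, not our actual configuration.

```sql
-- Hypothetical: 3 replicas in the on-campus rack, 2 in an EC2 data center.
CREATE KEYSPACE epic_tweets
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'CU_Boulder': 3,
    'EC2_us_east': 2
  };
```

With a setup like this, newly added EC2 nodes would join the cluster under their own data center name and begin taking replicas without changes to the application layer.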
We think Cassandra’s a great choice for storing and managing data, and it’s helped us reach a point where we can now focus almost exclusively on designing and developing the analytics infrastructure that we will layer on top of our existing Cassandra-based data collection infrastructure.