Alex Trull: Cloud Engineer at Medidata Solutions
Christian Hasker: Editor at Planet Cassandra, a DataStax Community Service
TL;DR: Medidata is the leading cloud-based clinical trials management platform working with pharmaceutical companies and research organizations to bring life-saving drugs to market quicker.
Medidata Solutions chose Cassandra to develop its audit service so that platforms managing various aspects of trials can then report on any operation that matters: from a user logging in, to reports on data from a subject, to some key information about a subject that might have changed.
For Medidata’s audit service, the technical aspects that made Cassandra a good fit were its good write performance and its distributed scalability. Cassandra’s gradual scalability was very important to the selection because Medidata intends to collect a large amount of data over a long period of time.
Alex, thanks for taking the time to talk about your Apache Cassandra use case. Could you please tell us a little about what Medidata does and what your role is?
Medidata is the leading cloud-based clinical trials management platform and we’re very successful in clinical trial circles. Effectively, our clients are both the pharmaceutical companies and research organizations that they work with. To give you some idea of the kind of scope of what that means, we’ve got clinical trials running on our platform all over the world.
Our mission statement might be summed up as helping our clients bring life-saving and life-enabling drugs to market more quickly and safely than they could if they were designing their own platforms to manage this work. I’m a cloud engineer and I’ve been working with Medidata for about a year and a half.
Great! So data integrity and data analysis sound very, very important. Could you talk a little bit about the platform and the role that Apache Cassandra plays in the stack?
We chose Cassandra because we’re looking to deploy a new audit service. Now, as you can probably imagine, the audit trail is one of the most valuable things in the entire business, not just to our clients but also to those who regulate our industry. Effectively, the audit trail is an artifact of proof that a trial has been properly conducted according to the standards (also known as the protocol). We’re developing this audit service so that our platforms managing various aspects of trials can then report on any operation that matters to us: from a user logging in, to reports on data from a subject, to some key information about a subject that might have changed. Anything like this would be considered auditable data, so it needs to be collected and kept in such a way that it’s queryable at a later date by researchers and investigators. It’s critical that the audit data be available.
Can you also talk about how your team determined Apache Cassandra was the right fit for this project?
Interestingly, I wasn’t involved in the selection process, but I have been involved quite heavily in the implementation. What I can say is that between my own research into Cassandra and the original research done by colleagues, we found it has a good level of maturity. Technical aspects that made Cassandra a good fit were its good write performance and its distributed scalability. In particular, because we intend to collect a large amount of data over a long period of time, the gradual scalability of Cassandra is very important to us, as are capabilities like ring doubling to redistribute the data.
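As an illustration of the ring doubling mentioned above, here is a minimal sketch in Python. It is not Medidata's or Priam's tooling. It assumes a single token per node and a RandomPartitioner-style token space of 0 to 2**127, and the function names `initial_tokens` and `double_ring` are hypothetical. The idea is that when a ring is doubled, existing nodes keep their tokens and each new node is placed at the midpoint of an existing range, so every original node hands off roughly half of its data.

```python
# Hypothetical sketch of ring doubling; assumes one token per node and a
# RandomPartitioner-style token space. Not taken from Priam or Cassandra.
RING_SIZE = 2 ** 127  # assumed size of the token space


def initial_tokens(node_count):
    """Return evenly spaced tokens for a ring of node_count nodes."""
    return [i * RING_SIZE // node_count for i in range(node_count)]


def double_ring(tokens):
    """Return the tokens for a ring with twice as many nodes.

    Existing tokens are kept unchanged, and one new token is added at the
    midpoint of each range between neighbouring tokens (wrapping around).
    """
    ordered = sorted(tokens)
    new_tokens = []
    for i, start in enumerate(ordered):
        end = ordered[(i + 1) % len(ordered)]
        # Width of the range from start to end, wrapping around the ring.
        # A single-node ring spans the whole token space.
        span = (end - start) % RING_SIZE or RING_SIZE
        new_tokens.append((start + span // 2) % RING_SIZE)
    return sorted(ordered + new_tokens)


if __name__ == "__main__":
    four = initial_tokens(4)
    eight = double_ring(four)
    for token in eight:
        marker = "existing" if token in four else "new"
        print(f"{token:>40d}  {marker}")
```

Because the original tokens never move, doubling only streams data from each existing node to one new neighbour, which is part of what makes this style of growth gradual.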
What was your database background before Cassandra?
Previously I’ve mostly worked with MySQL, and I have some NoSQL experience with CouchDB, which wasn’t overly impressive.
So coming from MySQL how did you find the transition to Cassandra? Anything you would have done differently now, having learned more about working with it?
I would say it requires a paradigm shift in the concept of what a database server is. Really, you have to start looking at a dataset that crosses boundaries that nobody typically thinks about, and, particularly in the cloud, at the redundancy of such services. Cassandra offers the lowest redundancy that most developers would need to consider. So I think a certain amount of architectural thought needs to go into your actual key tables, these sorts of things. I don’t speak for all developers, but in my experience, developers generally don’t worry too much about databases. But I think we all have to become slightly more architecturally aware with implementation.
Any tips you would like to offer?
In terms of tips, I think specifically for cloud implementation, we actually follow Netflix very closely. We’ve been very impressed by Netflix because of the scale they’ve reached, such as clusters of 288 nodes. That’s very impressive and they must be doing something right to manage that process. They developed Priam to take care of various operational management tasks for Cassandra. We watch their steps and we tend to implement their services pretty quickly. We also pay attention to some of the other open source contributors.
This has been great. Is there anything else you would like to share with the community?
One item for the community is that we’ve tested a Chef-driven deployment of a Priam-managed Cassandra cluster in Amazon EC2 that had sixteen nodes, two in each of eight Amazon regions, and it worked very well. We’ve also published a cookbook to deploy OpsCenter for Cassandra in Amazon EC2. We’d love to hear back if anyone else uses the code we put out. Both cookbooks are on the Opscode Community website and support multi-region as well as single-region deployments.
A picture of OpsCenter is worth at least 16 nodes. Only a minimum of manual intervention was necessary to deploy this: https://twitter.com/AlexanderTrull/status/388721231312060416/photo/1/large
Links to the cookbooks:
http://community.opscode.com/cookbooks/cassandra-priam (deploys Cassandra and Netflix’s Priam)
http://community.opscode.com/cookbooks/cassandra-opscenter (deploys OpsCenter)
I think the more people publish their methods of doing standard things, the more everyone else gets to enjoy life and work on new things or on improving processes, instead of running into the same problems again and again in different companies. It would just be a general improvement. It’s more of an open source attitude towards business, I think, and that’s something that we’re trying to push. This is the way of the future and the sooner we get there, the better.