John Riordan, CTO at OnSIP
John, thanks for joining us today. Could you briefly describe your role at OnSIP and what OnSIP does?
I am the Chief Technology Officer at OnSIP. My role at OnSIP is to run the engineering group, including research, development, design, implementation, and operations of the service. I’m also one of the co-founders of the company, so I’ve been around since the beginning.
What OnSIP provides is a business phone service for businesses of all sizes, along with a suite of communication services; these include instant messaging, presence, voice, and video services, almost exclusively based on the SIP protocol, which is where our name comes from.
So why the move to Cassandra? What were you on before, what’s your background, did you use anything else alongside it?
The move to Cassandra was made because of the desire to horizontally distribute our platform. Performance and price played a huge factor as well; delivering a higher quality service at a lower cost to our customers was key. It’s an economic driver at the end of the day.
A little bit of our background story: the short of it is, we still make use of MySQL. All of our data was largely stored in MySQL, and when we originally rolled out our systems they were driven by data that was stored there. However, some aspects of our system weren’t a fit for MySQL; we wanted to be able to operate our platform in multiple locations simultaneously without having to be tied to a single MySQL instance.
For example, we have a data center here in New York and we also have a data center in LA; we have customers who are located in California and they route all their traffic through the Los Angeles data center. We also needed to be able to serve customers through New York and talk to customers through the New York data center, so the information needs to flow between both data centers. The location of everyone’s phone is stored in a database; their phones connect to us and store their location information in our database. This is a phone registration. When other people want to reach those phones, they need to do a database lookup to retrieve the location information of the phone they are trying to reach.
In the original design, we were storing it in MySQL, in a single master server that was located in the New York data center. Doing so created a host of problems, particularly: if there’s a network partition, the LA users can no longer get their locations updated, and basically the LA users become completely dependent on the New York data center. Then this problem just expands as we add data centers in new locations. Cassandra helped us solve this problem.
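The register-then-look-up pattern described here can be sketched in a few lines of Python. This is only an illustration, not OnSIP's implementation: the `LocationStore` class, its method names, the example address, and the idea of giving each registration a time-to-live are all assumptions made for the sketch.

```python
import time
from dataclasses import dataclass


@dataclass
class Registration:
    contact: str       # where the phone can currently be reached
    expires_at: float  # absolute expiry time, in seconds


class LocationStore:
    """Toy in-memory stand-in for the registration database (hypothetical)."""

    def __init__(self):
        self._by_user = {}

    def register(self, user, contact, ttl_seconds, now=None):
        """Record (or refresh) where a user's phone can be reached."""
        now = time.time() if now is None else now
        self._by_user[user] = Registration(contact, now + ttl_seconds)

    def lookup(self, user, now=None):
        """Return the user's current contact, or None if unknown or expired."""
        now = time.time() if now is None else now
        reg = self._by_user.get(user)
        if reg is None or reg.expires_at <= now:
            return None
        return reg.contact


store = LocationStore()
store.register("alice@example.com", "sip:alice@198.51.100.7:5060",
               ttl_seconds=3600, now=0)
print(store.lookup("alice@example.com", now=10))    # still registered
print(store.lookup("alice@example.com", now=7200))  # registration has expired
```

Every incoming call depends on `lookup`, which is why the placement of this one table matters so much for availability.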
How many data centers are you in now? You mentioned LA and New York. Are you expanding to any others?
We are currently working on expanding into Miami, and then we have plans to continue onward. That sort of expansion plan is what drove our consideration to move to Cassandra, so that we can continue to expand out by just opening more data centers and plugging in. We have a Cassandra ring running between New York and LA. We actually have two data centers in New York, so there are three data centers total with a Cassandra ring that runs between all three of them. We’re writing location information locally, and the Cassandra ring takes care of distributing the information around. This works really well for us because in the event that our LA data center is disrupted, the other data centers take over and the LA customers can continue to call each other… there’s no lock-up.
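The idea of writing locally and letting the ring spread the data can be illustrated with a toy model. The sketch below is not how Cassandra replicates internally; it assumes a simplified pull-based sync and a last-write-wins rule keyed on a timestamp, and the `Replica` class and the site names are hypothetical.

```python
class Replica:
    """One data center's copy of the location table (toy model)."""

    def __init__(self, name):
        self.name = name
        self.rows = {}  # user -> (timestamp, contact)

    def write_local(self, user, contact, timestamp):
        """Accept a write locally, without contacting any other site."""
        self._apply(user, timestamp, contact)

    def read_local(self, user):
        """Return the locally known contact, or None if there is none."""
        row = self.rows.get(user)
        return row[1] if row else None

    def _apply(self, user, timestamp, contact):
        current = self.rows.get(user)
        # Last write wins: keep whichever value has the newer timestamp.
        if current is None or timestamp > current[0]:
            self.rows[user] = (timestamp, contact)

    def sync_from(self, other):
        """Pull every row from another replica (run when the link is up)."""
        for user, (timestamp, contact) in other.rows.items():
            self._apply(user, timestamp, contact)


ny1, ny2, la = Replica("ny1"), Replica("ny2"), Replica("la")

# LA is cut off from New York, but still accepts a registration locally.
la.write_local("alice", "sip:alice@198.51.100.7", timestamp=1)
print(la.read_local("alice"))   # LA callers can already reach alice
print(ny1.read_local("alice"))  # New York has not heard about it yet

# The link comes back and the New York sites pull LA's rows.
for replica in (ny1, ny2):
    replica.sync_from(la)
print(ny1.read_local("alice"))  # now visible in New York too
```

The point of the sketch is that no write ever waits on a remote site, so a partition delays visibility elsewhere but does not block local registrations or local calls.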
Are you running on your own hardware or are you in the cloud?
It’s critical for us to be able to run and operate the network at a level where we aren’t dropping any packets. Dropped packets and jitter add to call quality issues, so we made a decision early on to build our own data center. The data centers we’re in are at 60 Hudson Street and One Wilshire Street; they’re telecom hotels and some of the major telecommunication hubs for interconnection between telephone carriers and data carriers.
We’re co-located in these facilities, so we purchase power, racks, cooling, and internet connections but we own and operate our own network equipment and hardware; we’re a Juniper shop on the networking side and the rest of our system’s built on Dell equipment. We run on CentOS and we’re heavy users of virtualization. On top of that, we make use of Puppet for configuration management. That’s sort of a short rundown of the underlying system.
What else did you look at besides Cassandra?
Well, we looked at lots of things. We didn’t have anything particular in mind when we set out, although we knew what problem we wanted to solve. We ended up doing a lot of research, and some of what moved us forward was our research into Amazon’s architecture. There are a lot of good papers written on that: the whole notion of distributed data and things like the CAP theorem.
We went through a number of these services, but ultimately what we were looking for was a service that allowed us to sacrifice consistency for availability; something that would allow us to continue to operate in the case of a network partition. That largely determined the class of platforms we were looking at, and whether a platform fit our needs was a big factor in driving our selection.
Specifically for our use case, imagine we have a phone in LA registering a location with the LA system. It doesn’t really matter whether there’s any sort of instantaneous update with respect to what the people in New York see; if it takes some time before some person in New York can reach that phone, that’s okay.
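The trade-off between consistency and availability can be made concrete with the standard quorum overlap rule: with N replicas, a read is guaranteed to see the latest write only if the number of replicas that must acknowledge a read (R) plus the number that must acknowledge a write (W) exceeds N. The sketch below is plain arithmetic; the replica count and acknowledgement settings are hypothetical, since the interview does not say what OnSIP actually configures.

```python
def reads_see_latest_write(replicas, write_acks, read_acks):
    """True when every read set must overlap every write set (R + W > N)."""
    return read_acks + write_acks > replicas


def tolerated_failures(replicas, acks_required):
    """How many replicas can be unreachable while requests still succeed."""
    return replicas - acks_required


N = 3  # hypothetical: one copy in each of three data centers

# Favor availability: one acknowledgement is enough for writes and reads.
print(reads_see_latest_write(N, write_acks=1, read_acks=1))  # False
print(tolerated_failures(N, acks_required=1))                # 2

# Favor consistency: require a majority for both writes and reads.
print(reads_see_latest_write(N, write_acks=2, read_acks=2))  # True
print(tolerated_failures(N, acks_required=2))                # 1
```

With a single acknowledgement, a site that is cut off from the other two can keep serving requests, at the cost that a reader elsewhere may briefly see an older location, which is the kind of delay described above as acceptable.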
We ended up with Cassandra pretty quickly after doing our homework. Once we figured out what it was we needed and the type of platform that would solve our problems, it became an easy choice for us. We were also looking at open source and happy to get in there. So that’s where we ended up, and we’ve been happy.
Then the last thing is, how was adopting Cassandra? Is there anything you would have done differently, data models, things like that?
Yes, we’re up to speed now. One of the things that slowed the process down for us was the column-oriented data structure; specifically, the naming was difficult. I think a lot of people have preconceptions about what a particular word means and whatnot.
The hard part was just being able to map the naming of some of the concepts to what was actually going on once we published this thing. Once we got comfortable with that, it was not a problem at all. The structure makes a lot of sense and works well for us, but literally the naming of things was a stumbling block for a little bit, just getting through the concepts.