Niraj Katwala: CTO at Healthline Networks
Christian Hasker: Editor at Planet Cassandra, A DataStax Community Service
TL;DR: Healthline Networks has two lines of business. One is Healthline.com, a public-facing consumer website for health information. The other serves enterprise customers through its Search and Taxonomy Platform.
Healthline Networks uses Cassandra primarily for its search engine, as the system of record for all of the documents that it indexes. The company needed a system that would support very fast writes and scale across multiple indexing nodes.
They have two data centers, one in San Francisco and one in New Jersey. In their pipeline, they have ~20 indexing machines serving a node balance of 5 nodes per cluster on the front end. The index size is approximately 400 gigabytes and they index close to 400 million documents. These are all healthcare-related documents. Healthline relies on SSDs and 20GB RAM machines with about 100GB of disk space.
Hello, everyone. I’m here today with Niraj Katwala from Healthline Networks. Niraj, can you tell us a little bit about what Healthline Networks does and your role?
Healthline Networks is an 11-year-old company; its previous incarnation was called InterMap Systems, and in 2005 it was rebranded as Healthline Networks. We have two lines of business. One is Healthline.com, a public-facing consumer website for health information.
The other is our enterprise business, where we serve customers through our Search and Taxonomy Platform.
Major customers include Aetna, USC, Yahoo! Health (which is entirely run by Healthline), Reed Elsevier (the largest publisher of medical information in the world), and GE. We are venture funded, with around 125 people in our San Francisco offices and in the southwest region of New York. We also have some assistance in India.
Great. What’s your role there, Niraj?
I am the Chief Technology Officer and I am also the General Manager for our enterprise licensing business.
Great. Niraj, I have a two-part question for you: Why Cassandra? As an 11-year-old company now, I imagine you started off building your platform on something other than Cassandra. What was that and why the transition to Cassandra?
We’re using Cassandra primarily for our search engine, basically as our system of record for all the documents that we index.
We needed a system that would support very fast writes and scale across multiple indexing nodes. We transitioned to Cassandra around three years ago, and at that point there weren’t that many other systems available. HBase was the other one, which we evaluated at that time. Speed, scalability and distribution across multiple indexing nodes are what we wanted Cassandra for.
Were you on a relational technology before that?
Primarily, yes, relational and file based; we did not have anything like Cassandra available at that point.
Then when it comes to the search technology on top of Cassandra, is that something you’ve written yourselves or are you using something from the open source community, such as Solr or ElasticSearch?
We do use Solr, but the core algorithms, which map the content to the taxonomy that we have built, are entirely built by Healthline.
Perfect. You talked about indexing on the nodes; can you talk a little bit about what your environment looks like there?
For a couple of clients we are using the Amazon Cloud, where we have installed our entire pipeline. For our own internal purposes such as Healthline.com and the majority of our clients, we use our own data centers.
We have two data centers, one in San Francisco and one in New Jersey. In our pipeline we have ~20 indexing machines serving a node balance of five nodes per cluster on the front end. Our index size is approximately 400 gigabytes and we index close to 400 million documents. These are all healthcare-related documents. We definitely use SSDs and 20GB RAM machines with about 100GB of disk space.
Great. You’ve been using Cassandra now for three years; I’m sure you’ve seen it mature very rapidly. For those who are new to Cassandra, are there any tips or pieces of advice that you would pass along to them?
I feel that the community is very strong, and if you have questions, you can easily figure out most of them through the documentation and the Apache Cassandra mailing list. I would recommend that everyone contribute to the Apache Gora Project; one of our engineers commits to the Gora Project, and that has been very valuable. A video from Cassandra Summit 2013 discussing Apache Gora is titled “Taking Bytes from Cassandra Clients.”
Besides that, one thing I would really encourage everybody to do is get to know the APIs very well. If you know the APIs very well, know the documentation and actively participate on the mailing list, you should be more than fine with using Cassandra. Additionally, purely in the interest of enhancement, think about what it is that you need to do to improve visibility inside the database.
It’s still probably mostly shell script based, but any additional input you can contribute as a developer to help figure out what goes on inside Cassandra would really make everyone’s lives much easier as well.