The New York Times is the premier news organization and has a worldwide online presence.
I am the systems architect of the nyt⨍aбrik platform that supports the online functions.
nyt⨍aбrik is a “chat” system for things — our client and partner devices, our systems and services.
It is simple and tries to do only a few tasks very well — connect millions of devices to our services, route billions of messages quickly and efficiently, remember every message. Cassandra is the global caching layer for nyt⨍aбrik.
If a service wants to deliver a message to a user device that is not connected, we cache the message for later delivery. Many clients want to receive the latest versions of certain kinds of messages (“breaking news”, for example) whenever they connect. These messages are cached as well and served on connection according to client preferences, which are also cached. The cache is also useful for analyzing the messages that flow through nyt⨍aбrik, and we often want to retrieve a particular message or see who read it.
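The connect-time behaviour described above can be sketched in a few lines of Python. This is an in-memory toy under stated assumptions, not the nyt⨍aбrik schema or a Cassandra client: the class `MessageCache`, its method names, and the `delivered` flag are hypothetical, and the three-day expiry stands in for a per-message TTL. Time is passed in explicitly so the behaviour is easy to follow.

```python
from collections import defaultdict

DEFAULT_TTL_SECONDS = 3 * 24 * 3600  # assumed default expiry: three days


class MessageCache:
    """Toy in-memory stand-in for the cache (hypothetical names throughout)."""

    def __init__(self):
        # device_id -> list of entries waiting for that device
        self._pending = defaultdict(list)
        # message kind (e.g. "breaking") -> most recent entry of that kind
        self._latest = {}

    def store_for_device(self, device_id, message, now, ttl=DEFAULT_TTL_SECONDS):
        """Cache a message for a device that is not currently connected."""
        self._pending[device_id].append(
            {"message": message, "expires_at": now + ttl, "delivered": False}
        )

    def store_latest(self, kind, message, now, ttl=DEFAULT_TTL_SECONDS):
        """Remember the latest message of a given kind."""
        self._latest[kind] = {"message": message, "expires_at": now + ttl}

    def on_connect(self, device_id, preferred_kinds, now):
        """Return undelivered, unexpired messages for the device, followed by
        the latest unexpired message of each kind the client asked for."""
        out = []
        for entry in self._pending[device_id]:
            if not entry["delivered"] and entry["expires_at"] > now:
                entry["delivered"] = True  # mark as delivered; nothing is deleted
                out.append(entry["message"])
        for kind in preferred_kinds:
            entry = self._latest.get(kind)
            if entry is not None and entry["expires_at"] > now:
                out.append(entry["message"])
        return out
```

Entries are only marked as delivered and left to expire rather than removed, which keeps the sketch free of explicit deletes; a real cache would rely on the store's own expiry mechanism instead of checking `expires_at` by hand.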
Simplicity has helped us make nyt⨍aбrik global, reliable, fast, and efficient. It scales up and down on a minute’s notice to meet demand.
Open source, multi-region support, scalability, and reliability/availability were the primary criteria. We originally used DynamoDB, but moved off it because it is not multi-region. Riak similarly does not have an open source multi-region capability. We also liked the look of CQL3 and the asynchronous protocol.
Our volumes vary widely, which requires scalability. The news must get out, which requires consistent speed and reliability/availability. Our messaging architecture is flat and wide, and we wanted a cache to match.
We use Cassandra 2.0.6. We currently run a small cluster, as small as we can get away with: 12 nodes in production across 6 zones in 2 AWS regions, Oregon and Dublin. Our volumes through nyt⨍aбrik are small: 10 to 100 million messages per day. The messages are small too, typically 1 to 5 KB; large message bodies are pushed to S3 / CloudFront and passed by reference. All messages have a TTL (3 days by default), and we never do explicit deletes. We will be adding more regions, and as we start gathering events and providing more client messaging services, volumes will grow rapidly.
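To put those figures in perspective, here is a back-of-the-envelope calculation in Python. It is a rough sketch only: the function `estimate` is hypothetical, the replication factor of 3 is an assumption (the text does not state one), and compression, indexes, and other storage overhead are ignored.

```python
SECONDS_PER_DAY = 24 * 3600


def estimate(messages_per_day, message_kb, ttl_days=3, replication_factor=3):
    """Rough average throughput and live data volume.

    ttl_days follows the 3-day default TTL; replication_factor is an
    assumption made for illustration only.
    """
    avg_per_second = messages_per_day / SECONDS_PER_DAY
    live_messages = messages_per_day * ttl_days      # messages not yet expired
    raw_gb = live_messages * message_kb / 1_000_000  # KB -> GB (decimal units)
    return {
        "avg_msgs_per_second": avg_per_second,
        "live_messages": live_messages,
        "raw_gb": raw_gb,
        "replicated_gb": raw_gb * replication_factor,
    }


low = estimate(10_000_000, 1)    # quiet day, small messages
high = estimate(100_000_000, 5)  # busy day, larger messages

print(round(low["avg_msgs_per_second"]), low["raw_gb"])    # 116 30.0
print(round(high["avg_msgs_per_second"]), high["raw_gb"])  # 1157 1500.0
```

Under these assumptions the live data set ranges from roughly 30 GB to 1.5 TB before replication, and the average rate from about 116 to about 1,157 messages per second. Peaks will be well above the average, since volumes vary widely.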
Start with the latest release of Cassandra 2, which is sufficiently stable now, and use CQL3. Try not to get confused by the old terminology many people still use. Learn the physical structures, the read path, the write path, and so on, so you can design high-performing tables that support your operations consistently and well.
The community has been good. Very active; very responsive. Take the time to read the JIRA issues so you know what “features” to avoid for now.
P.S. Open-source is great.