Joshua Galbraith, Co-Founder and Principal Architect for the Seastar data platform
Joshua Galbraith (@jtgalbraith) is the co-founder and principal architect for the Seastar data platform. He holds a B.S. in mathematics and is continuing his education in statistics and data visualization. He is active in the data science community, and has been contributing to data-intensive applications in academic research, private companies, and community projects over the past 7 years.
Seastar is a managed platform for Apache Cassandra that spans hardware infrastructure, a hosting environment, a self-service API and dashboard, and a support team. It’s everything you need to get clusters up and running quickly and cost-effectively. This platform is being built by a team of engineers at Network Redux who have decades of combined experience in the enterprise hosting industry and have supported a wide variety of database deployments.
Why we’re building Seastar
My team decided early in the product development lifecycle that we wanted to build something that would allow companies to manage growing data sets effectively. Part of this task relies on first choosing, and then managing, a database effectively. We have seen many companies build out the same essential operational components around their data-driven applications. Time that could have been spent on domain-specific tasks was lost to building and deploying domain-independent services that process, store, and enable the analysis of data. Services for log aggregation, metrics collection and display, and backups are often required. Implementing these systems is not a trivial task. To further add to the difficulty of standing up these support services, they need to be present from the beginning, a time when knowledge of a database’s intricacies may still be nascent on many teams. Our goal is to free teams from the burden of building and managing redundant data infrastructure so they can start making use of their data sooner.
I have been following Cassandra’s development from the sidelines since version 0.6 and had spent some time at a previous company learning about and operating Cassandra. Coming from an RDBMS background, I admired that things like data distribution and replication are first-class citizens in the architecture. Cassandra was a young project back then, but it was getting a lot of things right from the beginning. Several years later at Seastar, we considered building around other databases. However, we felt that Cassandra was the best fit as it offers a shared-nothing architecture, scalable performance, and a high-level query language. We also liked that Cassandra doesn’t require operating an additional distributed system for communication. The quality of documentation and the abundance of support channels available to Cassandra users was also important.
Provisioning for Cassandra
Over the past year we have spoken with hardware vendors, operators running a variety of Cassandra deployments, and have drawn on our own experiences to find hardware that will complement Cassandra workloads. Specifically, we’ve sought out fast CPU clocks, SSDs with longer life expectancies and excellent write performance, and data centers with single digit millisecond latency to AWS and other major public cloud providers.
We are writing services that enable automated deployment of Cassandra nodes running in Linux containers. These containers will be mapped to dedicated disks, run a minimal number of processes, and have no shell access. Communication between nodes will be encrypted with strong SSL ciphers. Encrypting communication adds a small amount of overhead, but by using containers and avoiding the hypervisor found in a typical cloud environment, users will see a net performance gain.
Our backend services are primarily written in Go and will enable automated provisioning, log retrieval, metrics retrieval, and other critical functionality like backup and repair. This is a massive undertaking, but we believe that the automation of these common tasks is key to operating Cassandra successfully at scale. Administrative functionality, log messages, and metrics data will be exposed through a public API that will also be consumed and presented by our dashboard.
The Seastar Dashboard
Our dashboard is written in EmberJS and provides a beautiful and simple interface on top of our public API. We’ve thought carefully about creating views that support common workflows for database administrators and developers using Cassandra such as adding capacity and monitoring. When a problem arises, the dashboard can suggest how to resolve it or let you open a support request with one of our engineers that automatically includes pointers to help us quickly understand the issue.
On Metrics and Alerting
We are applying some past, painful experiences with busy dashboards and noisy alerting systems to create a service that emphasizes meaningful metrics and actionable alerts specific to Cassandra. An example of an unhelpful metric is average request latency. An average can hide many slow requests and poor user experiences; it is far more useful for a team to know their 95th, 99th, or even 99.9th percentile latencies. Providing this data helps to ensure that SLAs are being met.
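To see how an average hides tail latency, consider this small, self-contained sketch (the latency values are made up for illustration):

```python
# Nine fast requests and one slow one average out to look healthy,
# while the 95th percentile reveals the poor user experience.
latencies_ms = [2, 2, 2, 2, 2, 2, 2, 2, 2, 500]  # hypothetical request latencies

mean = sum(latencies_ms) / len(latencies_ms)

def percentile(values, pct):
    """Nearest-rank percentile: smallest value covering pct% of the sample."""
    ordered = sorted(values)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

print(mean)                          # 51.8 -- looks tolerable
print(percentile(latencies_ms, 95))  # 500  -- one in ten users waits half a second
```

An alert on the average here would likely stay quiet while a tenth of requests crawl; alerting on p95 or p99 catches it.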
Another meaningful metric for Cassandra is the number of blocked and pending tasks in the thread pools at each execution stage. These numbers tell us when and where work is not getting done in Cassandra. Compaction performance and disk utilization are also key factors that alert an operator that it may be time to take action and add capacity to their cluster or revisit their compaction strategies.
Joining the Community
While we are still building the Seastar platform, we want to share its design with the wider community. We’ve put a lot of thought into it and value feedback.
We think that Cassandra is an amazing database with an outstanding community, and we hope that our platform will be a contribution to the larger ecosystem. We want to encourage community growth by making it easier to get started with Cassandra. We firmly believe that our unique blend of infrastructure, software services, and talented team members will foster community growth by offering a platform that is approachable, inexpensive, and a joy to use. If you’re interested in providing feedback please email us or join our beta program!
Evan Chan, Distinguished Engineer at TupleJump
Evan loves to design, build, and improve bleeding edge distributed data and backend systems using the latest in open source technologies. He has led the design and implementation of multiple big data platforms based on Storm, Spark, Kafka, Cassandra, and Scala/Akka, including a columnar real-time distributed query engine. He is an active contributor to the Apache Spark project, a Datastax Cassandra MVP, and co-creator and maintainer of the open-source Spark Job Server. He is a big believer in GitHub, open source, and meetups, and has given talks at various conferences including Spark Summit, Cassandra Summit, FOSS4G, and Scala Days.
If you are a big data analyst, or build big data solutions for fast analytical queries, you are likely familiar with columnar storage technologies. The open source Parquet file format for HDFS saves space and powers query engines from Spark to Impala and more, while cloud solutions like Amazon Redshift use columnar storage to speed up queries and minimize I/O. Being a file format, however, Parquet is much more challenging to work with directly for real-time data ingest. For applications like IoT, time-series, and event data analytics, many developers have turned to NoSQL databases such as Apache Cassandra, due to their combination of high write scalability and the ease of using an idempotent, primary key-based database API. Most NoSQL databases are not designed for fast, bulk analytical scans, but instead for highly concurrent key-value lookups. What is missing is a solution that combines the ease of use of a database API and the scalability of NoSQL databases with columnar storage technology for fast analytics.
Introducing FiloDB. Distributed. Versioned. Columnar.
I am excited to announce FiloDB, a new open-source distributed columnar database from TupleJump. FiloDB is designed to ingest streaming data of various types, including machine, event, and time-series data, and run very fast analytical queries over it. In four-letter acronyms, it is an OLAP solution, not OLTP.
- Distributed – FiloDB is designed from the beginning to run on best-of-breed distributed, scale-out storage platforms such as Apache Cassandra. Queries run in parallel in Apache Spark for scale-out ad-hoc analysis.
- Columnar – FiloDB brings breakthrough performance levels for analytical queries by using a columnar storage layout with different space-saving techniques like dictionary compression. The performance is comparable to Parquet, and one to two orders of magnitude faster than Spark on Cassandra 2.x for analytical queries. For the POC performance comparison, please see the cassandra-gdelt repo.
- Versioned – At the same time, row- and column-level operations and built-in versioning give FiloDB far more flexibility than can be achieved using file-based technologies like Parquet alone.
Your Database for Fast Streaming + Big Data
FiloDB is designed for streaming applications: it enables easy exactly-once ingestion from Apache Kafka for streaming events, time series, and IoT applications, while still supporting extremely fast ad-hoc analysis through the familiarity of SQL. Each row is keyed by a partition and sort key, and writes using the same key are idempotent. Idempotent writes enable exactly-once storage of event data. FiloDB does the hard work of keeping data stored in an efficient, read-optimized, sorted format.
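The idempotency guarantee is easy to see in miniature. This toy sketch (not FiloDB’s actual API) models a store keyed by partition and sort key, where replaying the same write is a no-op:

```python
# Toy model of key-based idempotent writes: redelivering the same
# (partition key, sort key) write leaves the store unchanged.
store = {}

def write(partition_key, sort_key, value):
    store[(partition_key, sort_key)] = value  # upsert: last write wins

# First delivery of an event from the stream...
write("user-123", "2015-09-22T10:00:00Z", "rock")
# ...and a redelivery of the exact same event (e.g. a Kafka retry).
write("user-123", "2015-09-22T10:00:00Z", "rock")

assert len(store) == 1  # no duplicate row: effectively exactly-once storage
```

This is why at-least-once delivery from a message bus plus key-based upserts yields exactly-once *storage*, without distributed transactions.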
FiloDB + Cassandra + Spark = Lightning-fast Analytics
FiloDB leverages Apache Cassandra as its storage engine, and Apache Spark as its compute layer. Apache Cassandra is one of the most widely deployed, rock-solid distributed databases in use today, with very well understood operational characteristics. Many folks are combining Apache Spark with their Cassandra tables for much richer analytics on their Cassandra data than is possible with just the native Cassandra CQL interface. However, loading massive amounts of data from Cassandra into Spark can still be very slow, especially for analytics and ad-hoc queries such as averaging or computing the correlation between two columns of data. This is because Cassandra CQL tables store data in a row-oriented manner. FiloDB brings the benefits of efficient columnar storage and the flexibility and richness of Apache Spark to the rock-solid storage technology of Cassandra, speeding up analytical queries by up to 100x over Cassandra 2.x.
Easy Ingestion + SQL + JDBC + Spark ML
FiloDB uses Apache Spark SQL and DataFrames as the main query mechanism. This lets you run familiar SQL queries over your data, and easily connect tools such as Tableau to query your data, using Spark’s JDBC connector. At the same time, the full power of Spark is available for your data, including the machine learning MLlib library and GraphX for graph processing.
Ingesting data is also very easy through Spark DataFrames. This means you can easily ingest data from any JDBC data source, Parquet and Avro files, Cassandra tables, and much, much, more. This includes easily inserting data from Spark Streaming and Apache Kafka.
Simplify your Analytics Stack. FiloDB + SMACK for everything.
Use Kafka + Spark + Cassandra + FiloDB to power your entire Lambda architecture implementation. There is no need to implement a complex Lambda dual ingestion pipeline with both Cassandra and Hadoop! You can use the SMACK stack (Spark/Scala, Mesos, Akka, Cassandra, Kafka) for a much bigger portion of your analytics stack than before, reducing infrastructure investment.
What’s in the name?
I love desserts, and filo dough is an essential ingredient. One can think of columns and versions of data as layers, with FiloDB wrapping those layers in a yummy, high-performance analytical database engine.
Proudly built with the Typesafe Stack
FiloDB is built with the Typesafe reactive stack for high-performance distributed computing and asynchronous I/O – Scala, Spark, and Akka.
Slides from my talk at Cassandra Summit 2015!
If you’d like to learn more, I encourage you to check out my slides from Cassandra Summit 2015, where I present FiloDB with Spark and Cassandra! The repo and more details, such as the roadmap, are unveiled in the talk.
Adam Hutson, Data Architect at DataScale, Inc
Adam Hutson is Data Architect for DataScale, Inc. He is a seasoned data professional with experience designing & developing large-scale, high-volume database systems. Adam previously spent four years as Senior Data Engineer for Expedia building a distributed Hotel Search using Cassandra 1.1 in AWS. Having worked with Cassandra since version 0.8, he was early to recognize the value Cassandra adds to Enterprise data storage. Adam is also a DataStax Certified Cassandra Developer.
Shane LaPan, Co-Founder and Chief Product Officer for DataScale, Inc.
Shane LaPan is co-founder and Chief Product Officer for DataScale, Inc. He is a serial entrepreneur and brings a product focus into engineering teams. Prior to DataScale, Shane led Site Engineering at PayPal. His teams were responsible for delivering infrastructure services for products such as PayPal Point of Sale (Home Depot), PayPal Here, Risk & Fraud, and many others.
Planet Cassandra is the first place many people go to learn about Cassandra. For many, the Try Cassandra page is their first real interaction with Cassandra. In the past, the Try Cassandra page has included video tutorials as well as downloads for installing Cassandra on your own machine (virtually or physically). Only after downloading and setting up the Cassandra node(s) would you be able to actually issue commands and begin learning about Cassandra’s capabilities.
Wouldn’t it be great if you could jump to the end of that process, and start typing/learning the commands directly from the Try Cassandra page?
We thought so too! At DataScale, we’ve taken the Cassandra setup part and done it for you. All you have to do is use the web-based interactive CQL Shell that we’ve integrated directly into the Try Cassandra page. That’s it.
The rest of the spirit of the Try Cassandra page hasn’t changed. You will still issue mostly the same commands that you would have before, and learn the same things as before. We’ve just simplified it.
What is CQL Shell?
The CQL Shell is a Python-based command line client that is included in the Cassandra installation. To simplify the interaction with Cassandra, we made that command line experience available via a web browser. You can issue all the same CQL commands that you would from a command line. The only difference is that you don’t have to worry about where to run them, as it’s all in-browser.
How does CQL Shell work?
The Try Cassandra page is divided into two sections. First is the left menu, which is separated into walkthrough instructions for Developer and Administrator. Second is the main body of the page, which has the CQL Shell that resembles a command line interface.
The user should click the desired walkthrough and try out the commands as shown.
This is done by either copying commands from the walkthrough and pasting them into the CQL Shell or by typing the commands directly into the CQL Shell.
The Developer walkthrough has basic Cassandra concepts with the corresponding CQL commands. The user will be able to create tables as well as insert, update, select, and delete records.
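The CQL in the Developer walkthrough is along these lines (the table and values here are illustrative, not the exact walkthrough content):

```sql
CREATE TABLE users (
  userid text PRIMARY KEY,
  first_name text,
  last_name text
);

INSERT INTO users (userid, first_name, last_name) VALUES ('jsmith', 'John', 'Smith');
UPDATE users SET last_name = 'Smythe' WHERE userid = 'jsmith';
SELECT * FROM users WHERE userid = 'jsmith';
DELETE FROM users WHERE userid = 'jsmith';
```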
The Administrator walkthrough is a little bit tougher to do in a web browser, as some of it involves interacting with system level command lines. However, we’ve done our best to show the system commands as well as the CQL commands to run and what the expected output would be. It walks the user through the config files & utility tools, and some typical administrative tasks.
What is DataScale doing in the background?
DataScale is in the business of managing Cassandra as a Service. We have developed an entire platform for managing, monitoring, and providing secure access for application developers to use with their data and/or applications. Part of that platform is the same web-based CQL shell that is exposed on Planet Cassandra’s Try Cassandra page; we also manage all of the Cassandra infrastructure behind it.
The Cassandra cluster behind the Try Cassandra page is hosted in Amazon’s cloud. It uses a single data center of four EC2 nodes of instance type m3.xlarge. These are quad-core Intel Xeon machines with a pair of 40 GB SSDs and 15 GB of RAM. These four nodes are each running Cassandra version 2.1.3 (at the time of this blog) and are spread across three Amazon availability zones. In addition to the Cassandra nodes, we also have a couple of smaller machines used for various cluster monitoring/logging and our back-end web server code.
What are some challenges with hosting Try Cassandra?
The Try Cassandra environment is a multi-tenant architecture, meaning that a single Cassandra cluster serves all the sessions generated from Try Cassandra. Because of this, we’ve had to set a few rules around what can and cannot be done in the environment.
Since many users will be connecting to Try Cassandra at the same time, we have to be able to segregate users from each other. It would be a bad user experience for one user to be able to affect the environment of another user (i.e. User A could potentially delete User B’s data/table structure). So to ensure that cannot happen, we have enforced a couple of rules. Each user/session gets its own dynamically generated user and keyspace created specifically for it. That user is then granted access to only that keyspace and has access to all other keyspaces revoked. At the end of the session (or 3 hours, whichever comes first) the user and keyspace will be removed. This keeps capacity and disk usage to a minimum.
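In CQL terms, the per-session isolation amounts to something like the following (the identifiers and replication settings are illustrative, not our exact provisioning code):

```sql
-- Run by our backend when a session starts (identifiers are generated):
CREATE USER session_ab12 WITH PASSWORD 'generated-secret' NOSUPERUSER;
CREATE KEYSPACE session_ab12
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
GRANT ALL PERMISSIONS ON KEYSPACE session_ab12 TO session_ab12;

-- And when the session ends (or after 3 hours):
DROP KEYSPACE session_ab12;
DROP USER session_ab12;
```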
We’ve also disabled a couple of the intrinsic cqlsh commands (COPY & SOURCE) that allow you to load data from files. Since those specific commands are not part of the Try Cassandra walkthroughs, we don’t allow the user to do it. Doing so would also open up the browser to the file system, which is never a good idea.
Is there a limit to the number of users that can connect to CQL Shell at one time?
No. Theoretically, we are only limited by our hardware. The monitoring we have in place will allow us to scale up automatically if any of our capacity metrics reach our thresholds.
Can I spin up my own Cassandra cluster, and code against it in-browser?
Absolutely. Head on over to DataScale.io and sign up to experience our service yourself. We will contact you to set up time to discuss your current implementation and how we can help. Make sure you let us know you found us via PlanetCassandra.org, we’ve set up special pricing for qualifying startups.
Noel Cody, Software Engineer at Spotify
Noel Cody is a Software Engineer at Spotify. His previous work experience includes building and managing an R-based social CRM and audience segmentation system at Edelman Digital. “Cassandra: Data-Driven Configuration” is an original Spotify Labs blog posting. Check out Noel’s website for more information about himself, here: http://noelcody.com/
Spotify currently runs over 100 production-level Cassandra clusters. We use Cassandra across user-facing features, in our internal monitoring and analytics stack, paired with Storm for real-time processing, you name it.
With scale come questions. “If I change my consistency level from ONE to QUORUM, how much performance am I sacrificing? What about a change to my data model or I/O pattern? I want to write data in really wide columns, is that ok?”
Rules of thumb lead to basic answers, but we can do better. These questions are testable, and the best answers come from pre-launch load-testing and capacity planning. Any system with a strict SLA can and should simulate production traffic in staging before launch.
We’ve been evaluating the cassandra-stress tool to test and forecast database performance at Spotify. The rest of this post will walk through an example scenario using cassandra-stress to plan out a new Cassandra cluster.
A CASE STUDY
We’re writing a new service. This service will maintain a set of key-value pairs with two parts:
- A realtime pipeline that writes data each time a song is streamed on Spotify. Data is keyed on an anonymous user ID and the name of the feature being written. For example: (12345, currentGenreListened): rock
- A client that periodically reads all features for an anonymous id in batch.
Our SLA: We care about latency and operation rate. Strict consistency is not as important. Let’s keep operation latency under a very low 5 ms at the 95th percentile (for illustration’s sake) and ops/s as high as possible, even at peak load.
The access and I/O patterns we expect to see:
- Client setup: Given that we’re writing from a distributed realtime system, we’ll have many clients each with multiple connections.
- I/O pattern: This will be write-heavy; let’s say we’d expect a peak operations/second of about 50K for this service at launch, but would like room to scale to nearly twice this.
- Column size: We expect values written to columns to be small in size, but expect that this may change.
We’ll use cassandra-stress to determine peak load for our cluster and to measure its performance.
Cassandra-stress comes with the default Cassandra installation from DataStax; find it in install_dir/tools/bin. We’re using the version released with DataStax Community Edition 2.1.
Setup is managed via one file, the .yaml profile, which defines our data model, the “shape” of our data, and our queries. We’ll build this profile in three steps (more detail in the official docs):
Dump out the schema: The first few sections of the profile defining our keyspace and table can be dumped nearly directly from a cql session via describe keyspace. Here’s our setup, which uses mostly default table config values:
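As a sketch of those first sections (the keyspace and table definitions below are assumptions based on the data model described in this post, not the actual production schema):

```yaml
keyspace: stressexample
keyspace_definition: |
  CREATE KEYSPACE stressexample
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

table: features
table_definition: |
  CREATE TABLE features (
    userid text,
    featurename text,
    featurevalue text,
    PRIMARY KEY (userid, featurename)
  );
```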
Describe data: The columnspec config describes the “shape” of data we expect to see in each of our columns. In our case:
- We’ll expect anonymous userids to have a fixed length, from a total possible population of 100M values.
- featurename values will vary in length between 5 and 25 characters within a normal (gaussian) distribution, with most in the middle of this range and fewer at the edges. Pull these values from a small population of 4 possible keys.
- We’ll have a moderate cluster of possible featurevalues per key. These values also will vary in length.
The columnspec config:
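A hedged sketch of that columnspec, using cassandra-stress’s distribution syntax (the exact sizes and distributions here are assumptions based on the description above):

```yaml
columnspec:
  - name: userid
    size: fixed(16)                # fixed-length anonymous ids
    population: uniform(1..100M)   # drawn from ~100M possible users
  - name: featurename
    size: gaussian(5..25)          # 5-25 chars, most in the middle of the range
    population: uniform(1..4)      # a small population of 4 possible keys
  - name: featurevalue
    size: gaussian(10..100)        # values also vary in length
    cluster: uniform(10..50)       # a moderate number of values per key
```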
Define queries: Our single query will pull all values for a provided userid. Our data model uses a compound primary key, but the userid is our partition key; this means that any userid query should be a fast single-node lookup.
The queries config:
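A sketch of that queries section (assuming a table named features with userid as the partition key; the op name matches the getallfeatures used in the commands later in this post):

```yaml
queries:
  getallfeatures:
    cql: select * from features where userid = ?
    fields: samerow
```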
Note: The fields value doesn’t matter for this query. For queries with multiple bound variables (i.e. multiple ?s; we’ve got just one), fields determines which rows those variables are pulled from.
Insert operations aren’t defined in the yaml – cassandra-stress generates these automatically using the schema in our config. When we actually run our tests, we’ll run a mix of our userid query and insert operations to mimic the write-heavy pattern we’d expect to see from our service.
SETUP: TEST PROCESS
A final note on process before we start. Throughout testing, we’ll do the following to keep our data clean:
- Drop our keyspace between each test. This clears out any memtables or SSTables already holding data from previous tests.
- Because we want to ensure we’re really testing our Cassandra cluster, we’ll use tools like htop and ifstat to monitor system stats during tests. This ensures that our testing client isn’t CPU, memory, or network-bound.
- Of course, the host running cassandra-stress shouldn’t be one of our Cassandra nodes.
Ready to go. Our first command is:
cassandra-stress user profile=userprofile.yaml ops\(insert=24, getallfeatures=1\) -node file=nodes
This will kick off a series of tests, each increasing the number of client connections. Because this is all running within one cassandra-stress process, client connection count is called “client threadcount” here. Each test is otherwise identical, with a default consistency level of LOCAL_ONE. We’ve run this for both a 3-node and a 6-node cluster with the following output:
Hardware details: The Cassandra cluster for this test consisted of Fusion-io equipped machines with 64GB memory and two Haswell E5-2630L v3 CPUs (32 logical cores). The client machine was equipped with 32GB memory and two Sandy Bridge E5-2630L CPUs (24 logical cores).
Performance mostly increases with client connection count. The relationship between performance and node count isn’t linear; we make better use of the increased node count with more client threads, and a three-node setup actually performs slightly better at 4 – 8 client threads. Interestingly, under a three-node setup we see a slight decrease between 600 and 900 threads, where overhead of managing threads starts to hurt. This decline disappears in a 6-node cluster.
So we’re good up to many client connections. Given that the Datastax Java client creates up to 8 connections per Cassandra machine (depending where the machine is located relative to clients), we could calculate the number of connections we’d expect to create at peak load given our client host configuration.
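As a back-of-the-envelope sketch (the client host count here is a made-up assumption), that calculation is simple multiplication:

```python
# Rough peak-connection estimate: the DataStax Java driver opens up to
# 8 connections per Cassandra node it talks to, so total connections
# scale with both client hosts and cluster size.
client_hosts = 20        # assumed number of realtime-pipeline instances
nodes = 6                # Cassandra nodes in the cluster
max_conns_per_node = 8   # driver maximum per node for nearby hosts

peak_connections = client_hosts * nodes * max_conns_per_node
print(peak_connections)  # 960 connections at peak
```

Comparing that estimate against the threadcount sweep above tells us whether we are operating in the flat or the degrading part of the curve.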
Next we’ll look at consistency level and column data size. Here we’ll use a fixed number of client threads that we’d expect to use under max load given our setup. We’ll measure just how much performance degrades as we make our consistency level more strict:
cassandra-stress user profile=userprofile.yaml cl=one ops\(insert=24, getallfeatures=1\) -node file=nodes -rate threads=$THREADCOUNT
# Also ran with ‘cl=quorum’.
And again as we increase the column size about logarithmically, from our baseline’s <100 chars to 1000 to 10000:
cassandra-stress user profile=userprofile_largecols.yaml ops\(insert=24, getallfeatures=1\) -node file=nodes -rate threads=$THREADCOUNT
# Here a new userprofile_largecols yaml defines the featurename and featurevalue columns as being either 1000 (first test) or 10000 (second test) characters long.
Results tell us just how we’d expect our cluster to perform under these conditions:
With a consistency level of ONE and a 6-node cluster, we’d expect a 95p latency of under 5 ms/op. This meets our SLA. This scenario also provides a high number of ops/s, almost 90k. In fact, we could increase our consistency level to QUORUM with 6 nodes and still meet the SLA.
In contrast, the 3-node cluster fails to meet our SLA even at low consistency. And under either configuration, increasing the size of our columns (for instance, if we began writing a large JSON string in the values field rather than a single value) also takes us out of bounds.
Nice. Enough data to back up our configuration. Other questions regarding performance during compaction or failure scenarios could be tested with a bit more work:
- To measure how the cluster performs under a set of failure scenarios: We might simulate downed or fluttering nodes by controlling the Cassandra process via script during additional tests.
- To measure the impact of compaction, repair, GC, and other long-term events or trends: We might run cassandra-stress over a long period (weeks or months) via script, perhaps in a setup parallel to our production cluster. This would allow us to catch and diagnose any emerging long-term issues and to measure performance during compaction and repair, impossible during the short-term load tests described in this post. This would also ensure that writes hit disk rather than RAM only, as shorter-term tests may do. We’d expect Cassandra’s performance to degrade while under compaction or repair and while reading from disk rather than memory.
Anything missing here? One notable limitation: Cassandra-stress simulates peak load only. We’d need another solution to measure non-peak load or to fire requests at a set non-peak rate. Another: This doesn’t address lost operations or mismatched reads/writes; if a service requires very strict consistency, more testing might be necessary.
In the end, this data-driven configuration allowed us to make several key decisions during setup. We were considering launching with 3 nodes – now we know we need more. We’ve set a bound on column data size. And we’ve got data to show that if we do want to increase consistency from ONE to QUORUM, we can.
A little extra planning and we sleep better knowing we’ll handle heavy traffic day-one.
Julien Anguenot, VP of Software Engineering at iland Internet Solutions, Corp
Julien is an accomplished and organized software craftsman. He is also a long time open-source advocate. Julien serves as iland’s Vice President of Software Engineering and is responsible for the strategic vision and development of iland’s cloud management console, underlying platform, and API. Catch Julien at Cassandra Summit 2015, presenting “Leveraging Cassandra for Real-Time Multi Datacenter Public Cloud Analytics”.
We attended our first summit last year after several months of hard work with Cassandra and a successful launch of our brand new iland cloud management console and platform. We had a fantastic time in San Francisco, learned a lot from the community, met awesome people, and came back home with a lot of new ideas.
This year, reaching a Cassandra milestone with datacenter #6 joining the ring, we are even more excited to attend, since we will also be sharing with the community how we leverage open-source Cassandra to provide real-time, multi-datacenter public cloud analytics to our customers in the US, Europe, and Asia.
iland cloud story & use case
- data & domain constraints
- deployment, hardware, configuration and architecture overview
- lessons learned
- future platform extensions
Come and join us on Wednesday the 23rd at 1 p.m. (M1 – M3)
Thank you, DataStax, for the opportunity!
This year, iland’s CTO, Justin Giardina, will be joining us, as will Cory Snyder, one of our leading software engineers. Just ping us on Twitter if you are interested in touching base (@anguenot or @jgiardina).
Happy summit to everyone!