The New Try Cassandra Page:  Web-based CQL Shell Improves Interaction With Cassandra Clusters

September 22, 2015


Adam Hutson, Data Architect at DataScale, Inc
Adam Hutson is Data Architect for DataScale, Inc. He is a seasoned data professional with experience designing & developing large-scale, high-volume database systems.  Adam previously spent four years as Senior Data Engineer for Expedia building a distributed Hotel Search using Cassandra 1.1 in AWS.  Having worked with Cassandra since version 0.8, he was early to recognize the value Cassandra adds to Enterprise data storage.  Adam is also a DataStax Certified Cassandra Developer.

Shane LaPan, Co-Founder and Chief Product Officer for DataScale, Inc.
Shane LaPan is co-founder and Chief Product Officer for DataScale, Inc. He is a serial entrepreneur and brings a product focus into engineering teams. Prior to DataScale, Shane led Site Engineering at PayPal. His teams were responsible for delivering infrastructure services for products such as PayPal Point of Sale (Home Depot), PayPal Here, Risk & Fraud, and many others.

Planet Cassandra is the first place many people go to learn about Cassandra.  For many, the Try Cassandra page is their first real interaction with Cassandra.  In the past, the Try Cassandra page has included video tutorials as well as downloads for installing Cassandra on your own machine (virtually or physically).  Only after downloading and setting up the Cassandra node(s) could you actually issue commands and begin learning about Cassandra’s capabilities.

Wouldn’t it be great if you could jump to the end of that process, and start typing/learning the commands directly from the Try Cassandra page?  

We thought so too!  At DataScale, we’ve taken the Cassandra setup part and done it for you.  All you have to do is use the web-based interactive CQL Shell that we’ve integrated directly into the Try Cassandra page.  That’s it.

The rest of the spirit of the Try Cassandra page hasn’t changed.  You will still issue mostly the same commands that you would have before, and learn the same things as before.  We’ve just simplified it.


What is CQL Shell?

The CQL Shell is a Python-based command line client included in the Cassandra installation.  To simplify the interaction with Cassandra, we made that command line experience available via a web browser.  You can issue all the same CQL commands that you would from a command line.  The only difference is that you don’t have to worry about where to run them from, as it’s all in-browser.

How does CQL Shell work?

The Try Cassandra page is divided into two sections.  First is the left menu, which is separated into walkthrough instructions for Developer and Administrator.  Second is the main body of the page, which has the CQL Shell that resembles a command line interface.

The user should click the desired walkthrough and try out the commands as shown.  This is done either by copying commands from the walkthrough and pasting them into the CQL Shell or by typing them directly into the CQL Shell.


The Developer walkthrough covers basic Cassandra concepts with the corresponding CQL commands.  The user will be able to create tables as well as insert, update, select, and delete records.
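A session in the Developer walkthrough runs along these lines (illustrative CQL only; the table and column names here are ours, not the walkthrough’s exact commands):

```sql
-- Illustrative only: names are hypothetical, not the walkthrough's.
CREATE TABLE users (
    username text PRIMARY KEY,
    email text,
    city text
);

INSERT INTO users (username, email, city)
VALUES ('jdoe', 'jdoe@example.com', 'Austin');

UPDATE users SET city = 'Denver' WHERE username = 'jdoe';

SELECT * FROM users WHERE username = 'jdoe';

DELETE FROM users WHERE username = 'jdoe';
```

All of this runs directly in the browser-based shell, exactly as it would in a local cqlsh session.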


The Administrator walkthrough is a little bit tougher to do in a web browser, as some of it involves interacting with system level command lines.  However, we’ve done our best to show the system commands as well as the CQL commands to run and what the expected output would be.  It walks the user through the config files & utility tools, and some typical administrative tasks.



What is DataScale doing in the background?

DataScale is in the business of managing Cassandra as a Service.  We have developed an entire platform for managing, monitoring, and providing secure access for application developers to use with their data and/or applications.  Part of that platform is the same web-based CQL shell now exposed on Planet Cassandra’s Try Cassandra page, along with the management of all the Cassandra infrastructure behind it.

The Cassandra cluster behind the Try Cassandra page is hosted in Amazon’s cloud.  It uses a single data center of four EC2 nodes of instance type m3.xlarge.  These are quad-core Intel Xeon machines with a pair of 40 GB SSDs and 15 GB of RAM.  These four nodes each run Cassandra version 2.1.3 (at the time of this blog) and are spread across three Amazon availability zones.  In addition to the Cassandra nodes, we also have a couple of smaller machines used for cluster monitoring/logging and our back-end web server code.


What are some challenges with hosting Try Cassandra?

The Try Cassandra environment is a multi-tenant architecture, meaning that a single Cassandra cluster serves all the sessions generated from Try Cassandra.  Because of this, we’ve had to set a few rules around what can and cannot be done in the environment.

Since many users will be connecting to Try Cassandra at the same time, we have to segregate users from each other.  It would be a bad user experience for one user to be able to affect the environment of another (e.g. User A could potentially delete User B’s data or table structure).  To ensure that cannot happen, we have enforced a couple of rules.  Each user/session gets its own dynamically generated user and keyspace created specifically for it.  That user is then granted access to only that keyspace and has access to all other keyspaces revoked.  At the end of the session (or 3 hours, whichever comes first) the user and keyspace are removed.  This keeps capacity and disk usage to a minimum.
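The per-session isolation described above can be expressed in plain CQL (a sketch with hypothetical names; DataScale’s actual provisioning code is not shown here):

```sql
-- Hypothetical names; illustrates the isolation scheme only.
CREATE USER session_a1b2c3 WITH PASSWORD 'generated-secret' NOSUPERUSER;
CREATE KEYSPACE ks_a1b2c3
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- Grant the session user access to its own keyspace only...
GRANT ALL PERMISSIONS ON KEYSPACE ks_a1b2c3 TO session_a1b2c3;

-- ...and clean up when the session ends (or after 3 hours).
DROP KEYSPACE ks_a1b2c3;
DROP USER session_a1b2c3;
```

Because the generated user is never granted permissions on any other keyspace, one session cannot see or modify another session’s data.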

We’ve also disabled a couple of the intrinsic cqlsh commands (COPY & SOURCE) that load data from files.  Since those commands are not part of the Try Cassandra walkthroughs, we don’t allow users to run them.  Allowing them would also open up the browser to the file system, which is never a good idea.


Is there a limit to the number of users that can connect to CQL Shell at one time?

No.  Theoretically, we are only limited by our hardware.  The monitoring we have in place will allow us to scale up automatically if any of our capacity metrics reach our thresholds.


Can I spin up my own Cassandra cluster, and code against it in-browser?

Absolutely.  Head on over and sign up to experience our service yourself.  We will contact you to set up a time to discuss your current implementation and how we can help.  Make sure you let us know how you found us; we’ve set up special pricing for qualifying startups.

Cassandra: Data-Driven Configuration

September 21, 2015


Noel Cody, Software Engineer at Spotify
Noel Cody is a Software Engineer at Spotify. His previous work experience includes building and managing an R-based social CRM and audience segmentation system at Edelman Digital. “Cassandra: Data-Driven Configuration” is an original Spotify Labs blog posting. Check out Noel’s website for more information.

Spotify currently runs over 100 production-level Cassandra clusters. We use Cassandra across user-facing features, in our internal monitoring and analytics stack, paired with Storm for real-time processing, you name it.

With scale come questions. “If I change my consistency level from ONE to QUORUM, how much performance am I sacrificing? What about a change to my data model or I/O pattern? I want to write data in really wide columns, is that ok?”

Rules of thumb lead to basic answers, but we can do better. These questions are testable, and the best answers come from pre-launch load-testing and capacity planning. Any system with a strict SLA can and should simulate production traffic in staging before launch.

We’ve been evaluating the cassandra-stress tool to test and forecast database performance at Spotify. The rest of this post will walk through an example scenario using cassandra-stress to plan out a new Cassandra cluster.


We’re writing a new service that will maintain a set of key-value pairs.  The system has two parts:

  1. A realtime pipeline that writes data each time a song is streamed on Spotify. Data is keyed on an anonymous user ID and the name of the feature being written. For example: (12345, currentGenreListened): rock
  2. A client that periodically reads all features for an anonymous id in batch.


Our SLA: We care about latency and operation rate.  Strict consistency is not as important.  Let’s keep operation latency below a very low 5 ms at the 95th percentile (for illustration’s sake) and ops/s as high as possible, even at peak load.

The access and I/O patterns we expect to see:

  • Client setup: Given that we’re writing from a distributed realtime system, we’ll have many clients each with multiple connections.
  • I/O pattern: This will be write-heavy; let’s say we’d expect a peak operations/second of about 50K for this service at launch, but would like room to scale to nearly twice this.
  • Column size: We expect values written to columns to be small in size, but expect that this may change.

We’ll use cassandra-stress to determine peak load for our cluster and to measure its performance.


Cassandra-stress comes with the default Cassandra installation from DataStax; find it in install_dir/tools/bin.  We’re using the version released with DataStax Community Edition 2.1.

Setup is managed via one file, the .yaml profile, which defines our data model, the “shape” of our data, and our queries. We’ll build this profile in three steps (more detail in the official docs):

Dump out the schema: The first few sections of the profile, defining our keyspace and table, can be dumped nearly directly from a CQL session via DESCRIBE KEYSPACE. Here’s our setup, which uses mostly default table config values:
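The dumped schema would look something like the following (a sketch: the keyspace/table names and replication settings are our assumptions, based on the compound key described later in this post):

```sql
CREATE KEYSPACE stresscql
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- Compound primary key: userid is the partition key,
-- featurename clusters rows within each partition.
CREATE TABLE stresscql.features (
    userid text,
    featurename text,
    featurevalue text,
    PRIMARY KEY (userid, featurename)
);
```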


Describe data: The columnspec config describes the “shape” of data we expect to see in each of our columns. In our case:

  • We’ll expect anonymous userids to have a fixed length, from a total possible population of 100M values.
  • featurename values will vary in length between 5 and 25 characters within a normal (gaussian) distribution, with most in the middle of this range and fewer at the edges. Pull these values from a small population of 4 possible keys.
  • We’ll have a moderate cluster of possible featurevalues per key. These values also will vary in length.

The columnspec config:
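A columnspec matching the description above might look like this (the exact distribution parameters here are illustrative; the syntax follows the cassandra-stress profile format):

```yaml
columnspec:
  - name: userid
    size: fixed(32)                # fixed-length anonymous ids
    population: uniform(1..100M)   # drawn from ~100M possible values
  - name: featurename
    size: gaussian(5..25)          # 5-25 chars, most near the middle
    population: uniform(1..4)      # small population of 4 possible keys
  - name: featurevalue
    size: gaussian(10..100)        # small values for now, may grow later
    cluster: uniform(1..10)        # moderate number of values per key
```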


Define queries: Our single query will pull all values for a provided userid. Our data model uses a compound primary key, but the userid is our partition key; this means that any userid query should be a fast single-node lookup.

The queries config:
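The single query might be defined as follows (table and column names are our assumptions; the query name matches the ops argument used in the commands below):

```yaml
queries:
  getallfeatures:
    cql: select * from features where userid = ?
    fields: samerow   # which row bound variables are drawn from
```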


Note: The fields value doesn’t matter for this query. For queries with multiple bound variables (i.e. multiple ?s; we’ve got just one), fields determines which rows those variables are pulled from.

Insert operations aren’t defined in the yaml – cassandra-stress generates these automatically using the schema in our config. When we actually run our tests, we’ll run a mix of our userid query and insert operations to mimic the write-heavy pattern we’d expect to see from our service.


A final note on process before we start. Throughout testing, we’ll do the following to keep our data clean:

  • Drop our keyspace between each test. This clears out any memtables or SSTables already holding data from previous tests.
  • Because we want to ensure we’re really testing our Cassandra cluster, we’ll use tools like htop and ifstat to monitor system stats during tests. This ensures that our testing client isn’t CPU, memory, or network-bound.
  • Of course, the host running cassandra-stress shouldn’t be one of our Cassandra nodes.


Ready to go. Our first command is:

cassandra-stress user profile=userprofile.yaml ops\(insert=24, getallfeatures=1\) -node file=nodes

This will kick off a series of tests, each increasing the number of client connections.  Because this is all running within one cassandra-stress process, client connection count is called “client threadcount” here.  Each test is otherwise identical, with a default consistency level of LOCAL_ONE.  We’ve run this for both 3- and 6-node clusters with the following output:


Hardware details: The Cassandra cluster for this test consisted of Fusion-io equipped machines with 64GB memory and two Haswell E5-2630L v3 CPUs (32 logical cores). The client machine was equipped with 32GB memory and two Sandy Bridge E5-2630L CPUs (24 logical cores).

Performance mostly increases with client connection count.  The relationship between performance and node count isn’t linear; we make better use of the increased node count with more client threads, and a three-node setup actually performs slightly better at 4 – 8 client threads.  Interestingly, under a three-node setup we see a slight decrease between 600 and 900 threads, where the overhead of managing threads starts to hurt.  This decline disappears in a 6-node cluster.

So we’re good up to many client connections.  Given that the DataStax Java client creates up to 8 connections per Cassandra machine (depending where the machine is located relative to clients), we could calculate the number of connections we’d expect to create at peak load given our client host configuration.

Next we’ll look at consistency level and column data size. Here we’ll use a fixed number of client threads that we’d expect to use under max load given our setup. We’ll measure just how much performance degrades as we make our consistency level more strict:

cassandra-stress user profile=userprofile.yaml cl=one ops\(insert=24, getallfeatures=1\) -node file=nodes -rate threads=$THREADCOUNT

# Also ran with 'cl=quorum'.

And again as we increase the column size about logarithmically, from our baseline’s <100 chars to 1000 to 10000:

cassandra-stress user profile=userprofile_largecols.yaml ops\(insert=24, getallfeatures=1\) -node file=nodes -rate threads=$THREADCOUNT

# Here a new userprofile_largecols yaml defines the featurename and featurevalue columns as being either 1000 (first test) or 10000 (second test) characters long.

Results tell us just how we’d expect our cluster to perform under these conditions:



With a consistency level of ONE and a 6-node cluster, we’d expect a 95p latency of under 5 ms/op. This meets our SLA. This scenario also provides a high number of ops/s, almost 90k. In fact, we could increase our consistency level to QUORUM with 6 nodes and still meet the SLA.

In contrast, the 3-node cluster fails to meet our SLA even at low consistency. And under either configuration, increasing the size of our columns (for instance, if we began writing a large JSON string in the values field rather than a single value) also takes us out of bounds.

Nice. Enough data to back up our configuration. Other questions regarding performance during compaction or failure scenarios could be tested with a bit more work:

  • To measure how the cluster performs under a set of failure scenarios: We might simulate downed or fluttering nodes by controlling the Cassandra process via script during additional tests.
  • To measure the impact of compaction, repair, GC, and other long-term events or trends: We might run cassandra-stress over a long period (weeks or months) via script, perhaps in a setup parallel to our production cluster. This would allow us to catch and diagnose any emerging long-term issues and to measure performance during compaction and repair, impossible during the short-term load tests described in this post. This would also ensure that writes hit disk rather than RAM only, as shorter-term tests may do. We’d expect Cassandra’s performance to degrade while under compaction or repair and while reading from disk rather than memory.

Anything missing here? One notable limitation: Cassandra-stress simulates peak load only. We’d need another solution to measure non-peak load or to fire requests at a set non-peak rate. Another: This doesn’t address lost operations or mismatched reads/writes; if a service requires very strict consistency, more testing might be necessary.

In the end, this data-driven configuration allowed us to make several key decisions during setup. We were considering launching with 3 nodes – now we know we need more. We’ve set a bound on column data size. And we’ve got data to show that if we do want to increase consistency from ONE to QUORUM, we can.

A little extra planning and we sleep better knowing we’ll handle heavy traffic day-one.

iLand Counts Down to Cassandra Summit 2015

September 17, 2015


Julien Anguenot, VP of Software Engineering at iland Internet Solutions, Corp
Julien is an accomplished and organized software craftsman. He is also a long time open-source advocate. Julien serves as iland’s Vice President of Software Engineering and is responsible for the strategic vision and development of iland’s cloud management console, underlying platform, and API. Catch Julien at Cassandra Summit 2015, presenting “Leveraging Cassandra for Real-Time Multi Datacenter Public Cloud Analytics”.

We are only 6 days away from the DataStax Cassandra Summit 2015 and, at iland cloud, we’ve had this date marked on our calendar for a while.

We attended our first summit last year after several months of hard work with Cassandra and a successful launch of our brand new iland cloud management console and platform.  We had a fantastic time in San Francisco, learned a lot from the community, met awesome people, and came back home with a lot of new ideas.

This year, reaching a Cassandra milestone with datacenter #6 joining the ring, we are even more excited to be there, since we will also be sharing with the community how we leverage open-source Cassandra to provide real-time multi-datacenter public cloud analytics to our customers in the US, Europe and Asia.


iland cloud story & use case

  1. data & domain constraints
  2. deployment, hardware, configuration and architecture overview
  3. lessons learned
  4. future platform extensions

Come and join us on Wed. the 23rd at 1 p.m. (M1 – M3).

Thank you, DataStax, for the opportunity!

This year, iland’s CTO, Justin Giardina, will be joining us, and so will Cory Snyder, one of our leading software engineers. Just ping us on Twitter if you are interested in touching base (@anguenot or @jgiardina).

Happy summit to everyone!

Oded Peer, System Architect at RSA
"We selected Cassandra and DataStax for the availability and scalability it provides - our service needs to be 'always on' and scale to securely support the billions of transactions we protect."

RSA’s intelligence driven security solutions help organizations reduce the risks of operating in a digital world. Through visibility, analysis, and action, RSA solutions give customers the ability to detect, investigate, and respond to advanced threats; confirm and manage identities; and ultimately, prevent IP theft, fraud and cybercrime.

I’ve worked with different SQL databases in the past, including DB2 and SQL Server, but am now on Cassandra.  My current position at RSA is System Architect.

Adaptive Authentication fraud detection

Our team provides software as a service for transparent, risk-based authentication.  It is primarily used by banks and financial institutions to protect their online banking.

When a user performs actions on the online banking site or mobile app our service analyzes the risk associated with the activity to prevent fraud. Analyzing the risk level and using different policies, we can then trigger different authentication methods to verify the user interaction.

Previously, Adaptive Authentication ran on Oracle, up through version 11.0.  We made availability a directive and undertook a major architectural transformation in Adaptive Authentication version 12, which now runs on Microsoft Azure with Apache Cassandra.

Moving to NoSQL & the cloud

We selected Cassandra and DataStax for the availability and scalability it provides – our service needs to be ‘always on’ and scale to securely support the billions of transactions we protect.

We’re using DataStax Enterprise 4.5 for our Apache Cassandra needs and evaluating Spark in DataStax Enterprise 4.6 for our machine-driven algorithms that today run on a SQL database. We’re evaluating Spark to avoid having to move data around to write to different resources, which is not the most effective method.

We’re using DataStax Java driver 2.1, which is very good.  We are using async queries, and the results speak for themselves.  We are also using DataStax OpsCenter to initiate some of Cassandra’s operational tasks, while for monitoring we push JMX metrics into Zabbix.

We decided Cassandra was a good fit with Microsoft Azure.  Microsoft is investing heavily in Azure; they’re adding features all the time, and it’s maturing fast.  At the beginning, the gap between Amazon Web Services and Azure was really big, but as time has gone on and we’ve used Azure, they’ve been catching up.

We considered different NoSQL databases such as MongoDB, but Cassandra is really the only one that was built from the start as a distributed database.  MongoDB was built as a document database, and replication and availability were added later on; Cassandra was built from scratch as a distributed database allowing for high availability and partition tolerance.  That was the key factor for us.

Words of wisdom & the Cassandra Community

Most of the time when people talk about Cassandra’s performance, they talk about write performance.  Providing software as a service, we’re bound by strict response time SLAs, so reading data fast is important to us, and for our use case the read/write ratio is almost one to one.  Cassandra is not as fast at reading data as at writing it, and tuning reads specifically can be a bit tricky.

I didn’t have any trouble learning Cassandra – the documentation DataStax creates for Cassandra is really good; it’s awesome.  I read all about the Cassandra internals to understand how to define our data model and match it to the Cassandra way of thinking.  It’s really different from SQL because you don’t have any relations and you don’t normalize data, but the documentation was helpful in getting through it.

I try to answer questions on the Apache Cassandra user mailing list when I get a chance, mostly about data modeling.  I appreciate that Cassandra is open source.  When I encounter a bug, or want to better understand how certain features work in Cassandra, I can just open the source code and find out how it works.  That’s a huge plus for me.

KDM: An Automated Data Modeling Tool for Apache Cassandra, Pt. 2

September 2, 2015


Andrey Kashlev, Big Data Researcher at Wayne State University
Andrey Kashlev is a PhD candidate in big data, working in the Department of Computer Science at Wayne State University. His research focuses on big data, including data modeling for NoSQL, big data workflows, and provenance management. He has published numerous research articles in peer-reviewed international journals and conferences, including IEEE Transactions on Services Computing, Data and Knowledge Engineering, International Journal of Computers and Their Applications, and the IEEE International Congress on Big Data. Catch Andrey at Cassandra Summit 2015, presenting “World’s Best Data Modeling Tool for Apache Cassandra”.

This is the second article of our series on KDM, an online tool that automates Cassandra data modeling. We highly recommend that you first read part 1 of this series where we overview our tool and use it to build a data model for an IoT application. In this article, we will demonstrate a more complex use case and will cover some of the advanced features of KDM.


Use Case: A Media Cataloguing Application

We will create a data model for a media cataloguing application that manages information about artists, their albums, songs, users, and playlists. Users often browse through artists, albums and songs, create, play, and share playlists, and invite their friends to sign up for the app. We use KDM to design a database that will efficiently support our application. Database design using KDM consists of five steps, starting with a conceptual data model and ending with a Cassandra database schema.


Step 1: Design a Conceptual Data Model.

As we discussed in part 1, we first design a conceptual data model for our application (Fig. 1).


Fig. 1. A conceptual data model for our media cataloguing application.

To ensure correctness of our data model, we must carefully specify key and cardinality constraints.  Note that a user can be identified by either username or email.  Thus, we specify the two alternative keys by right-clicking on User -> Set keys, as shown in Fig. 2.  We will later see that KDM uses this information when generating a logical data model.  Optionally, we may annotate username and email as key1 and key2 on the conceptual data model (Fig. 1).  We also assume that an album is uniquely identified by a combination of title and year.


Fig. 2. Specifying alternative keys username and email for the User entity type.

Step 2: Specify Access Patterns

We now specify data access patterns associated with the application tasks.  Consider the application workflow shown in Fig. 3.  It models a web-based application that allows users to interact with various web pages (tasks) to retrieve data using well-defined queries.  For example, upon logging in, a user starts with browsing albums of a particular artist, genre or date (Q1–Q3).  Once the user finds an album of interest, he searches for playlists featuring this album (Q4, Q5), etc.


Q1: Find albums by a given artist. Order results by year (DESC).

Q2: Find albums by genre released after a given year. Order results by year (DESC).

Q3: Find albums of a given genre released in a given country after a given year. Order results by year (DESC).

Q4: Find playlists by a given album.

Q5: Find distinct playlists by a given album.

Q6: Find music genres featured by a given playlist. Show distinct genres.

Q7: Find a user who created a given playlist.

Q8: Find users who shared a given playlist.

Q9: Find users invited by a given user.

Q10: Find a user who invited a given user.

Q11. Find distinct playlists created by a given user, featuring music of a given style.

Q12. Find distinct albums featured by a given playlist.

Fig. 3. An application workflow for our media cataloguing use case.
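To make the target of these access patterns concrete, Q1 maps naturally onto a single Cassandra table. The following is a hand-written sketch (column names are ours; KDM derives the actual schemas in the later steps):

```sql
-- A hand-written sketch of a table answering Q1; KDM generates the real one.
CREATE TABLE albums_by_artist (
    artist_name text,
    year int,
    title text,
    genre text,
    PRIMARY KEY ((artist_name), year, title)
) WITH CLUSTERING ORDER BY (year DESC, title ASC);

-- Q1: find albums by a given artist, ordered by year (DESC).
SELECT * FROM albums_by_artist WHERE artist_name = 'Nirvana';
```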

We now discuss several advanced features of KDM concerning access patterns.

The Q4 access pattern retrieves all playlists featuring songs from a given album.  By default, using the “find value” feature on the id, name and tags attribute types would create a schema storing all playlists related to a given album.  In such a case, if a playlist P features two songs from a given album, it will appear twice in the query result.  While such information might be useful for some application tasks (e.g., to find how closely a playlist is related to a given album), in many cases, such as in Q5, the application needs to display only distinct playlists.  To accommodate the latter scenario, KDM provides an advanced feature, called “find distinct”, that we use on id, name and tags to find distinct playlists (i.e., distinct combinations of id, name and tags) in Q5, as shown in Fig. 4(a).  Fig. 5 shows the complete Q5 access pattern.  Similarly, to find distinct genres in Q6, we use “find distinct” on Album.genre, as shown in Fig. 4(b).


Fig. 4. Finding distinct playlists in Q5 (a), and distinct genres in Q6 (b).


Fig. 5. The Q5 access pattern.

Q7 and Q8 differ from any of the access patterns discussed so far in an important way. As shown in the figure below, Q7 and Q8 retrieve users who created and shared a given playlist, respectively.


Note that there are multiple ways in which a user can be related to a playlist – via creates, shares, or plays relationships. Therefore, Q7 as well as Q8 must explicitly define which relationship path is considered, as this affects the resulting logical data model. Indeed, we will see that Q7 and Q8 require two different table schemas. Because specifying relationship paths explicitly is only needed in ER models with cycles, we call such access patterns cyclic.

As shown in Fig. 6, to specify Q7 we change the default “Simple Access Pattern” setting to “Cyclic Access Pattern”. Upon selecting what is given and what is to be found in the query using the “given value” and “find value” menu items respectively, KDM creates dropdown lists for entity types in the GIVEN and FIND tables (Fig. 6). In the case of both Q7 and Q8, we leave the default selections in these dropdowns, as shown in the figure. We will explain the purpose of these dropdowns when we discuss Q9 and Q10.



Fig. 6. Specifying the Q7 access pattern.

KDM lists all entities/relationships involved in the query, and asks the user to configure each path.  We click the “configure path” button to have KDM find and list all the paths between the Playlist and the User types.  We choose the path through “creates” from the generated list, as shown in Fig. 6.  This completes the definition of Q7.  Q8 is specified similarly, as shown in Fig. 7.


Fig. 7. Specifying the Q8 access pattern.

As shown on the left, the Q9 and Q10 queries retrieve users invited by a given user, and a user who invited a given user, respectively.  The Q9 access pattern is specified as shown in Fig. 8.  First, since the User entity type appears twice in Q9 in two different roles, to distinguish between the two, we choose “User_inviter” in the dropdown under “GIVEN”, and “User_invitee” in the dropdown under “FIND” (Fig. 8).  Next, we press the “configure path” button at the bottom.  If an entity type appears on both ends of a path, e.g., User_inviter and User_invitee in both Q9 and Q10, we need to specify the path ourselves by choosing the “specify custom path” option, as shown in Fig. 8(a).  Fig. 8(a,b,c) shows the step-by-step process of entering the path between User_inviter and User_invitee, i.e. “User_inviter-inviter-invites-invitee-User_invitee”.  Fig. 9 shows the final Q9 tab.


Fig. 8. Entering a relationship path User_inviter-inviter-invites-invitee-User_invitee.


Fig. 9. The complete Q9 access pattern.

The Q10 access pattern is specified similarly, except that the path direction is now reversed: “User_invitee-invitee-invites-inviter-User_inviter” (Fig. 10).


Fig. 10. The Q10 access pattern.

The Q11 access pattern retrieves playlists created by a given user featuring music in a given style. Since there are multiple paths between the User entity type and the style attribute (whose parent entity type is Artist), we must specify the path between the User and the Artist types, as shown in Fig. 11.


Fig. 11. The Q11 access pattern.

Once all the access patterns have been defined, we generate a Cassandra logical data model by clicking the corresponding button.

Step 3: Select a Logical Data Model

Fig. 12 shows a logical data model generated by KDM for our use case.  It consists of a set of tables that can efficiently support the specified access patterns.  The tables are shown in the Chebotko notation [1,2], where K denotes a partition key column, C denotes a clustering key column whose values are stored in ascending (↑) or descending (↓) clustering order, and S denotes a static column.  Finally, {} denotes a set column, while [] and <> are used for list and map columns, respectively.

KDM automatically computes logically correct primary keys, and determines which columns are static (e.g., in table0).  For some of our access patterns, several alternative table schemas are possible, in which case KDM produces a list of such schemas for the user to choose from.  Whenever possible, KDM derives meaningful table names from the access patterns.  The table schemas are generated according to the mapping rules defined in [1,2].  This frees us from having to manually perform the conceptual-to-logical mapping.  We choose one table schema per access pattern, and click the corresponding button to have KDM generate a physical data model with default data types.


Fig. 12. A logical data model generated by KDM for our use case.

Step 4: Configure the Physical Data Model

The screenshot in Fig. 13 shows a part of a physical data model produced by KDM for our use case. A CQL query equivalent to the access pattern is shown underneath each corresponding table schema. We perform various physical optimizations concerning data types, table and column names and partition sizes. Among other things, we have changed table names table0 and table4 to albums_by_genre and albums_by_genre_country, respectively. The SELECT queries in the final CQL script will reflect this change.


Fig. 13. The first three tables of the physical data model for our use case.

The complete physical data model is visualized in Fig. 14 as a Chebotko diagram.  A Chebotko diagram [1,2] presents a database schema as a combination of individual table schemas and query-driven application workflow transitions (KDM does not currently draw Chebotko diagrams).  As part of the physical optimization, we have split partitions in the Playlist_by_album and Playlist_by_album_distinct tables (queries Q4 and Q5, respectively).  We added a bucket column to each table, so that playlists are uniformly distributed across the buckets.  The bucket column stores the bucket number.
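The bucketing optimization can be sketched in CQL as follows (column names here are illustrative, not KDM’s exact output):

```sql
-- Splitting large album partitions across buckets (illustrative names).
CREATE TABLE playlists_by_album (
    album_title text,
    album_year int,
    bucket int,          -- the bucket number becomes part of the partition key
    playlist_id uuid,
    playlist_name text,
    PRIMARY KEY ((album_title, album_year, bucket), playlist_id)
);

-- Answering Q4 now means querying each bucket for the album, e.g.:
SELECT * FROM playlists_by_album
 WHERE album_title = 'Nevermind' AND album_year = 1991 AND bucket = 0;
```

The trade-off is one extra query per bucket on reads in exchange for partitions that stay comfortably small as the number of playlists per album grows.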


Fig. 14. The Chebotko Diagram for our use case.

We now press the CQL generation button to generate a CQL script capturing our physical data model.

Step 5: Download a CQL Script

Fig. 15 shows a CQL script produced by KDM that can be readily executed against a Cassandra cluster to create a schema.


Fig. 15. A CQL script generated in KDM.


In this second article of our series, we have used KDM for a more complex, real-life data modeling use case – a media cataloguing application.  The use case involves roles, cyclic queries, and queries over multiple entities and relationships.  KDM supports such complex real-life scenarios by providing a number of advanced features, such as setting alternative keys, finding distinct values, and specifying cyclic access patterns.  The use case is available in KDM, under the “Use Cases -> Media Cataloguing” menu.



This work would not have been possible without the inspiring ideas and helpful feedback of Dr. Artem Chebotko and Mr. Anthony Piazza. Andrey Kashlev would also like to thank Dr. Shiyong Lu for his support of this project.



[1] Artem Chebotko, Andrey Kashlev, Shiyong Lu, “A Big Data Modeling Methodology for Apache Cassandra”, IEEE International Congress on Big Data, 2015 (in press).

[2] DataStax Training, DS220: Data Modeling with DataStax Enterprise

