KDM: An Automated Data Modeling Tool for Apache Cassandra, Pt. 2

September 2, 2015

By Andrey Kashlev, Big Data Researcher at Wayne State University
Andrey Kashlev is a PhD candidate in big data, working in the Department of Computer Science at Wayne State University. His research focuses on big data, including data modeling for NoSQL, big data workflows, and provenance management. He has published numerous research articles in peer-reviewed international journals and conferences, including IEEE Transactions on Services Computing, Data and Knowledge Engineering, International Journal of Computers and Their Applications, and the IEEE International Congress on Big Data. Catch Andrey at Cassandra Summit 2015, presenting “World’s Best Data Modeling Tool for Apache Cassandra”.

This is the second article of our series on KDM, an online tool that automates Cassandra data modeling. We highly recommend that you first read part 1 of this series, where we give an overview of the tool and use it to build a data model for an IoT application. In this article, we will demonstrate a more complex use case and cover some of the advanced features of KDM.

 

Use Case: A Media Cataloguing Application

We will create a data model for a media cataloguing application that manages information about artists, their albums, songs, users, and playlists. Users often browse through artists, albums and songs, create, play, and share playlists, and invite their friends to sign up for the app. We use KDM to design a database that will efficiently support our application. Database design using KDM consists of five steps, starting with a conceptual data model and ending with a Cassandra database schema.

 

Step 1: Design a Conceptual Data Model.

As we discussed in part 1, we first design a conceptual data model for our application (Fig. 1).


Fig. 1. A conceptual data model for our media cataloguing application.

To ensure the correctness of our data model, we must carefully specify key and cardinality constraints. Note that a user can be identified by either username or email. Thus, we specify the two alternative keys by right-clicking on User -> Set keys, as shown in Fig. 2. We will later see that KDM uses this information when generating a logical data model. Optionally, we may annotate username and email as key1 and key2 on the conceptual data model (Fig. 1). We also assume that an album is uniquely identified by a combination of title and year.


Fig. 2. Specifying alternative keys username and email for the User entity type.

Step 2: Specify Access Patterns

We now specify data access patterns associated with the application tasks. Consider the application workflow shown in Fig. 3. It models a web-based application that allows users to interact with various web pages (tasks) to retrieve data using well-defined queries. For example, upon logging in, a user starts with browsing albums of a particular artist, genre, or date (Q1–Q3). Once the user finds an album of interest, he searches for playlists featuring this album (Q4, Q5), etc.


Q1: Find albums by a given artist. Order results by year (DESC).

Q2: Find albums by genre released after a given year. Order results by year (DESC).

Q3: Find albums of a given genre released in a given country after a given year. Order results by year (DESC).

Q4: Find playlists by a given album.

Q5: Find distinct playlists by a given album.

Q6: Find music genres featured by a given playlist. Show distinct genres.

Q7: Find a user who created a given playlist.

Q8: Find users who shared a given playlist.

Q9: Find users invited by a given user.

Q10: Find a user who invited a given user.

Q11. Find distinct playlists created by a given user, featuring music of a given style.

Q12. Find distinct albums featured by a given playlist.

Fig. 3. An application workflow for our media cataloguing use case.

We now discuss several advanced features of KDM concerning access patterns.

The Q4 access pattern retrieves all playlists featuring songs from a given album. By default, using the “find value” feature on the Playlist.id, name, and tags attribute types would create a schema storing all playlists related to a given album. In such a case, if a playlist P features two songs from a given album, it will appear twice in the query result. While such information might be useful for some application tasks (e.g., to find how closely a playlist is related to a given album), in many cases, such as in Q5, the application needs to display only distinct playlists. To accommodate the latter scenario, KDM provides an advanced feature, called “find distinct”, that we use on Playlist.id, name, and tags to find distinct playlists (i.e., distinct combinations of id, name, and tags) in Q5, as shown in Fig. 4(a). Fig. 5 shows the complete Q5 access pattern. Similarly, to find distinct genres in Q6, we use “find distinct” on Album.genre, as shown in Fig. 4(b).


Fig. 4. Finding distinct playlists in Q5 (a), and distinct genres in Q6 (b).


Fig. 5. The Q5 access pattern.

Q7 and Q8 differ from any of the access patterns discussed so far in an important way. As shown in the figure below, Q7 and Q8 retrieve users who created and shared a given playlist, respectively.


Note that there are multiple ways in which a user can be related to a playlist – via creates, shares, or plays relationships. Therefore, Q7 as well as Q8 must explicitly define which relationship path is considered, as this affects the resulting logical data model. Indeed, we will see that Q7 and Q8 require two different table schemas. Because specifying relationship paths explicitly is only needed in ER models with cycles, we call such access patterns cyclic.

As shown in Fig. 6, to specify Q7 we change the default “Simple Access Pattern” setting to “Cyclic Access Pattern”. Upon selecting what is given and what is to be found in the query using the “given value” and “find value” menu items respectively, KDM creates dropdown lists for entity types in the GIVEN and FIND tables (Fig. 6). In the case of both Q7 and Q8, we leave the default selections in these dropdowns, as shown in the figure. We will explain the purpose of these dropdowns when we discuss Q9 and Q10.

 


Fig. 6. Specifying the Q7 access pattern.

KDM lists all entities/relationships involved in the query, and asks the user to configure each path. We click the “configure path” button to have KDM find and list all the paths between the Playlist and User types. We choose “,creates,” from the generated list, as shown in Fig. 6. This completes the definition of Q7. Q8 is specified similarly, as shown in Fig. 7.


Fig. 7. Specifying the Q8 access pattern.

The Q9 and Q10 queries retrieve users invited by a given user, and the user who invited a given user, respectively. The Q9 access pattern is specified as shown in Fig. 8. First, since the User entity type appears twice in Q9 in two different roles, to distinguish between the two, we choose “User_inviter” in the dropdown under “GIVEN”, and “User_invitee” in the dropdown under “FIND” (Fig. 8). Next, we press the “configure path” button at the bottom. If an entity type appears on both ends of a path, e.g., User_inviter and User_invitee in both Q9 and Q10, we need to specify the path ourselves, by choosing the “specify custom path” option, as shown in Fig. 8(a). Fig. 8(a,b,c) shows the step-by-step process of entering the path between User_inviter and User_invitee, i.e. “User_inviter-inviter-invites-invitee-User_invitee”. Fig. 9 shows the final Q9 tab.


Fig. 8. Entering a relationship path User_inviter-inviter-invites-invitee-User_invitee.


Fig. 9. The complete Q9 access pattern.

The Q10 access pattern is specified similarly, except that the path direction is now reversed: “User_invitee-invitee-invites-inviter-User_inviter” (Fig. 10).


Fig. 10. The Q10 access pattern.

The Q11 access pattern retrieves playlists created by a given user featuring music in a given style. Since there are multiple paths between the User entity type and the style attribute (whose parent entity type is Artist), we must specify the path between the User and the Artist types, as shown in Fig. 11.


Fig. 11. The Q11 access pattern.

Once all the access patterns have been defined, we generate a Cassandra logical data model by clicking the corresponding button on the toolbar.

Step 3: Select a Logical Data Model

Fig. 12 shows a logical data model generated by KDM for our use case. It consists of a set of tables that can efficiently support the specified access patterns. The tables are shown in the Chebotko notation [1,2], where K denotes a partition key column, C denotes a clustering key column whose values are stored in ascending (↑) or descending (↓) clustering order, and S denotes a static column. Finally, {} denote a set column, [] and <> are used for list and map columns, respectively.
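To make the notation concrete, here is how a table with a partition key, clustering columns, and a static column translates into CQL (the table and column names below are illustrative, not KDM’s actual output):

CREATE TABLE albums_by_artist (
    artist_name TEXT,             -- K: partition key column
    year INT,                     -- C (↓): clustering column, descending order
    title TEXT,                   -- C (↑): clustering column, ascending order
    artist_style TEXT STATIC,     -- S: static column, one value per partition
    genre TEXT,                   -- regular column
    PRIMARY KEY ((artist_name), year, title)
) WITH CLUSTERING ORDER BY (year DESC, title ASC);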
KDM automatically computes logically correct primary keys, and determines which columns are static (e.g., in table0). For some of our access patterns, several alternative table schemas are possible, in which case KDM produces a list of such schemas for the user to choose from. Whenever possible, KDM derives meaningful table names from the access patterns. The table schemas are generated according to the mapping rules defined in [1,2]. This frees us from having to manually perform the conceptual-to-logical mapping. We choose one table schema per access pattern, and click the corresponding button to have KDM generate a physical data model with default data types.


Fig. 12. A logical data model generated by KDM for our use case.

Step 4: Configure the Physical Data Model

The screenshot in Fig. 13 shows a part of the physical data model produced by KDM for our use case. A CQL query equivalent to the access pattern is shown underneath each corresponding table schema. We perform various physical optimizations concerning data types, table and column names, and partition sizes. Among other things, we have changed the table names table0 and table4 to albums_by_genre and albums_by_genre_country, respectively. The SELECT queries in the final CQL script will reflect this change.


Fig. 13. The first three tables of the physical data model for our use case.

The complete physical data model is visualized in Fig. 14 as a Chebotko diagram. A Chebotko diagram [1,2] presents a database schema as a combination of individual table schemas and query-driven application workflow transitions (KDM does not currently draw Chebotko diagrams). As part of the physical optimization, we have split partitions in the Playlist_by_album and Playlist_by_album_distinct tables (queries Q4 and Q5, respectively). We added a bucket column to each table, so that playlists will be uniformly distributed across the buckets; each row stores the number of the bucket it belongs to.


Fig. 14. The Chebotko Diagram for our use case.
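For illustration, a split-partition table along these lines could be defined in CQL as follows (a sketch with assumed column names; the actual KDM-generated schema may differ):

CREATE TABLE playlists_by_album (
    album_title TEXT,
    album_year INT,
    bucket INT,                   -- splits each album's playlists across partitions
    playlist_id UUID,
    playlist_name TEXT,
    tags SET<TEXT>,
    PRIMARY KEY ((album_title, album_year, bucket), playlist_id)
);

With this scheme, the application would compute the bucket number on write and read from all of an album’s buckets when querying.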

We now press the button to generate a CQL script capturing our physical data model.

Step 5: Download a CQL Script

Fig. 15 shows a CQL script produced by KDM that can be readily executed against a Cassandra cluster to create the schema.


Fig. 15. A CQL script generated in KDM.

Summary

In this second article of our series, we have used KDM for a more complex, real-life data modeling use case – a media cataloguing application. The use case involves roles, cyclic queries, and queries over multiple entities and relationships. KDM supports such complex real-life scenarios by providing a number of advanced features, such as setting alternative keys, finding distinct values, and specifying cyclic access patterns. The use case is available in KDM under the “Use Cases -> Media Cataloguing” menu.

 

Acknowledgements

This work would not have been possible without the inspiring ideas and helpful feedback of Dr. Artem Chebotko and Mr. Anthony Piazza. Andrey Kashlev would also like to thank Dr. Shiyong Lu for his support of this project.

 

References

[1] Artem Chebotko, Andrey Kashlev, Shiyong Lu, “A Big Data Modeling Methodology for Apache Cassandra”, IEEE International Congress on Big Data (in press), 2015.

[2] DataStax Training, DS220: Data Modeling with DataStax Enterprise.

 

Building an ASP.NET MVC project with Cassandra

August 31, 2015

By Aaron Ploetz, Cloud Data Platforming Engineering Consultant at Target
Aaron Ploetz is the Cloud Data Platforming Engineering Consultant for Cassandra at Target. He is active in the Cassandra tags on Stack Overflow, and recently became the first recipient of a tag-specific badge in CQL. Aaron holds a B.S. in Management/Computer Systems from the University of Wisconsin-Whitewater and an M.S. in Software Engineering and Database Technologies from Regis University, and was selected as a DataStax MVP for Apache Cassandra in 2014. Catch Aaron at Cassandra Summit 2015 presenting “Escaping Disco-Era Data Modeling”.

Technical editing by: Paul Lemke, Lead Application Architect, and Joe Moses, Senior Application Architect, at AccuLynx.

 

Have you ever noticed something about code examples for working with Cassandra?  They’re (almost) all written in Java.  While there are Cassandra drivers available for working with several languages, it can be challenging to find useful, working examples.

For those of us who use Cassandra with C# in the .NET world, this problem is all-too familiar.  By and large, we are left scouring the second and (gasp) sometimes third pages of our Google searches.  And often hoping that the links don’t lead us to an unanswered question on Stack Overflow.  When we inevitably come up with nothing, we are left to blaze our own trail.

This article is based on my journey while building a tool using ASP.NET MVC 5.  I thought it would be helpful to write a quick tutorial on this subject.  That way I could help others avoid the perils of augmenting a non-SQL Server model for their .NET Model/View/Controller (MVC) projects.

1. Create the Cassandra Table

First, let’s get our underlying Cassandra table ready.  Assume that I want to build a simple table to keep track of data from the crew of Serenity (à la Joss Whedon’s Firefly).  I’ll create a simple table definition in my “presentation” keyspace:
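(A sketch of such a table definition; the column names are assumed for illustration.)

CREATE TABLE presentation.shipcrew (
    shipname TEXT,                -- partition key: one partition per ship
    lastname TEXT,
    firstname TEXT,
    role TEXT,
    PRIMARY KEY (shipname, lastname, firstname)
);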

After inserting some data, I’ll run a quick SELECT:
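(Again a sketch, assuming the columns above.)

INSERT INTO presentation.shipcrew (shipname, lastname, firstname, role)
VALUES ('Serenity', 'Reynolds', 'Malcolm', 'Captain');

SELECT * FROM presentation.shipcrew WHERE shipname = 'Serenity';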

2. Create a new ASP.NET MVC project

Things are all set on the Cassandra side.  Now, let’s switch over to Visual Studio 2013, and create a new project.  Under the Visual C# templates I’ll select “Web” and then indicate that I want to create a new “ASP.NET Web Application.”  Note that the templates available and options to create a new ASP.NET MVC project may vary depending on your version of Visual Studio.

Screen Shot 2015-08-31 at 6.17.19 AM

Figure 1.  Creating a new ASP.Net Web Application

After naming my project “ShipCrew” I will select the “MVC” template on the next dialog.

Screen Shot 2015-08-31 at 6.19.13 AM

Figure 2.  Selecting the MVC template for my project.

At this point, you may have noticed that the new project already contains some example files to support a few different types of sample applications.  These applications will each have their own models, views, and controllers.  As the focus of this tutorial is to demonstrate how to use Cassandra with ASP.NET MVC, I will skip the part where I recommend removing the example applications (oops).

3. Get the DataStax C# Driver for Apache Cassandra

In my Solution Explorer I will right-click on my project and indicate that I wish to “Manage NuGet Packages.”  I’ll enter “datastax” in the NuGet search input, and the DataStax Cassandra C# Driver package should be returned fairly quickly.  From here I will indicate that I wish to install this package.

Screen Shot 2015-08-31 at 6.20.29 AM

Figure 3.  Installing DataStax C# Cassandra driver through NuGet.

4. Create the initial model, view, and controller files

Next in the Solution Explorer, I will right-click on the “Controllers” folder and indicate that I want to add a new “Controller.”  When prompted to select the more-specific controller types, I’ll just indicate that I want to add an “MVC 5 Controller – Empty.”  I’ll name it “ShipCrewController,” remembering that it is important for it to contain the word “Controller” at the end.

Now I will right-click on the “Models” folder in Solution Explorer, and add a new class.  I’ll name the class “ShipCrewModels.cs.”  I’ll make sure that I’m “using” the Cassandra.Mapping.Attributes library.  This way I can use the “Table” annotation to specify the keyspace and table name to the driver’s mapper component.  After defining the properties (and ensuring that they match the column names in my table) the code will look like this:
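(A sketch of the model class, assuming the shipcrew columns used earlier.)

using Cassandra.Mapping.Attributes;

namespace ShipCrew.Models
{
    // The Table attribute tells the driver's mapper which keyspace and table to use.
    [Table("shipcrew", Keyspace = "presentation")]
    public class ShipCrewModels
    {
        [PartitionKey]
        public string ShipName { get; set; }

        [ClusteringKey(0)]
        public string LastName { get; set; }

        [ClusteringKey(1)]
        public string FirstName { get; set; }

        public string Role { get; set; }
    }
}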

With the Controller and Model in place, the next step is to create the view.  If you look at the code of the ShipCrewController.cs, it should look something like this:
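(The standard empty-controller scaffolding generated by Visual Studio.)

using System.Web.Mvc;

namespace ShipCrew.Controllers
{
    public class ShipCrewController : Controller
    {
        // GET: ShipCrew
        public ActionResult Index()
        {
            return View();
        }
    }
}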

If you want to name your view something other than “Index,” you are free to do so.  But for the purposes of this example, I’ll just leave it like it is.  Next I’ll right-click on the “Index” method name, and a menu should appear.  From this menu I will select “Add View”:

Screen Shot 2015-08-31 at 6.24.04 AM

Figure 4.  Right-click on Index and select “Add View…”

This next step is important.  In the “Add View” dialog that pops up, the view name should be populated with the method name of my controller method.  Since I want to be able to query Cassandra for multiple rows of data and display them, I’ll select the “List” template.  Next, I’ll indicate the “ShipCrewModels” class as my “Model class.”  Very important…I am going to leave the Data context class empty, since we are not using Entity Framework (EF).  I will then click the “Add” button.

Screen Shot 2015-08-31 at 6.24.37 AM

Figure 5.  Add view dialog.  Note that the “Data context class” field is left empty.

If you look at the code for the view that was just created, you’ll see that Visual Studio has automatically generated a display page based on the ShipCrewModels class.  This should also be representative of the “ShipCrew” Cassandra table created earlier.

5. Write the Cassandra data access classes and methods

Again, since we are not using SQL Server with Entity Framework (use of the EF pattern with Cassandra can lead to performance issues) I have some extra steps.  First of all, I need to write a class to handle my Cassandra connection.  Then I need a data access object (DAO) class to handle my DataStax driver methods, along with an interface for each class.  I’ll start by creating a new folder in the root of my project called “DAO.”  This folder will hold my CassandraDAO and ShipCrewDAO class files.

Inside my CassandraDAO.cs, I’ll start by creating a small interface in the same file.
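(A sketch; I am assuming the interface only needs to expose the shared Session.)

using Cassandra;

namespace ShipCrew.DAO
{
    public interface ICassandraDAO
    {
        ISession GetSession();
    }
}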

Next, I’ll write the CassandraDAO class and have it use the ICassandraDAO interface.  I’ll make sure it has a no-argument constructor, and that it has private, static variables for Cluster and Session.  This is important, as the best practice for working with the DataStax C# driver is to create these kinds of objects once, and reuse them.  I will also create methods to allow the Cluster to be set, as well as to expose the Session.

Lastly, I’ll define methods to connect to my cluster, as well as a quick “getter” for the “appSettings” section of my web.config.  Note that for this to work properly, I will also have to define appSettings (in my web.config) for “cassandraUser,” “cassandraPassword,” and “cassandraNodes”.
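(A sketch of the complete class under the assumptions above; the exact connection options can vary.)

using System.Configuration;
using Cassandra;

namespace ShipCrew.DAO
{
    public class CassandraDAO : ICassandraDAO
    {
        // Best practice: create the Cluster and Session once, and reuse them.
        private static Cluster cluster;
        private static ISession session;

        public CassandraDAO()
        {
            if (session == null)
            {
                Connect();
            }
        }

        private static void SetCluster(Cluster aCluster)
        {
            cluster = aCluster;
        }

        private static void Connect()
        {
            SetCluster(Cluster.Builder()
                .AddContactPoints(GetAppSetting("cassandraNodes").Split(','))
                .WithCredentials(GetAppSetting("cassandraUser"),
                                 GetAppSetting("cassandraPassword"))
                .Build());
            session = cluster.Connect();
        }

        // Quick getter for the appSettings section of web.config.
        private static string GetAppSetting(string key)
        {
            return ConfigurationManager.AppSettings[key];
        }

        public ISession GetSession()
        {
            return session;
        }
    }
}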

With that finished, I will now build my ShipCrewDAO class in a similar manner.  First, I’ll create the IShipCrewDAO interface in the same file, enforcing the implementation of a “getCrew” method.  I also want that method to be asynchronous, so I will use the “Task” class of type “IEnumerable<ShipCrewModels>”.

I’ll create my ShipCrewDAO class next, instructing it to inherit from my IShipCrewDAO interface.  It will have “protected” variables for the session and mapper, as well as the implementation of “getCrew.”
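(A sketch of both the interface and the class, consistent with the model above.)

using System.Collections.Generic;
using System.Threading.Tasks;
using Cassandra;
using Cassandra.Mapping;
using ShipCrew.Models;

namespace ShipCrew.DAO
{
    public interface IShipCrewDAO
    {
        Task<IEnumerable<ShipCrewModels>> getCrew();
    }

    public class ShipCrewDAO : IShipCrewDAO
    {
        protected ISession session;
        protected IMapper mapper;

        public ShipCrewDAO()
        {
            session = new CassandraDAO().GetSession();
            mapper = new Mapper(session);
        }

        // Asynchronously fetch all rows and map them onto the model class.
        public async Task<IEnumerable<ShipCrewModels>> getCrew()
        {
            return await mapper.FetchAsync<ShipCrewModels>();
        }
    }
}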

6. Modify the controller to use Cassandra

Now I can revisit my ShipCrewController code.  First, I’ll create a private static variable for the DAO, as well as a getter for it.  The getter will also check whether or not the DAO is null, and instantiate it if it is.

At this point, I will also change the “Index” method.  As I want it to be called asynchronously, I will alter the return type to be Task<ActionResult>.  Additionally, this is where I will call the “getCrew” DAO method, and bring the data from Cassandra to my view.
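(A sketch of the revised controller, consistent with the DAO above.)

using System.Collections.Generic;
using System.Threading.Tasks;
using System.Web.Mvc;
using ShipCrew.DAO;
using ShipCrew.Models;

namespace ShipCrew.Controllers
{
    public class ShipCrewController : Controller
    {
        private static IShipCrewDAO dao;

        private static IShipCrewDAO GetDAO()
        {
            // Instantiate the DAO (and its Cassandra connection) only once.
            if (dao == null)
            {
                dao = new ShipCrewDAO();
            }
            return dao;
        }

        // GET: ShipCrew - now asynchronous, pulling rows from Cassandra.
        public async Task<ActionResult> Index()
        {
            IEnumerable<ShipCrewModels> crew = await GetDAO().getCrew();
            return View(crew);
        }
    }
}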

Note that my DAO is defined as a static variable.  By doing this, I am making the application aware of my Cassandra connection as a static object, so it will not need to create a new connection for every request.  If you are going to build your MVC/Cassandra application into a solution with multiple projects (potentially sharing a single connection), you may need to use SimpleInjector to accomplish this.  In that case, take a look at the “SimpleInjector” branch in my GitHub repository for more information.

7. Ready to rock!

At this point, I should be able to run my ASP.NET MVC application in my browser.  After clicking the “Start” button in Visual Studio, the default page will show up.  I’ll modify the URL to “localhost:63252/ShipCrew/Index” and the page with my data should come up.  Note that the first folder should match the controller name (minus the word “Controller”) of “ShipCrew.”  And I have not renamed my view, so the word “Index” should follow it (after a slash, of course).  When the page loads, you should see something similar to this:

Screen Shot 2015-08-31 at 6.29.32 AM

Figure 6.  Finished product!  Well, sort of.

Of course, there is still work to be done to round-out the final page.  There’s plenty that can be removed or altered to fit my requirements.  But at this point I do have a simple, working ASP.NET MVC 5 application displaying data from Cassandra.

The complete code for this tutorial can be found in my GitHub repository at github.com/aploetz/ShipCrew.  DataStax’s Luke Tillman has also created some useful resources for developers building Cassandra applications with C# in the .NET world.

Happy coding!

KDM: An Automated Data Modeling Tool for Apache Cassandra, Pt. 1

August 25, 2015

By Andrey Kashlev, Big Data Researcher at Wayne State University
Andrey Kashlev is a PhD candidate in big data, working in the Department of Computer Science at Wayne State University. His research focuses on big data, including data modeling for NoSQL, big data workflows, and provenance management. He has published numerous research articles in peer-reviewed international journals and conferences, including IEEE Transactions on Services Computing, Data and Knowledge Engineering, International Journal of Computers and Their Applications, and the IEEE International Congress on Big Data. Catch Andrey at Cassandra Summit 2015, presenting “World’s Best Data Modeling Tool for Apache Cassandra”.

 

Data modeling is one of the most important steps ensuring performance and scalability of Cassandra-powered applications. The existing Chebotko data modeling methodology [1,2] lays out important data modeling principles, rules and patterns to translate a real-world domain model into a running Cassandra schema. While this approach enables rigorous and sound schema design, it requires specialized training and experience.

In this article, we will showcase our online tool that streamlines and automates the Cassandra data modeling methodology proposed by Chebotko et al. [1,2]. Our tool, called KDM, minimizes schema design effort on the user’s part, by automating the entire data modeling cycle, starting from a conceptual data model and ending with a CQL script. Specifically, KDM automates the most complex, error-prone, and time-consuming data modeling tasks: conceptual-to-logical mapping, logical-to-physical mapping, physical optimization, and CQL generation. We will design data models for two real life use cases from the IoT and media cataloguing domains. KDM is available for free and can be used by developers to build Cassandra data models, as well as by course instructors and students to teach/learn NoSQL data modeling.

 

The Big Picture

Fig. 1 presents an overview of the data modeling process in KDM. A Cassandra solution architect, a role that encompasses both database design and application design tasks, starts data modeling by building a conceptual data model and specifying data access patterns (steps 1 and 2). The access patterns are blueprints of queries that the application will need to run against the database. Next, KDM automatically produces a set of correct logical data models. A logical data model specifies Cassandra tables that efficiently support application queries described during step 2. The solution architect selects the preferred data model (step 3), and KDM generates a physical data model, which the user needs to configure (step 4). Finally, KDM generates a CQL script that the solution architect can download to instantiate the physical data model in a Cassandra cluster (step 5).


Fig. 1. KDM’s data modeling automation workflow.

 

The crux of Cassandra data modeling is the conceptual-to-logical mapping, performed according to the rules defined in [1,2]. An error in this mapping may lead to data loss, poor query performance, or even an inability to support queries. The crucial innovation of KDM is that through automation, the tool eliminates human error from conceptual-to-logical mapping, thereby ensuring sound schema design.

 

Use Case: An IoT Application

We use KDM to create a data model for an IoT application. The application needs to store and query information about sensor networks, sensors themselves, and their measurements, which include temperature, humidity, and pressure. Common application tasks include finding information about sensors in a given network as well as displaying all measurements of a particular sensor.

 

Step 1: Design a Conceptual Data Model.

We start data modeling by describing the data that our application will manage. We drag-and-drop and connect entity, relationship, and attribute types, as shown in Fig. 2. Whenever needed, we can pan the canvas by dragging it with the right mouse button. Two things are essential here.

First, it is important that we specify a key for each entity type.  A key is a minimal set of attribute types that uniquely identify an entity or a relationship. For example, name is the key of the Network entity type, id is the key of the Sensor entity type, and a combination {Sensor.id, records.timestamp, Measurement.parameter} is the key of the Measurement type. Intuitively, each measurement is uniquely identified by the combination of sensor id, timestamp, and the parameter being measured, such as temperature or pressure.

A key is specified by selecting the appropriate attribute type(s) (e.g., name) and pressing the KEY button in the toolbar. The key of the Measurement type is specified by right-clicking on Measurement -> Set keys, which opens a dialog to set a compound key. Optionally, we may specify keys for relationship types.

Second, we must specify cardinality constraints, such as 1:1, 1:n, and m:n. Fig. 2 shows the complete conceptual data model for our use case. We will later see how KDM uses the information we have provided when generating a logical data model.


Fig. 2. A conceptual data model for our IoT application.

 

Step 2: Specify Access Patterns

Cassandra data modeling is a query-driven process that treats data access patterns as first-class citizens. An access pattern prescribes what the query is searching on, searching for, and how the results should be ordered. Based on the queries that our application will run against the database, we compose a list of three access patterns:

Q1: Find information about all sensors in a given network.

Q2: Find measurements of a given sensor on a given date. Order results by timestamp, showing the most recent results first, i.e., order results by timestamp (DESC).

Q3: Find measurements of a given sensor of a particular quantity (i.e. a parameter, such as temperature, humidity, etc.). Order results by timestamp (DESC).

To specify data access patterns, we switch to the second tab (step 2). In Q1, since we perform equality search on network name, we right-click on the name attribute type and choose “given value (=)”, as shown in Fig. 3(a). The “name” should appear in the Q1 tab of the Access Patterns panel under “GIVEN”. We now specify searched-for attribute types: sensor id, location, and characteristics. We right-click on the id attribute and choose “find value”, as shown in Fig. 3(b).       


Fig. 3. Specifying the Q1 access pattern.

 

We do the same for location and characteristics. For documentation purposes, it is a good idea to type in a verbal description of the access pattern in the text area. This completes the specification of Q1. The Access Patterns panel should now look as follows:


Fig. 4. The Access Patterns panel after specifying Q1.

 

The Q2 access pattern is specified similarly. Recall that Q2 asks to find all measurements of a given sensor on a given date, and to order results by timestamp (DESC). Thus, two attributes are in the “GIVEN” category – sensor id and date. To request that query results be ordered by timestamp in descending order, instead of “find value” we click “find and sort DESC” after right-clicking on the timestamp attribute type:


Fig. 5. Requesting the query results to be ordered by timestamp in descending order in Q2.

The Q2 tab of the Access Patterns panel should look as follows:


Fig. 6. The Q2 access pattern.

The access pattern Q3 is specified similarly:

 


Fig. 7. The Q3 access pattern.

After defining access patterns, we need to perform a conceptual-to-logical mapping. This is a critical step that translates a conceptual data model into a set of tables, based on the specified access patterns. It is crucial that this mapping is performed correctly. An incorrect data model may lead to data loss or an inability to support application queries. Using its advanced algorithms, KDM automatically produces correct logical data models. This eliminates the need to manually map a conceptual data model onto Cassandra tables. To generate logical data models, we click the mapping button on the toolbar.

 

Step 3: Select a Logical Data Model

KDM has produced table schemas for each of the three access patterns (Fig. 8). It automatically determined logically correct primary keys, and produced all possible solutions that meet the requirements we have specified in the previous two steps.


Fig. 8. A logical data model generated by KDM for our IoT use case.

 

The above table schemas are shown using the Chebotko notation [1,2], where K denotes a partition key column, and C denotes a clustering key column whose values are stored in ascending (↑) or descending (↓) clustering order. Finally, {} denote a set column, [] and <> are used for list and map columns, respectively.

While for some queries (such as Q1) there is only one correct table schema, other queries (Q2 and Q3) can be accommodated by several alternative table schemas. In such cases KDM produces all such schemas, so that the user can choose the one he prefers. By default, the first table schema for each query is selected (highlighted in red). If needed, the user can select a different schema. For example, we will select the last table schema for Q2 (table3), since it has smaller partitions than the first two schemas, table0 and table1. Intuitively, because {id, date} is a more selective partition key than {id} (or {date}), each partition will store fewer rows. We will also select the measurements_by_sensor3 table for Q3, using a similar rationale. Whenever possible, KDM suggests meaningful table names, e.g., sensors_by_network. In other cases, a default name is produced, such as table0, table1, etc.

After selecting a logical data model, we press the button to have KDM generate a physical data model with default settings.

 

Step 4: Configure the Physical Data Model.

Fig. 9 shows a physical data model produced by KDM for our use case. A CQL query equivalent to the access pattern is shown underneath each of the corresponding table schemas.


Fig. 9. A physical data model with default data types generated by KDM for our IoT use case.
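For example, the CQL equivalent of Q2 against the table3 schema selected earlier would have roughly this shape (column names assumed):

SELECT parameter, value, timestamp
FROM table3
WHERE id = ?                  -- given sensor
  AND date = ?                -- given date
ORDER BY timestamp DESC;      -- most recent measurements first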

 

As needed, we customize table and column names. Next, using the dropdowns, we select data types for all the columns. Finally, we perform appropriate physical optimizations, such as splitting or merging partitions, adding/removing columns, etc. This should be done with great caution, particularly when it comes to altering the primary key structure.

As an example of physical optimization, consider the measurements_by_sensor3 table above, which stores all measurements of a particular parameter (e.g., temperature) performed by a given sensor. Each partition of the table stores all the measurements of this parameter. Suppose we decide to split partitions into buckets, such that each bucket stores one month’s worth of measurements. To achieve this, we use the corresponding button to add a new column to the partition key, and name it “month”. The configured physical data model looks as follows:


Fig. 10. A configured physical data model for our IoT use case.
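In CQL terms, the bucketed table could look like this (a sketch with assumed column names and types):

CREATE TABLE measurements_by_sensor3 (
    id UUID,                      -- sensor id
    parameter TEXT,               -- measured quantity, e.g. temperature
    month INT,                    -- bucket: one month's worth of measurements
    timestamp TIMESTAMP,
    value FLOAT,
    PRIMARY KEY ((id, parameter, month), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);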

Once we have configured and optimized the physical data model, we press the button to generate a CQL script capturing our data model.

 

Step 5: Download a CQL Script

KDM frees us from having to write the CREATE TABLE statements by generating a CQL script capturing the database schema (Fig. 11). We choose a keyspace name and a file name, specify the replication strategy and factor, and download the .cql file. The demo.cql script can be executed against Cassandra to instantiate our physical data model.


Fig. 11. A CQL script generated by KDM for our IoT use case.
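The generated script has roughly the following shape (a sketch; the actual table definitions depend on the choices made in the previous steps):

CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

USE demo;

CREATE TABLE sensors_by_network (
    name TEXT,                    -- network name (partition key)
    id UUID,                      -- sensor id (clustering column)
    location TEXT,
    characteristics MAP<TEXT, TEXT>,
    PRIMARY KEY (name, id)
);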

Summary

In this article we introduced automated data modeling for Cassandra with KDM, using a basic example from the IoT domain. We demonstrated how KDM automates the most complex and error-prone steps of the data modeling process: conceptual-to-logical mapping, logical-to-physical mapping, physical optimization, and CQL generation. This use case is available in KDM under the “Use Cases -> Internet of Things” menu. In the second part of this series we will build a more elaborate data model for a media cataloguing application, and will demonstrate some of the advanced automation features of KDM, such as support for roles, alternative keys, cyclic access patterns, and queries involving multiple entities and relationships.

 

Acknowledgements

This work would not have been possible without the inspiring ideas and helpful feedback of Dr. Artem Chebotko and Mr. Anthony Piazza. Andrey Kashlev would also like to thank Dr. Shiyong Lu for his support of this project.

 

References

[1] Artem Chebotko, Andrey Kashlev, Shiyong Lu, “A Big Data Modeling Methodology for Apache Cassandra”, IEEE International Congress on Big Data (in press), 2015.

[2] DataStax Training, DS220: Data Modeling with DataStax Enterprise.

Top Cassandra Summit Sessions for Cassandra Beginners

August 17, 2015

For the first time, this year’s Cassandra Summit presentations span two days and there are too many great talks for me to narrow it down to my traditional top ten list. Instead, I’ll highlight the most exciting talks for the beginner, intermediate and advanced audiences.

Talks that newcomers to Cassandra will learn the most from include:

Presentations from Daniel Chia (Coursera) and Peter Connolly (Macy’s) on their respective experiences migrating from legacy relational databases to Cassandra:

Daniel Chia, Software Engineer, Coursera, Inc.: Coursera’s Adoption of Cassandra

Like many startups, Coursera began its data storage journey with MySQL, a familiar and industry-proven database. As Coursera’s user base grew from several thousand to many millions, we found that MySQL provided limited availability and restricted our ability to scale easily. New product initiatives and requirements provided a perfect opportunity to revisit our choice of core workhorse database.

After evaluating several NoSQL databases, including MongoDB, DynamoDB and HBase, we elected to transition to Cassandra. Cassandra’s relative maturity, masterless architecture (for availability), tunable consistency, and stable low-latency performance made it a clear winner for our needs.

Peter Connolly, Senior Architect, Macy’s, Inc.: Changing Engines in Mid-Flight

This presentation recounts the story of Macys.com and Bloomingdales.com’s migration from legacy RDBMS to NoSQL Cassandra in partnership with DataStax.

One thing that differentiates this talk from others on Cassandra is Macy’s philosophy of “doing more with less.” You will see why we emphasize the performance tuning aspects of iterative development when you see how much processing we can support on relatively small configurations.

This session will cover:

  1. The process that led to our decision to use Cassandra
  2. The approach we used for migrating from DB2 & Coherence to Cassandra without disrupting the production environment
  3. The various schema options that we tried and how we settled on the current one. We’ll show you a selection of some of our extensive performance tuning benchmarks, as well as how these performance results figured into our final schema designs.
  4. Our lessons learned and next steps

Aaron Ploetz’s talk on distributed data modeling:

Aaron Ploetz, Lead Database Engineer, AccuLynx: Escaping Disco-Era Data Modeling

Building high-performing Cassandra data models requires a query-based approach. However, most of us were taught to build relational, normalized data models, which do not work well with Cassandra. Poorly performing data models are often built with the idea of storing data efficiently, and then showered with secondary indexes to serve the required queries. Isn’t it time that we learn how to build 21st century data models, without using 1970’s techniques?

Rene Antunez on learning Cassandra administration as an Oracle DBA:

Rene Antunez, Cassandra DBA Team Lead, The Pythian Group: My First 100 days with a Cassandra Cluster

With Apache Cassandra being a massively scalable open source NoSQL database, and with the amount of data that we create and copy annually doubling in size every two years (expected to reach 44 zettabytes, or 44 trillion gigabytes), we can assume that sooner or later a DBA will be handling a Cassandra database in their shop. This beginner/intermediate-level session will take you through my journey as an Oracle DBA during my first 100 days administering a Cassandra cluster, including several demos and all the roadblocks and successes I encountered along the way.

Nate McCall’s talk on Cassandra security:

Nate McCall, Co-Founder, The Last Pickle: Hardening Apache Cassandra for Compliance (or Paranoia)

Every Apache Cassandra installation needs to be secured, either for compliance or security reasons. Out of the box, Cassandra is an open system, free from authentication, authorization and encryption. With little additional effort, however, it can be secured to meet most regulatory and security requirements.

In this talk Nate McCall, Co-Founder at The Last Pickle, will explain how to implement inter-node and client-server SSL Encryption, Client Authentication and Authorisation, Internode Authentication, and JMX security. While few people possess a deep understanding of security, everyone should know how to implement the basics for Apache Cassandra.

Ben Slater on when and how to upgrade your application from relational to Cassandra:

Ben Slater, Chief Product Officer, Instaclustr: When and how to migrate from a relational database to Cassandra

Many applications are initially developed using relational database technology before upsizing to Cassandra as they mature.

This presentation will examine the indicators that it is time to start considering a migration away from relational and factors to consider when making the decision to move. We will then discuss different approaches for implementing the migration and how to plan, estimate and manage the work along with key risks and gotchas you may encounter.

Jason Kusar’s talk on dealing with latency in a globally distributed cluster:

Jason Kusar, Senior Software Engineer, Vistronix: Global Deployments with Bad Comms

We have been running a truly global deployment for 3+ years now. With datacenters in D.C., England, Europe, and Japan, we would be hard pressed to span more distance without visiting Antarctica. Bandwidth in and out of our datacenters varies greatly, as does latency. Comms are not always 100% reliable either. This was one of the main things that led us to choose Cassandra. This talk will cover many of the lessons learned and best practices for both preventing issues and recovering from them in the real world.

Finally, Marcos Ortiz on Cassandra in a .NET environment is timely given the new level of support for Windows in Cassandra 2.2:

Marcos Ortiz, Open Source Advocate, UCI: Inserting Apache Cassandra in a .Net environment

This talk highlights how my team migrated from SQL Server to Cassandra in a .Net environment, and how we dealt with code refactoring, Cassandra optimizations on Windows servers, and more.

How To Convince Your Boss To Send You To Cassandra Summit 2015

July 17, 2015


So, your boss needs some persuading to agree to send you to the Cassandra Summit, and “the fact that it will be the largest NoSQL conference on the planet, with 3,500+ attendees” is not convincing enough. Are the sessions, training, and events really worth the cost and the time away from your desk?

The answer is absolutely yes! But if that’s not enough of a reason, here are four ways to help you convince your boss:

Cassandra Is The Proven Technology

Nothing speaks louder than the technology itself. The Apache Cassandra project is heading into its sixth year and continues to be the best NoSQL database for modern online applications. If your boss is not familiar with NoSQL or Cassandra technology (well, they really should be by now), here’s a quick snapshot: Apache Cassandra is an open-source distributed database management system built for today’s Web, mobile, and IoT applications. It is built for managing large amounts of dynamic data across many commodity servers, while providing around-the-clock availability and no single point of failure.

Cassandra offers capabilities that relational databases and other NoSQL databases simply cannot match, such as continuous availability, linear scale performance, operational simplicity, and easy data distribution across multiple data centers and cloud availability zones. You can find more details about the benefits of Cassandra here.

Many companies have successfully deployed and benefited from Apache Cassandra, including some large companies such as Apple, Comcast, Instagram, Spotify, eBay, Rackspace, Netflix, and many more. The larger production environments have petabytes of data in clusters of over 75,000 nodes.

 

Unlimited Learning Opportunities with Cassandra Experts From Around The World

At the heart of the Cassandra Summit are the top-notch sessions, use cases and trainings delivered by the best Cassandra experts from around the world. Our speaking committee is in hack mode reviewing the more than 200 submissions we have received to make sure the best quality of content will be delivered at the Summit this year. For a little taste of what you can expect, here are some sessions from the Summit last year presented by Cassandra users from Apple, Sony, ING, Netflix, Instagram, Activision Blizzard, Databricks, Target, The Weather Channel and Credit Suisse.

There will be 10 different tracks at the Cassandra Summit this year, covering topics from analytics to global deployments to theory. Whether you are new to Cassandra or already familiar with it, whether you are a developer, architect or administrator, and whether you are interested in tutorials or best practices, we will have something tailored just for you.

And this year’s Summit just got even better. We will be partnering with O’Reilly Media to proctor exams for Cassandra Developer / Architect / Administrator certification at the Summit. Attend five hours of training on Cassandra data modeling, internals and architecture with DataStax experts before taking your test. Successful participants will receive their certificates at the event.

Stay tuned for more details on sessions, training and certification. And mix and match to rock your Summit experience.

 

A Networking Pageant

 

No other event would be a better place to meet with your peers from the Cassandra community and big data professionals across all industries. The Cassandra Summit is the place to be! Over the years, we’ve seen a huge surge in both the growth of the community and attendance at the event itself. If you were one of the first 130+ attendees at the first Cassandra Summit in 2010, you will surely be amazed when we bring together over 3,500 attendees this year.

Meet like-minded people, catch up with old friends and make new ones at the largest NoSQL event in the world.

 

Source of Inspiration and Innovation

 

Take a look back at our Summit keynote from last year.  Sony spoke about their vision for the future of gaming, and Orbeus introduced their game-changing facial recognition technology.  Cassandra Summit is full of innovative ideas that will truly inspire you to do amazing things.  Stay tuned for our Summit keynote this year – we promise you won’t be disappointed.

Another highlight of the Summit is Cassandra Live – you know you can’t miss it. Cassandra Live is a dream come true, where we showcase real-world applications built on Cassandra. Last year, we had engineers from companies like Instagram, Hulu and Spotify share their stories with you in person. Cassandra Live is where you can see and experience the technology yourself and hang out with the engineers who built it. How cool is that? A little sneak peek into this year’s Cassandra Live? Well, there will be a lot of Connected Things and personalized apps to keep you busy.

So how can you explain all of this to your boss in a way that will ensure you are able to attend?  We’ve crafted a letter you can grab, personalize, and send to your boss to justify your trip. It doesn’t get any easier.  See you at Cassandra Summit 2015!
