August 6th, 2014

Init

As mentioned in my portacluster system imaging post, I am performing this install on 1 admin node (node0) and 6 worker nodes (node[1-6]) running 64-bit Arch Linux. Most of what I describe in this post should work on other Linux variants with minor adjustments.

Overview

When assembling an analytics stack, there are usually myriad choices to make. For this build, I decided to build the smallest stack possible that lets me run Spark queries on Cassandra data. As configured it is not highly available since the Spark master is standalone. (note: Datastax Enterprise Spark’s master has HA based on Cassandra). It’s a decent tradeoff for portacluster, since I can run the master on the admin node which doesn’t get rebooted/reimaged constantly. I’m also going to skip HDFS or some kind of HDFS replacement for now. Options I plan to look at later are GlusterFS’s HDFS adapter and Pithos as an S3 adapter. In the end, the stack is simply Cassandra and Spark with the spark-cassandra-connector.

Responsible Configuration

For this post I’ve used my perl-ssh-tools suite. The intent is to show what needs to be done and one way to do it. For production deployments, I recommend using your favorite configuration management tool.

perl-ssh-tools uses a configuration similar to dsh, which uses simple files with one host per line. I use two lists below. Most commands run on the fleet of workers. Because cl-run.pl provides more than ssh commands, it’s also used to run commands on node0 using its –incl flag e.g.cl-run.pl --list all --incl node0.

machines.all is the same with node0 added.

Install Cassandra

My first pass at this section involved setting up a package repo, but since I don’t have time to package Spark properly right now, I’m going to use the tarball distros of Cassandra and Spark to keep it simple. joschi maintains a package on the AUR but I have chosen not to use it for this install. I’m also using the Arch packages of OpenJDK, which isn’t supported by Datastax, but works fine for hacking. The JDK is pre-installed on my Arch image, it’s as simple as sudo pacman -S extra/jdk7-openjdk.

First, I downloaded the Cassandra tarball from apache.org to node0 in /srv/public/tgz. Then on the worker nodes, it gets downloaded and expanded in /opt.


 

To make it easier to do upgrades without regenerating the configuration, I relocate the conf dir to /etc/cassandra to match what packages do. This assumes there is no existing /etc/cassandra.


 

I will start Cassandra with a systemd unit, so I push that out as well. This unit file runs Cassandra out of the tarball as the cassandra user with the stdout/stderr going to the systemd journal (view with journalctl -f). I also included some ulimit settings and bump the OOM score downwards to make it less likely that the kernel will kill Cassandra when out of memory. Since we’re going to be running two large JVM apps on each worker node, this unit also enables cgroups so Cassandra can be given priority over Spark. Finally, since the target machines have 16GB of RAM, the heap needs to be set to 8GB (cassandra-env.sh calculates 3995M which is way too low).

Since all Cassandra data is being redirected to /srv/cassandra and it’s going to run as the cassandra user, those need to be created.


 

Configure Cassandra

Before starting Cassandra I want to make a few changes to the standard configurations. I’m not a big fan of LSB so I redirect all of the /var files to /srv/cassandra so they’re all in one place. There’s only one SSD in the target systems so the commit log goes on the same drive.

I configured portacluster nodes to have a bridge in front of the default interface, making br0 the default interface.


 

The default log4-server.propterties has log4j printing to stdout. This is not desirable in a background service configuration, so I remove it. The logs are also now written to /srv/cassandra/log.


 

And with that, Cassandra is ready to start.


 

Installing Spark

The process for Spark is quite similar, except that unlike Cassandra, it has a master.

Since I’m not using any Hadoop components, any of the builds should be fine so I used the hadoop2 build.

 

Create /srv/spark and the spark user.


 

Configuring Spark

Many of Spark’s settings are controlled by environment variables. Since I want all volatile data in /srv, many of these need to be changed. Spark will pick up spark-env.sh automatically.

The Intel NUC systems I’m running this stack on have 4 cores and 16G of RAM, so I’ll give Spark 2 cores and 4G of memory for now.

One line worth calling out is the SPARK_WORKER_PORT=9000. It can be any port. If you don’t set it, every time a work is restarted the master will have a stale entry for a while. It’s not a big deal but I like it better this way.

spark-submit and other tools may use spark-defaults.conf to find the master and other configuration items.

 

The systemd units are a little less complex than Cassandra’s. The spark-master.service unit should only exist on node0, while every other node runs spark-worker. Spark workers are given a weight of 100 compared to Cassandra’s weight of 1000 so that Cassandra is given priority over Spark without starving it entirely.


 

The master unit is similar and only gets installed on node0. Since it is not competing for resources, there’s no need to turn on cgroups for now.


 

Now deploy all of these configs. Relocate the spark config into /etc/spark and copy a couple templates, then write all the files there. spark-env.sh goes on all nodes. The unit files are described above. Finally, a command is run to instruct systemd to read the new unit files.


 

With all of that done, it’s time to turn on Spark to see if it works.


 

Now I can browse to the Spark master webui.

screenshot

Installing spark-cassandra-connector

The connector is now published in Maven and can be installed easiest using ivy on the command line. Ivy can pull all dependencies as well as the connector jar, saving a lot of fiddling around. In addition, while ivy can download the connector directly, it will end up pulling down all of Cassandra and Spark. The script fragment below pulls down only what is necessary to run the connector against a pre-built Spark.

This is only really needed for the spark-shell so it can access Cassandra. Most projects should include the necessary jars in a fat jar rather than pushing these packages to every node.

I run these commands on node0 since that’s where I usually work with spark-shell. To run it on another machine, Spark will have to be present and match the version of the cluster, then this same process will get everything needed to use the connector.


 

Using spark-cassandra-connector With spark-shell

All that’s left to get started with the connector now is to get spark-shell to pick it up. The easiest way I’ve found is to set the classpath with –driver-class-path then restart the context in the REPL with the necessary classes imported to make sc.cassandraTable() visible.

The newly loaded methods will not show up in tab completion. I don’t know why.


 

It will print a bunch of log information then present scala> prompt.


 

Now that the context is stopped, it’s time to import the connector.


 

To make sure everything is working, I ran some code I’m working on for my 2048 game analytics project. Each context gets an application webui that displays job status.

screenshot

Conclusion

It was a lot of work getting here, but what we have at the end is a Spark shell that can access tables in Cassandra as RDDs with types pre-mapped and ready to go.

There are some things that can be improved upon. I will likely package all of this into a Docker image at some point. For now, I need it up and running for some demos that will be running on portacluster at OSCON 2014.

Installing the Cassandra/Spark OSS Stack” was created by Al Tobey, Open Source Mechanic at DataStax

LinkedIn