August 12th, 2014

User Defined Functions in Cassandra 3.0

Release 3.0 of Apache Cassandra will bring a new cool feature called User Defined Functions (UDF). Yes – users can write code that is executed inside Cassandra daemons.

UDFs are implemented by stateless code. By stateless I mean that a UDF implementation has just its input arguments to rely on. There is nothing like a shared state or execution context.

The programming language is your own choice. Java source support is built-in and even script languages that have JSR-223 support (javax.script) can be used.

So – the following statement creates a simple UDF which calculates some math:

It can be used in a query like this:

UDFs are not restricted to a single argument. Pass in as many arguments you need.

The final 3.0 release will support all all CQL types including user types, tuple types and collections as argument and return types. Currently, user+tuple types and collections are not supported.

What are UDFs good for?

Basically you can move some simple calculations to the Cassandra cluster. But that alone is not a huge win. UDFs are going to use used for functional indexes or aggregation to name a few. So stay tuned :)

Namespaces

The function names used in the previous examples share the same context as built-in CQL functions like token() or now(). This may cause collisions in the future. For example, you created a UDF called foobar and we decide to add a CQL function which is also called foobar. In this case, the CQL function will be preferred. A warning is logged at Cassandra startup. Your only option is to drop that UDF – there is no way to execute it. Bad, of course.

To avoid such name collisions, use UDF namespaces. You can use your company’s name, your project name or whatever is unique for you. Referring to the example at the beginning, let’s assume the namespace super_math:

Note that namespaces are just logical constructs to group UDFs.

UDF Overloading

UDFs allows to use the same function name with different signatures. If you want to code an adder UDF that takes 2, 3 and 4 arguments, create three UDFs:

So each of the following SELECTs works:

Languages

All above examples assume, that the compiled byte code is in the class path of Cassandra since they directly refer to class and method names.

But you can also define UDFs using Java source and JSR-223 script languages.

Input and output

Input parameters and the return value always use the boxed Java types (Integer notintDouble not double, etc) to be able to indicate null value.

CQL and Java type comparison

The last five types will be implemented as part of CASSANDRA-7563.

Java source

Java source example:

But Java source has a restriction – you cannot use boxing and unboxing. We did not implement our own Java compiler – we just use javassist. And javassist does not support boxing and unboxing.

The code is distributed to all Cassandra nodes – you do not need to install Jar files manually.

JSR-223 Script languages

You want to implement your UDFs in Javascript or Groovy?

Javascript is supported out-of-the box. To use Groovy or any other JSR-223 language you have to install the language and it’s JSR-223 provider on each Cassandra node.

Example that implements the willingly used sin function in Javascript:

 

Java bytecode (classes)

Distributing UDF code as classes (jar files) is possible but not recommended. You have to keep all Cassandra nodes in sync with the correct version of the UDF implementation. If only one node does not have the correct implementation, you’ll get into trouble (with your users).

Your UDF method implementations must be public static methods. By default the method is looked up in the specified class using the UDF name (without the namespace). To use another method, append a # and the name of the method to the UDF parameter. The following example looks up the method doubleAdderFor2 in classmy.package.MyAdder:

Schema Replication

UDFs are part of the system schema and transparently migrated to all other nodes in your cluster.

The new table in system keyspace is schema_functions.

CQL Syntax

namespace = UDF namespace
name = UDF name
OR REPLACE = replace the function if it already exists
[NON] DETERMINISTIC = mark a UDF as deterministic or non-deterministic – required for future use in functional indexes. Default is deterministic.
IF NOT EXISTS = only try to create function if it does not already exist
argumentName = name of the argument – used by Java and scripting languages
argumentType = argument type type, only CQL types are supported
returnType = return type type, only CQL types are supported
javaClassAndMethod = fully qualified classname – optionally postfixed with ‘#’ and method name
language = either java or any installed JSR-223 language name like javascript – the language name is not quoted
body = UDF implementation code

namespace = UDF namespace
name = UDF name
IF EXISTS = only try to drop function if it does not already exist

Note: DROP FUNCTION drops all overloads with the same name.

Test your UDF implementations!

Bad UDF implementations may be a great possibility to slow down your cluster. So please test your UDF implementations carefully using unit tests that test your UDFs using any possible input parameter combination. For example also try null, Double.NaN or Infinity as parameters to see what happens.

Security

Without authentication and authorization enabled, everyone can inject code into your Cassandra cluster. You can either trust all your users or enable authentication and authorization and grant permissions only to trusted users. Currently only global CREATEand DROP privileges apply for the CREATE FUNCTION and DROP FUNCTIONstatements. But this is subject to change and going to be improved (CASSANDRA-7557).

References

CASSANDRA-7395 Support for pure user-defined functions (UDF)
CASSANDRA-7562 Java source code for UDFs
CASSANDRA-7526 Defining UDFs using scripting language directly from CQL
CASSANDRA-4914 Custom aggregate and filtering functions in CQL
CASSANDRA-7458 functional indexes
CASSANDRA-7391 Partial indexes
CASSANDRA-7563 UserType, TupleType and collections in UDFs
CASSANDRA-7557 User permissions for UDFs

Author

User Defined Functions in Cassandra 3.0” was created by Robert Stupp, Cologne, Germany
snazy@snazy.de
@snazy

LinkedIn