How To Convince Your Boss To Send You To Cassandra Summit 2015

July 17, 2015

So, your boss needs some persuading to send you to the Cassandra Summit, and the fact that “it will be the largest NoSQL conference on the planet, with 3,500+ attendees” is not convincing enough. Are the sessions, training, and events really worth the cost and the time away from your desk?

The answer is absolutely yes! But if that’s not enough of a reason, here are four ways to help you convince your boss:

Cassandra Is The Proven Technology

Nothing speaks louder than the technology itself. The Apache Cassandra project is heading into its sixth year and continues to be the best NoSQL database for modern online applications. If your boss is not familiar with NoSQL or Cassandra technology (they really should be by now), here’s a quick snapshot: Apache Cassandra is an open-source distributed database management system built for today’s Web, mobile, and IoT applications. It is built for managing large amounts of dynamic data across many commodity servers, while providing around-the-clock availability and no single point of failure.

Cassandra offers capabilities that relational databases and other NoSQL databases simply cannot match, such as continuous availability, linear scale performance, operational simplicity, and easy data distribution across multiple data centers and cloud availability zones. You can find more details about the benefits of Cassandra here.

Many companies have successfully deployed and benefited from Apache Cassandra, including large companies such as Apple, Comcast, Instagram, Spotify, eBay, Rackspace, Netflix, and many more. The larger production environments have petabytes of data in clusters of over 75,000 nodes.

 

Unlimited Learning Opportunities with Cassandra Experts From Around The World

At the heart of the Cassandra Summit are the top-notch sessions, use cases and trainings delivered by the best Cassandra experts from around the world. Our speaking committee is in hack mode reviewing the more than 200 submissions we have received to make sure the best quality of content will be delivered at the Summit this year. For a little taste of what you can expect, here are some sessions from the Summit last year presented by Cassandra users from Apple, Sony, ING, Netflix, Instagram, Activision Blizzard, Databricks, Target, The Weather Channel and Credit Suisse.

There will be 10 different tracks at the Cassandra Summit this year covering topics on analytics, global deployments and theory. Whether you are new to Cassandra or familiar with it, whether you are a developer, architect or administrator, and whether you are interested in tutorials or best practices, we will have something tailored just for you.

And this year’s Summit just got even better. We will be partnering with O’Reilly Media to proctor exams for Cassandra Developer / Architect / Administrator certification at the Summit. Attend five hours of training on Cassandra data modeling, internals and architecture with DataStax experts before taking your test. Successful participants will receive their certificates at the event.

Stay tuned for more details on sessions, training and certification. And mix and match to rock your Summit experience.

 

A Networking Pageant

 

No other event would be a better place to meet your peers from the Cassandra community and big data professionals across all industries. The Cassandra Summit is the place to be! Over the years, we’ve seen a huge surge in both the growth of the community and attendance at the event itself. If you were one of the 130+ attendees at the first Cassandra Summit in 2010, you will surely be amazed when we bring together over 3,500 attendees this year.

Meet like-minded people, catch up with old friends and make new ones at the largest NoSQL event in the world.

 

Source of Inspiration and Innovation

 

Take a look back at our Summit keynote from last year. Sony spoke about their vision for the future of gaming and Orbeus introduced their game-changing facial recognition technology. Cassandra Summit is full of innovative ideas that will truly inspire you to do amazing things. Stay tuned for our Summit keynote this year; we promise you won’t be disappointed.

Another highlight of the Summit is Cassandra Live – you know you can’t miss it. Cassandra Live is a dream come true, where we showcase some real-world applications built on Cassandra. Last year, we had engineers from companies like Instagram, Hulu and Spotify share their stories with you in person. Cassandra Live is where you can see and experience the technology yourself and hang out with the engineers who built it. How cool is that? A little sneak peek into this year’s Cassandra Live? Well, there will be a lot of Connected Things and Personalized apps to keep you busy.

So how can you explain all of this to your boss in a way that will ensure you are able to attend?  We’ve crafted a letter you can grab, personalize, and send to your boss to justify your trip. It doesn’t get any easier.  See you at Cassandra Summit 2015!

Win Free Training & Certification Tickets to Cassandra Summit 2015

July 13, 2015

With Cassandra Summit 2015 fast approaching, we’d like to give everyone a chance to win a free training & certification + standard conference pass to the big event, which allows you to secure your seats for training & certification + standard access to all conference sessions.

Starting July 13th, we will be launching the #CassandraCert contest via Twitter; we have 5 tickets to give away, over a 10 week period, leading up to the event (the contest is bi-weekly).

You can participate in this contest by following @PlanetCassandra and answering the technical questions posted using “#CassandraCert” in your tweet/reply. The first person to answer the question correctly will win a free training & certification + standard conference pass.

Have fun, and we’ll see you at the Summit!

Official Contest Rules: http://www.planetcassandra.org/planet-cassandra-summit-social-media-official-contest-rules/

Assemble! – Speaking Committee Assembled for Cassandra Summit 2015

July 9, 2015

Much like Nick Fury assembling the Avengers to take on the daunting, yet enjoyable, task of saving the earth from imminent destruction, we at DataStax assemble the ‘Speaking Committee’ to review abstracts and to save you from boring content. So who is this mysterious speaking committee and what do they do?

The speaking committee is mainly comprised of MVPs in the Apache Cassandra community, people like Christos Kalantzis, Brian O’Neill, Richard Low, Jason Brown, Evan Chan and Robbie Strickland, who have contributed to Cassandra and have presented at various conferences and previous Cassandra summits. There are around 30 Cassandra MVPs in the speaking committee and you can view last year’s complete MVP roster here: http://www.planetcassandra.org/mvps (we will be celebrating the 2015 – 2016 MVPs at this year’s summit). Also heavily involved in the speaking committee is the DataStax Evangelist team, headed up by Patrick McFadin – they have their finger on the pulse of the community, and have a good eye for topics that will resonate.

Way back in the old days of the Cassandra Summit (five years ago is old days, okay!) it was honestly challenging to find enough people to talk about Cassandra knowledgeably, and we would spend time encouraging and cajoling people into speaking. Often we would employ our co-founder, Matt Pfeil, to call some of his buddies in the community and exchange talks for beers. I like to think we did a pretty good job, as the quality of content has always been pretty high. Well, in the past two years we have had completely the opposite problem – way more submissions than slots to put them all in.

This year the summit is bigger than ever, with 120+ talks across 10 different tracks over two days; whatever your interest in Apache Cassandra and DataStax, there will be top-notch content just for you. Unfortunately, we are going to be saying no to a lot of people who have taken the time and energy to submit abstracts; we received over 200 abstracts this year, which is amazing, and such a testament to what a fantastic community we are fortunate to be part of. So, if you don’t get selected to present this year, don’t be too down: you are in good company, and we can help place your talks at other Cassandra-focused events like Cassandra Days or meet-ups.

So how do these 40 people select the talks to be presented? Our very own Al Tobey has built an app (backed by Cassandra of course!) that allows reviewers to look at the titles, abstracts and the bios of the speakers and vote on them. Reviewers can also add comments to their reviews.

A small sample of the abstracts loaded into the Al Tobey voting app

All the reviewers will need to be finished by July 15th and then we will tally up the scores and fill up the slots. On July 17th we will let all speakers know whether or not they have been accepted to present at Cassandra Summit this year.

Again a huge thank you to everyone who took the time to submit; we really appreciate it, and here’s to a fantastic Cassandra Summit 2015!

Fritz Richter Co-Founder & CTO at adsquare
"We use Spark Streaming with Cassandra to post-process every single bid-request in near-real time in the most scalable way."

I’m CTO & co-founder at adsquare, Europe’s leading data platform for mobile programmatic advertising.

Our platform supercharges data-driven targeting. With our solution, advertisers and agencies can leverage data to reach their desired audiences and meet campaign goals and on the other side, publishers and third-party providers can on-board and monetize their data.

I’m responsible for platform development and lead the technology and data science departments. I’m a born-and-bred Berliner and my area of expertise is scalable backend architectures and big data.

Cassandra with Spark and Kafka

At adsquare we’re using DataStax Community 2.1. We use Apache Cassandra to store aggregates and calculations from our real-time event processing system (Spark Streaming), which in turn are queried by various backend systems, and as the central data store for our tile-based infrastructure (billions of data points).

We use Spark Streaming with Cassandra to post-process every single bid-request in near-real time in the most scalable way. Events from all different data centers in the world are mirrored via Kafka, which is by default a nice distributed storage.

As scaling is important for us, we needed a fast and solid solution that can post-process data in terms of filtering, enriching, aggregating and storing. It was important for us to have horizontally scalable components in every piece of our architecture, so Spark Streaming was the perfect match alongside Kafka and Cassandra. The main advantage of Spark Streaming is that it’s easy to develop and maintain, it’s blazing fast compared to batch processing in Map/Reduce and, most importantly, it’s fun.

Why Cassandra

We evaluated Cassandra against Couchbase and MongoDB, which was already deployed in our system. In terms of motivational factors, to name a few, Cassandra gives us excellent write performance, a more flexible query language than many document-based databases, linear scaling to pretty much any size and performance requirement, support for multiple data centers, relatively low latency, no single point of failure, and excellent Spark integration!

Deployment

Our deployment currently resides in 1 data center, with 2 terabytes of data spread across 6 nodes. We’re just starting to use Cassandra and are replacing more and more existing databases and pure HDFS files with dedicated representations in Cassandra. We plan to roll out Cassandra in other data centers as well.

Cassandra gives us high performance, continuous updates, and querying of large time series datasets with strong transactional guarantees. Cassandra also offers way better querying support than a pure key-value store.

Real world recommendations

Definitely try to learn how Cassandra keys work by heart, think about key distribution and what your datasets will look like in a few months. Cassandra scales and performs well if used correctly, but it can’t magically fix your mistakes.

Always start by figuring out every single query your database needs to support and if you come from the world of relational databases, your brain will need to be rewired because the data is modelled in a fundamentally different way. Trust me, my brain was rewired the day I realized that what I’m actually dealing with is a REALLY fancy hash map.

Lastly, read articles and blog posts by people who run Cassandra in production to get real-world insights on tips, tricks and pitfalls and consider using compact storage, even on 2.1. For adsquare and me personally, Cassandra boasts a strong and helpful community, which is an invaluable source of information.

Join adsquare

We’re a technology-driven company and our asset is infrastructure. To improve our technology stack and infrastructure and build the most sophisticated and scalable data marketplace for mobile advertising, we make sure our team is made up of the best talent. We’re looking for experienced Big Data engineers, platform architects and high potentials who can help us build an outstanding platform. So if readers are interested in working for one of Berlin’s expanding, innovative tech startups they can check our current openings.

Synchronizing Clocks In a Cassandra Cluster, Pt. 2: Solutions

June 26, 2015

By Viliam Holub, Chief Technology Officer at Logentries
I am a researcher and a teacher at the University College Dublin in the School of Computer Science and Informatics, formerly from Distributed Systems Research Group, Charles University. My research interests cover formal verification and parallel and distributed model checking; however, I am interested in many other fields of software engineering and computer science. I studied Electronic Computer Systems from 1994 to 1998, where I got an electrotechnical background. Then I graduated in Software Engineering from the Faculty of Mathematics and Physics, Charles University, and finished my postgraduate study in the Department of Software Engineering. I like programming and I have programmed a lot so far. I’m fluent in C, C++, Perl, PHP, Java, and x86, I’m perfectly familiar (although I haven’t written as much code yet) with Python, Ruby, and ECMAScript, and I’m able to read a lot of others. I don’t feel stuck in one programming language.

This is the second part of a two part series. Before you read this, you should go back and read the original article, “Synchronizing Clocks In a Cassandra Cluster Pt. 1 – The Problem.” In it, I covered how important clocks are and how bad clocks can be in virtualized systems (like Amazon EC2) today. In today’s installment, I’m going to cover some disadvantages of off-the-shelf NTP installations, and how to overcome them.

Configuring NTP daemons

As stated in my last post, it’s the relative drift among clocks that matters most. Syncing independently to public sources will lead to sub-optimal results. Let’s have a look at the other options we have and how well they work. Desirable properties are:

  • Good relative synchronization: required for synchronization in the cluster
  • Good absolute synchronization: desirable or required if you communicate with external services or provide an API for customers
  • Reliability and high availability: clock synchronization should survive instance failure or certain network outages
  • Easy to maintain: it should be easy to add/remove nodes from the cluster without a need to change configuration on all nodes
  • Netiquette: while NTP itself is very modest in network bandwidth use, that’s not the case for public NTP servers, so you should reduce their load if feasible

Configure the whole cluster as a mesh

NTP uses a tree-like topology, but allows you to connect a pool of peers for better synchronization on the same stratum level. This is ideal for synchronizing clocks relative to each other. Peers are defined similarly to servers in /etc/ntp.conf; just use the “peer” keyword instead of “server” (you may combine servers and peers, but more about it later):
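A minimal sketch of such a peer configuration on node c0, assuming the hostnames c0 to c2 resolve inside the cluster and that the cluster sits in the 10.0.0.0/8 private range (the subnet is an illustrative assumption, not a value from the actual deployment):

```conf
# /etc/ntp.conf on node c0 (sketch; c1 and c2 list the other two nodes instead)
peer c1
peer c2

# Lift the default restrictions for the cluster's private network so peering works.
# The subnet is an assumption: replace it with your own range, do not copy it blindly.
restrict 10.0.0.0 mask 255.0.0.0
```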

We define that nodes c0-c2 are peers on the same layer and will be synchronized with each other. The restrict statement enables peering for a local network, assuming your instances are protected by a firewall against external access but reachable within the cluster. NTP communicates via UDP on port 123. Restart the NTP daemon on each node for the change to take effect.

Then check the resulting peer associations with ntpq -p.

This setting is not ideal, however. Each node acts independently and you have no control over which nodes will be synchronized to. You may well end up in a situation of smaller pools inside the cluster synchronized with each other, but diverging globally.

A relatively new orphan mode solves this problem by electing a leader that each node synchronizes to. Add the statement tos orphan 7 to /etc/ntp.conf on all nodes to enable orphan mode. The mode becomes active when no server with a stratum lower than 7 is reachable.
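Combined with a peer list, an orphan-mode configuration could be sketched as follows. The peer hostnames c1 and c2 are assumed example names; the stratum value 7 matches the threshold mentioned in the text:

```conf
# /etc/ntp.conf on node c0 (sketch)
peer c1
peer c2

# Enable orphan mode: when no server below stratum 7 is reachable,
# the nodes agree on one of themselves as the common time source.
tos orphan 7
```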

This setup will eventually synchronize clocks perfectly to each other. You are in danger of clock run-away however, and thus absolute time synchronization is suboptimal. NTP daemon handles missing nodes gracefully and therefore high availability is satisfied.

Maintaining the list of peer servers in NTP configuration and updating it with every change in the cluster is not ideal from a maintenance perspective. Orphan mode allows you to use broadcast or manycast discovery. Broadcast may not be available in a virtualized network and, if it is, don’t forget to enable authentication. Manycast works at the expense of maintaining a manycast server and reducing resilience against node failure.

  • + relative clocks (stable in orphan mode)
  • – absolute clocks (risk of run-away)
  • + high reliability (- for manycast server)
  • – maintenance (+ in auto-discovery orphan mode)
  • + low network load

Use external NTP server and configure the whole cluster as a pool

Given clock run-away as the main disadvantage in the previous option, what about enabling synchronization with external servers and improving relative clocks by setting up a pool across nodes?

The configuration in /etc/ntp.conf would look like this:
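A minimal sketch of this combined setup, assuming the public pool.ntp.org hostnames as external sources and c1, c2 as the other cluster nodes (all names and the subnet are illustrative assumptions):

```conf
# /etc/ntp.conf on node c0 (sketch)
# External time sources; the public pool hostnames are an assumption.
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst
server 2.pool.ntp.org iburst

# The other cluster nodes as peers.
peer c1
peer c2

# Allow peering inside the cluster network (assumed subnet).
restrict 10.0.0.0 mask 255.0.0.0
```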

As nice as it may look, this actually does not work as well as the previous option. You will end up with synchronized absolute clocks, but relative clocks will not improve. That’s because the NTP algorithm will treat the external time sources as more reliable than the peers in the pool and will not take the peers as authoritative.

  • – relative clocks
  • ? absolute clocks (similar to when all nodes are synchronized independently)
  • + high availability
  • – maintenance
  • – high network load

Configure centralized NTP daemon

The next option is to dedicate one NTP server (a core server, possibly running on a separate instance). This server is synchronized with external servers while the rest of the cluster will synchronize with this one.

Apart from enabling the firewall you don’t need any special configuration on the core server. On the client side you will need to specify the core instance name (let it be 0.ntp). The /etc/ntp.conf file must contain this line:
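A minimal sketch of that client-side line, assuming the name 0.ntp resolves to the core server from every cluster node (the iburst option is an optional addition that speeds up the initial synchronization):

```conf
# /etc/ntp.conf on every cluster node (sketch)
# 0.ntp is the core instance name used in the text.
server 0.ntp iburst
```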

All instances in the cluster will synchronize with just one core server and therefore one clock. This setup will achieve good relative and absolute clock synchronization. Given that there is only one core server, there is no high availability in case of instance failure. Using a separate static instance gives you flexibility during cluster scaling and repairs.

You can additionally set up the orphan mode among nodes in the cluster to keep relative clocks synchronized in case of core server failure.

  • + relative clocks
  • + absolute clocks
  • – high availability (improved in orphan mode)
  • – maintenance (in case the instance is part of the scalable cluster)
  • + low network load

Configure dedicated NTP pool

This option is similar to a dedicated NTP daemon, but this time you use a pool of NTP servers (core servers). Consider three instances 0.ntp, 1.ntp, 2.ntp, each running in a different availability zone with an NTP daemon configured to synchronize with external servers as well as with each other in a pool.

The configuration on one of the core servers, 0.ntp, would contain:
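A minimal sketch for core server 0.ntp, assuming the public pool.ntp.org hostnames as external sources and a 10.0.0.0/8 private network (both are illustrative assumptions):

```conf
# /etc/ntp.conf on core server 0.ntp (sketch)
# External time sources; the public pool hostnames are an assumption.
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst
server 2.pool.ntp.org iburst

# The other two core servers as peers.
peer 1.ntp
peer 2.ntp

# Allow peering and client queries from the internal network (assumed subnet).
restrict 10.0.0.0 mask 255.0.0.0
```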

 

Clients are configured to use all core servers, i.e. 0.ntp-2.ntp. For example, the /etc/ntp.conf file contains these lines:
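A minimal sketch of the client side, assuming the names 0.ntp to 2.ntp resolve from every cluster node (iburst is an optional addition):

```conf
# /etc/ntp.conf on every cluster node (sketch)
server 0.ntp iburst
server 1.ntp iburst
server 2.ntp iburst
```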

By deploying a pool of core servers we achieve high availability for the server side (partial network outage) as well as for the client side (instance failure). It also eases maintenance of the cluster since the pool is independent from the scalable cluster. The disadvantage lies in running additional instances. You can avoid running additional instances by using instances already available outside the scalable cluster (i.e. static instances) such as a database or mail server.

Notice that core servers experience some clock differences, as if each were separately synchronized with external servers. Setting them as peers will help in network outages, but not so much in synchronizing clocks relative to each other. Since you have no control over which core server a client will select as authoritative, relative clock synchronization between clients gets worse, although the differences are significantly lower than if all clients were synchronized externally.

One solution is to use the prefer modifier to alter NTP’s selection algorithm. Assume we would change the configuration on all clients:
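A minimal sketch of the changed client configuration, assuming 0.ntp is the core server all clients should favor:

```conf
# /etc/ntp.conf on every cluster node (sketch)
# "prefer" marks 0.ntp as the favored source while it is reachable and sane.
server 0.ntp iburst prefer
server 1.ntp iburst
server 2.ntp iburst
```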

Then all clients will synchronize to the 0.ntp node and switch to another one only if 0.ntp is down. Another option is to explicitly set increasing stratum numbers for all core servers, assuming that clients will gravitate towards servers with lower strata. That’s more of a hack, though.

  • + relative clocks
  • + absolute clocks
  • + high availability
  • + maintenance
  • + low network load
  • – requires static instances

Summary

If you are running a computational cluster you should consider running your own NTP server. Letting all instances synchronize their clocks independently leads to poor relative clock synchronization. It is also not considered good netiquette since you unnecessarily increase load on public NTP servers.

For a bigger, scalable cluster, the best option is to run your own NTP server pool externally synchronized. It gives you perfect relative and absolute clock synchronization, high availability, and easy maintenance.

Our own deployment synchronizes the clocks of all nodes to millisecond precision.

“Synchronizing Clocks In a Cassandra Cluster, Pt. 2: Solutions” was created by Viliam Holub, CTO at Logentries.
