Back that Cass Up

Posted August 24th, 2016

Here at Geofeedia we love to take design and architecture cues from Netflix. One such cue is the adoption and use of Cassandra as a primary datastore. Apache describes Cassandra as,

…the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra’s support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.

This post is the first in a series discussing how we at Geofeedia are pushing Cassandra and proving out that scalability and fault tolerance. In particular, this post focuses on our strategy for running a Cassandra cluster that is not only cross-region in the cloud but also cross-cloud, as well as on our efforts to implement a consistent and reliable backup strategy.

Adventures in running cross-cloud

All of the infrastructure we run at Geofeedia is in the cloud. We use a mix of managed services from cloud providers like Amazon Web Services and Google Compute Engine, as well as a large number of open source technologies such as Docker, Kubernetes, RabbitMQ, and of course Cassandra. In order to maintain highly available distributed systems, we run many of them across both the AWS and GCE clouds. Cassandra is no exception, and we run multiple clusters with nodes distributed between those clouds.

In order to know how to distribute replicas of data among its nodes, Cassandra uses a mechanism referred to as a snitch. Essentially a snitch, much like its namesake, provides information that lets Cassandra determine which data centers and racks are to be written to and read from. According to DataStax,

Snitches inform Cassandra about the network topology so that requests are routed efficiently and allows Cassandra to distribute replicas by grouping machines into data centers and racks

DataStax, which provides a cloud-based enterprise offering for Cassandra, documents several snitches one can use for a cluster, such as RackInferringSnitch, PropertyFileSnitch, EC2MultiRegionSnitch, and GoogleCloudSnitch.

As mentioned before, we strive to be not only multi-region in a single cloud but also multi-cloud. None of the DataStax-provided snitches were well suited to those multi-cloud requirements, and after some online searching we were unable to find any examples of what we are calling a MultiCloudSnitch. So we decided to write our own, combining aspects of the EC2MultiRegionSnitch and GoogleCloudSnitch to create our desired MultiCloudSnitch.

Currently our snitch supports AWS and GCE, since those are the clouds we use today, but it is fully extensible and an arbitrary number of clouds can be added with minimal effort. We follow a logical pattern for our datacenter naming: <CLOUD>-<REGION><DC_SUFFIX>. The DC_SUFFIX is defined in the Cassandra configuration and is useful when supporting multiple datacenters. An example datacenter name for our GCE cluster would be gce-us-east1_general. We then use the cloud availability zone the node is running in as the rack name.
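To make that concrete, the snippet below sketches what the relevant configuration might look like on one of our GCE nodes. The snitch class name here is a hypothetical stand-in for our implementation, and we are assuming the suffix is supplied the same way the stock cloud snitches read it, via the dc_suffix property in cassandra-rackdc.properties.

# cassandra.yaml (illustrative): point each node at the custom snitch class
endpoint_snitch: com.geofeedia.MultiCloudSnitch

# cassandra-rackdc.properties (illustrative): suffix appended to the datacenter name
dc_suffix=_general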

Our implementation of the MultiCloudSnitch is open source and available to hack on as needed.

Developing a reliable and consistent backup strategy

All database admins and IT operations teams out there are aware of the common adage that goes something like,

All those backups seemed a waste of pay, now my database has gone away

As at most companies, Geofeedia's data is its lifeblood, and without it we would be in serious trouble. As such, we take our backing datastore architecture seriously and our strategies for backing up that data equally seriously. When we made the move to Cassandra, we began developing that backup strategy. Out of the box, Cassandra provides a tool aptly named nodetool that is indispensable when administering production-ready Cassandra clusters. Nodetool provides a snapshot command that, you guessed it, takes a snapshot of the data of one or more keyspaces, or of a single table. We run these snapshots nightly, and nodetool saves them on the local file system in the same place as the actual data itself, under backups and snapshots directories.

Our first attempt was to write a tool in Go that watched for inotify events on the file system and then uploaded newly added snapshots and backups to either S3 in AWS or GCS in GCE, depending on the location of the node. Unfortunately this approach was not quite reliable enough for us because of an inherent race condition: it was possible that our tool had not yet set up a watcher on a newly created snapshot before the inotify event was fired. Like our MultiCloudSnitch, this tool is also open source and can be referenced if desired.
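For context, the nightly snapshot itself is just a single nodetool invocation, and its output lands alongside the live data, which is what any backup tooling on the node then has to ship off to object storage. A rough sketch, where the keyspace name, tag, and paths are illustrative placeholders:

# Illustrative nightly snapshot; the keyspace name and tag are placeholders
$ nodetool snapshot -t nightly some_keyspace
# The snapshot files land next to the live sstables, roughly:
#   /data-dir/<keyspace>/<table>/snapshots/nightly/...
# Incremental backups, if enabled, accumulate under:
#   /data-dir/<keyspace>/<table>/backups/...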

Our second attempt was much simpler and just takes advantage of the AWS and GCE command line tools for the respective clouds. Examples of the commands we run nightly can be seen below:

# GCE: parallel (-m), recursive rsync of the data directory, excluding files matching "tmp.db"
$ gsutil -m rsync -r -x "tmp.db" /data-dir gs://gce-cass-backups/$(hostname -s)
# AWS: sync the data directory to a per-host prefix in S3
$ aws s3 sync /data-dir/ s3://aws-cass-backups/$(hostname -s)

This was extremely successful and much more reliable than our original tool. In the end, KISS came out on top, as it often does.
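Scheduling the nightly runs can be just as simple; a cron entry along the following lines would do it (the time and paths here are illustrative, not necessarily what we run):

# Illustrative crontab entry: sync the Cassandra data directory to GCS every night at 02:00
0 2 * * * gsutil -m rsync -r -x "tmp.db" /data-dir gs://gce-cass-backups/$(hostname -s)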

The main downside to this simpler approach is that it requires the Cassandra nodes to have either the AWS or GCE command line tools installed, but compared to the complexity involved in writing our own tool, that is a very small tradeoff.

Conclusions

Ultimately, when we made the decision to move toward using Cassandra as a distributed datastore, we knew we would need a way to run across multiple clouds as well as the ability to maintain and administer a reliable data backup strategy. Fortunately we were able to accomplish both and have been very pleased thus far. While this post focused on how the DevOps team handles our use of Cassandra in an enterprise distributed systems environment, future posts will include more discussion through the lens of some of our extremely talented developers.

If this post piqued your interest, please check out our open positions! We’re always looking for additions to our awesome development team.

By: Austen Lacy