Graph Databases and Scalable High Volume Data Relationships

Posted May 17th, 2016

At Geofeedia, we pride ourselves on being a company that has never committed to a particular technology. Rather, we choose the technology that best fits the use case at the time of implementation. As a developer, this is an amazing quality to find in a company. What does it mean for devs? Our leadership puts faith in us to drive our product in the right technical direction with the latest and greatest technologies. This post focuses on one of our recent endeavors: implementing an Enterprise Security architecture for our application using a graph database.

What is a Graph Database?

A graph database is a database architecture that uses a simple graph structure of vertices and edges, each of which can carry properties, to represent highly complex data relationships. This graph structure allows for a straightforward implementation of semantic queries. Most graph databases are considered a type of NoSQL data store. NoSQL databases are known for their scalability across clusters, their flexibility, and the simplicity of their data models (for example, key-value stores or wide column stores). A graph database takes this NoSQL concept, with all of its advantages, to the next level by allowing complex, easily scalable relationships across a high-volume dataset.
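
To make the vertex, edge, and property vocabulary concrete, here is a minimal Python sketch of a property graph held in memory. It is only an illustration of the data model, not how any particular graph database stores data; the class names, field names, and the "since" property are assumptions made up for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Vertex:
    """A node in the graph; `properties` holds arbitrary key/value data."""
    id: int
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    """A directed, labeled relationship that can carry its own properties."""
    label: str
    out_v: Vertex  # the vertex the edge comes out of
    in_v: Vertex   # the vertex the edge points into
    properties: Dict[str, Any] = field(default_factory=dict)


# Hypothetical sample data: one person, one city, one relationship.
person = Vertex(1, "Person", {"Name": "Person 3"})
city = Vertex(2, "City", {"Name": "Indianapolis"})
lives_in = Edge("Lives_In", person, city, {"since": 2014})  # "since" is invented

print(lives_in.out_v.properties["Name"], "->", lives_in.in_v.properties["Name"])
```

Note that both the vertices and the edge carry a property map, which is what distinguishes a property graph from a bare node-and-link structure.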

Simple graph visual example

Above is a simple representation of a snippet of what a graph database of people and cities may look like. There are two types of vertices here: People and Cities. There are three relationships: Friends_With, Lives_In, and Travels_To. Let’s say Person 5 wants to meet a new friend in Indianapolis the next time they travel there. We see that Person 3 lives there, so let’s write a traversal to discover this relationship. We will use the Gremlin traversal language (which we use in our implementation) to structure a very basic query. In this query, note that g is the graph, V means vertex, outE means outgoing edges, and inE means incoming edges. inV and outV step from an edge to a vertex: inV moves to the vertex the edge points into, and outV moves to the vertex the edge comes out of:

 g.V().has('Name', 'Person 5').outE('Travels_To').inV().inE('Lives_In').outV().next();

The previous query essentially states that we want to start at the Person 5 vertex, follow all outgoing edges labeled Travels_To, get the vertices on the other end of those edges, then follow the incoming edges labeled Lives_In, and finally grab the vertices those edges come from. The closing next() returns the first result. Since there is only one vertex at each step of this traversal, it leads directly to Person 3, who lives in Indianapolis. Of course, if you scale this out to many vertices connected by many more edges, you can start to see how this traversal strategy comes into play.
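
For readers without a Gremlin console handy, the same walk can be imitated over a plain list of edges in Python. This is a rough sketch of what each step does, not how Gremlin or Titan executes a traversal. The edge list is an assumption reconstructed from the description above (the extra Lives_In and Friends_With edges are invented filler), and the helper names simply mirror the Gremlin steps.

```python
from collections import namedtuple

# An edge leaves out_v and points into in_v.
Edge = namedtuple("Edge", ["out_v", "label", "in_v"])

# Hypothetical edges; only the first two come from the example in the text.
EDGES = [
    Edge("Person 5", "Travels_To", "Indianapolis"),
    Edge("Person 3", "Lives_In", "Indianapolis"),
    Edge("Person 5", "Lives_In", "Chicago"),
    Edge("Person 5", "Friends_With", "Person 1"),
]


def out_e(vertices, label):
    """Like outE(label): edges with this label leaving any of the vertices."""
    vs = set(vertices)
    return [e for e in EDGES if e.out_v in vs and e.label == label]


def in_e(vertices, label):
    """Like inE(label): edges with this label pointing into any of the vertices."""
    vs = set(vertices)
    return [e for e in EDGES if e.in_v in vs and e.label == label]


def in_v(edges):
    """Like inV(): the vertices the edges point into."""
    return [e.in_v for e in edges]


def out_v(edges):
    """Like outV(): the vertices the edges come out of."""
    return [e.out_v for e in edges]


start = ["Person 5"]
cities = in_v(out_e(start, "Travels_To"))    # where Person 5 travels
residents = out_v(in_e(cities, "Lives_In"))  # who lives in those cities
print(residents)  # ['Person 3']
```

Each helper takes the output of the previous one, which is the same pipeline shape as the chained Gremlin steps.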

Our Graph Database Implementation

So why did we choose a graph database to drive our Enterprise Security infrastructure over a SQL implementation? Many variables factored into this decision, but there were two main ones: speed and scalability. Our Enterprise implementation has a multitude of relationships that offer a granular level of permissions. At our lowest level, we have business objects (locations, recordings, other users, etc.) on which specific users need specific permissions to perform CRUD operations. If you have 100 users and 100 locations in an account, and you want 50 of those users to have access to 20 of those locations, that is already 50 × 20 = 1,000 edges for those specific permissions. The edges define the level of permissions that the users have. Even from this basic example, you can see that the numbers multiply quickly as you scale out.
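
The edge count in that example is easy to check with a small Python sketch that creates one permission edge per (user, location) pair. This only illustrates the arithmetic and the idea of permissions living on edges; the identifiers, the choice of which users and locations are involved, and the "read"/"update" permission set are all assumptions for the example, not our actual schema.

```python
from itertools import product

users = [f"user-{i}" for i in range(100)]
locations = [f"location-{i}" for i in range(100)]

# Assumption: the first 50 users are granted access to the first 20 locations.
permitted_users = users[:50]
shared_locations = locations[:20]

# One edge per (user, location) pair; the edge's value is the set of
# operations that user may perform on that location (hypothetical values).
edges = {
    (user, location): {"read", "update"}
    for user, location in product(permitted_users, shared_locations)
}


def can(user, action, location):
    """True if a permission edge exists and it allows the action."""
    return action in edges.get((user, location), set())


print(len(edges))  # 1000
print(can("user-0", "read", "location-0"))   # True
print(can("user-99", "read", "location-0"))  # False: no edge for this user
```

Doubling either the number of permitted users or the number of shared locations doubles the edge count, which is why the totals grow so fast.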

The graph database allows us to traverse those relationships quickly, adding minimal overhead to our queries while letting users access the business objects they have specific permissions to. Imagine doing something like this with a SQL database. It could certainly be done, but what are we really trying to achieve? We want to traverse a path and retrieve a dynamic set of data along that path describing the traversal. SQL is built around retrieving sets of rows rather than walking paths, so it does not properly fit our use case. Beyond that, SQL columns, indices, and overall structure are more rigid than what we need for a feature like ever-expanding permissions.

Scalability comes into consideration when we talk about adding accounts, adding users to those accounts, and adding more business objects created by those users. Beyond that, the users need to be able to share those business objects with one another and with organizational groups created within their account. The level of complexity scales out quite rapidly. Further, our application has a rapidly growing feature set for which we need to manage permissions. Using the graph, we can easily add new features and support them at scale instead of having to add new columns to our SQL Account table. If we chose to normalize the Account table with settings in our SQL implementation, we would be sacrificing performance by doing numerous joins to achieve something that we can achieve with a simple traversal along a path in the graph.
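
As a rough illustration of why new features are cheap to add, the Python sketch below keeps permissions as labeled edges in an in-memory adjacency map and resolves access either directly or through one group hop. It is a toy model under stated assumptions, not our production schema: the edge labels Member_Of, Can_View, and Can_Export, the user and group names, and the accessible helper are all hypothetical.

```python
from collections import defaultdict

# graph[vertex][edge_label] -> set of vertices that edge label points to
graph = defaultdict(lambda: defaultdict(set))


def add_edge(out_v, label, in_v):
    """Add a directed, labeled edge from out_v to in_v."""
    graph[out_v][label].add(in_v)


def accessible(user, permission_label):
    """Objects the user reaches directly or through one group membership hop."""
    direct = set(graph[user][permission_label])
    via_groups = set()
    for group in graph[user]["Member_Of"]:
        via_groups |= graph[group][permission_label]
    return direct | via_groups


add_edge("alice", "Member_Of", "analysts")
add_edge("analysts", "Can_View", "location-1")
add_edge("alice", "Can_View", "recording-7")

# Supporting a new feature only takes a new edge label; nothing else changes.
add_edge("analysts", "Can_Export", "location-1")

print(sorted(accessible("alice", "Can_View")))    # ['location-1', 'recording-7']
print(sorted(accessible("alice", "Can_Export")))  # ['location-1']
```

In a relational design the equivalent change would typically mean a new column or a new join table, whereas here the lookup code is untouched.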

Which Graph Implementation

We chose to use Titan backed by Cassandra. Both Titan and Cassandra are free and open source. Some main benefits of this implementation are the price, continuous availability with no single point of failure, no read/write bottlenecks, and elastic scalability for the addition and removal of machines. We chose to interact with Titan through a NodeJS microservice we dubbed “Bouncer,” as it acts as a sort of bouncer protecting our application. Not only were we able to develop the service rapidly in Node, but it is also lightweight, easy to maintain, and benefits from the strong community behind NodeJS.

Challenges and Looking to the Future

Constantly pursuing the “bleeding edge” of technology comes with its own costs and risks. Innovation poses a risk because with it we often find a lack of documentation, a lack of issue support, and a “blank slate” of implementation possibilities. With that in mind, we still look to push further, learn together, and make something unique and awe-inspiring along the way. This is why we are always on the lookout for brilliant minds and like-minded individuals willing to push our technology stack forward. If you think you are up to that challenge, follow the link below to find the right opportunity for you at Geofeedia!

By: Will Jaynes

Geofeedia Careers