Posted May 6th, 2016
At Geofeedia we are true believers in using the best tool for the job, and we don’t typically commit to a particular programming language, software framework, or cloud. This post focuses on that last point. Specifically, we have been operating in a multi-cloud environment spanning Amazon Web Services (AWS) and Google Compute Engine (GCE) for a little more than six months. While our initial motivation was cost savings, this ultra-high-availability approach has become a major component of our architecture plans moving forward.
First, some background
The Geofeedia Platform has always run entirely in the cloud. In the early days we used Rackspace, then transitioned to AWS several years ago. About a year ago we launched real-time analytics built on Elasticsearch. Elasticsearch is an awesome tool, but it requires a fair amount of time investment to truly optimize the configuration and hardware for your specific needs. In our case we have almost 2 billion social items dating back to 2011, and performing aggregations on a dataset of this size requires some serious hardware. At the time we launched our new solution, we knew we had to find a more cost-effective configuration, since this new cluster accounted for almost three-quarters of our overall cloud costs.
When seeking out a better long-term hosting solution for our Elasticsearch cluster we opened our search beyond just AWS for the first time. It was important to find a solution that offered the right ratio of CPU/SSD/RAM – even if that meant a bare-metal approach – and a growth model that allowed us to pivot as we learned more about Elasticsearch and expand the offerings of our platform.
Our initial implementation was based on EC2’s i2 instance types. These machines offer a great ratio of CPU, RAM, and fast local SSD storage. However, in the end we were left with an overabundance of SSD storage – somewhere around 8x more than we actually needed for our data. The other alternative on AWS was the r3 instance types, but they offered just under the amount of storage we needed. Now knowing our needs more precisely, we opened our search to Elasticsearch Found (Elastic’s hosted solution), GCE, and Azure. As you can probably guess from the title, GCE’s n1-highmem instance types, combined with the ability to attach a custom number of 375 GB local SSDs, proved to be the winning combination for our needs.
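For illustration, a node in a cluster like the one described above could be provisioned with the gcloud CLI roughly as follows. The instance name, zone, machine type, and SSD count here are hypothetical – this is a sketch of the pattern, not our exact configuration:

```shell
# Create a high-memory instance with two 375 GB local SSDs attached.
# Each --local-ssd flag attaches one fixed-size 375 GB disk, so you
# repeat it once per disk to dial in exactly the storage you need.
gcloud compute instances create es-data-1 \
  --zone us-central1-b \
  --machine-type n1-highmem-8 \
  --local-ssd interface=SCSI \
  --local-ssd interface=SCSI
```

Being able to pick the SSD count independently of the machine type is what let us avoid paying for storage we would never use.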
This is where GCE excels, in my opinion. It’s important to understand that in order to get the best price with AWS you must reserve instances. They offer fairly complicated options where you can pay nothing, some, or all up front for either a one- or three-year commitment. In the world of agile software, making a one-year, let alone a three-year, prediction about your hardware needs is extremely difficult. We find ourselves turning over servers about every six months. GCE, on the other hand, requires no long-term commitments: if you run an instance for the full month, you get the absolute best price.
I explain cloud computing to my family as computing as a utility, just like your electricity or water: you pay for what you use. Ironically, “you pay for what you use” isn’t entirely true of the AWS model, and we find ourselves spending a lot of time doing the math to forecast our AWS costs and determine whether we should make a reservation. I couldn’t imagine spending the same mental effort on my home utilities. Another nice feature of GCE is that instances are billed by the minute (with a 10-minute minimum) instead of by the hour as on AWS. If you roll over compute instances frequently, this can add up.
Networking and firewalls
The goal for our cloud network is for it to be entirely private, with the exception of the HTTP(S) ports on our web load balancer. Production engineering staff access the VPC using OpenVPN with two-factor authentication. No other traffic should be allowed to enter that private network. The important takeaway here is that we almost never use public IPs for our virtual machines. Eventually we would like to remove public IPs from our networks entirely in order to guarantee that they aren’t used. This approach removes a large attack surface from our infrastructure. I should note that we use the cloud VPN features of GCE and AWS to connect our two clouds into one large VPC with encrypted communication between them. More on this in a later post.
AWS does a good job of supporting these private network needs. You have the option of launching instances without a public IP address, and load balancers can be provisioned with an internal IP address for internal service communication. GCE, on the other hand, is externally biased. All load balancers are public facing. Their command line tool, gcloud, has a command for SSHing into an instance, but it defaults to the public IP and doesn’t provide an option for using the internal IP. We are eagerly awaiting updates from GCE in this area.
AWS is region centric, while GCE is project centric. I find myself preferring GCE’s approach here, especially if you are building a multi-region solution. In AWS you typically manage all your virtual machines in a single region and provide high availability by running services in multiple availability zones. At some point you’ll want the high availability benefits of a multi-region approach. In the AWS web console you would switch your entire context to that new region and would not see any of your existing resources – EC2 instances, VPN configs, and firewall rules, to name a few. It’s almost like an entirely new account. In order for your instances to communicate across regions you would need to use public IP addresses – breaking our previously stated goal – and pay for ingress/egress network traffic between the regions. A single GCE project spans all regions, globally. All resources are visible in the same web console context, a single private network is provided, and there are no ingress/egress network charges between regions. As I understand it, Google can accomplish this because they have dedicated fiber between their regions. And here you thought they were just giving away free fiber Internet to be nice.
Managing the firewall rules in any environment is an involved process. This is another place where GCE shines in my opinion, though AWS has made some recent improvements that bring it closer to par. In AWS you assign a security group to a machine at boot time, and you can manage the inbound and outbound firewall rules on the security group, which can apply to multiple machines. Until recently, you could only have a single security group on an EC2 instance and could not change the security group on a running instance; I’m glad to see AWS has addressed both limitations. GCE firewall rules are based on tags for the internal network. I prefer this approach because it’s easy to think about your machine roles in terms of tags, and, most importantly, GCE offers a single-page view of all the firewall rules in your network. I don’t know of a way to see all the rules in a single view in the AWS console.
Example of creating a global tag-based firewall rule in GCE.
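The same kind of tag-based rule can be created from the command line. The rule name, tag, and port below are illustrative assumptions (tcp:9300 is Elasticsearch’s default transport port), not our production rule:

```shell
# Allow Elasticsearch transport traffic between instances tagged es-node.
# Because GCE firewall rules are project-wide, this one rule covers
# matching instances in every region.
gcloud compute firewall-rules create allow-es-transport \
  --network default \
  --allow tcp:9300 \
  --source-tags es-node \
  --target-tags es-node
```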
CLIs, APIs, and more
One last difference I have noticed is in the command line tools and API SDKs offered by each vendor. It’s somewhat subjective, but I have found myself using GCE’s gcloud command line far more than I ever used AWS’s aws-cli. I stand on the side that says a UI will only slow you down compared to a command line tool, and it’s much easier to automate a CLI. One tangible reason I have avoided the AWS command line tool for more complex tasks, like launching an instance, goes back to the firewall rules: in order to launch an instance in a VPC on EC2, you have to provide the VPC subnet ID and EC2 security group ID. These aren’t exactly values you can memorize.
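To make that difference concrete, here is a side-by-side sketch of launching an instance with each CLI. Every ID, name, and zone below is a made-up placeholder:

```shell
# AWS: launching into a VPC requires opaque, per-account IDs
# that you have to look up before every invocation.
aws ec2 run-instances \
  --image-id ami-0abc1234 \
  --instance-type r3.2xlarge \
  --subnet-id subnet-0f12ab34 \
  --security-group-ids sg-0e56cd78

# GCE: the equivalent launch uses human-readable names and tags,
# which double as the firewall-rule selectors described earlier.
gcloud compute instances create worker-1 \
  --zone us-central1-b \
  --machine-type n1-highmem-8 \
  --tags es-node
```

The gcloud invocation is something you can type from memory; the aws-cli one usually sends you back to the console to copy IDs.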
While we haven’t used the GCE APIs extensively, you can tell that they rely heavily on auto-generation, and it has been more difficult to find code examples online. The AWS APIs are top notch and tailored quite well for each language. This is most likely a reflection of the age and maturity of the AWS SDKs, and we expect GCE to continue to improve in this space.
Instance launch times on GCE have been noticeably faster. We haven’t benchmarked it, but instances seem to be accessible in roughly half the time.
Our strategy moving forward involves running more services across both clouds in order to achieve the best high availability we can for our customers. That perspective is why most of this article focuses on the compute offerings of each provider and doesn’t go into the details of managed services. Competition between cloud providers only yields better solutions for companies like ours, and we are grateful for the amazing platforms that both AWS and GCE have built. We can’t wait to see what’s next.
By: Charlie Moad
Oh yeah. Come work with us.