TL;DR warning – this article is about 1500 words. I prepared this for Shlomo Swidler's panel "Writing Code for Many Clouds" at CloudConnect 2010.
At Sonoa, we have an enterprise product which we turned into a service called Apigee. From the first, we needed to move beyond just being packaged as a VM and “deployable anywhere” to really living in the cloud.
This is what we’ve learned so far, some of which we anticipated and some of which we reacted to. Of the three basic parts of running a service (build, deploy, and manage), only deploy and manage really change. The big difference is in operationalizing the system.
Most recently we realized that we needed to be highly available (HA) across providers and get total control of our latency, so we are building a new datacenter on Rackspace as well. This is a work in progress so I’ll be reporting from the front lines.
Finally, we’ve helped implement a multi-cloud architecture for ING, which has taught us something about where multi-cloud services may be headed.
The first cloud: EC2
1. Build:
We realized at the outset that we wanted to build a service that would be portable, so we chose not to use the least portable features of AWS, such as S3. While it would have made our life simpler for some of the assets we were managing, there was no equivalent on other clouds (and Walrus, the Eucalyptus storage subsystem that mimics S3, does not count as an equivalent, even though it really works). We did use EBS (Elastic Block Store), which is so close to a SAN that we felt it was reasonably standard; what forced our hand was that we needed to solve for persistence and performance.
The first and biggest step for any system will be building it as a VM if you haven’t done so before. Once you have done this, you can practically drop it onto any box. You’ve become independent of the hardware and other aspects of the operating environment.
Beyond build, you have to focus on setting up the network topology, configuring the virtual boxes once they’re up, and managing the result.
2. Deploy:
The next phase is figuring out how you bring up instances in your cloud platform. EC2 has its own interfaces for this, and Rackspace has different ones. RightScale normalizes these interfaces and provides a UI. We also evaluated libcloud, an open source package with no UI, but aren’t using it.
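To make the idea of normalizing provider interfaces concrete, here is a minimal Python sketch of a provider-neutral launch layer. It is not how RightScale or libcloud work internally; `CloudDriver`, `FakeDriver`, and `launch_everywhere` are hypothetical names, the drivers are in-memory stand-ins that make no network calls, and the size names in the usage note are only illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Protocol


@dataclass
class Instance:
    provider: str
    instance_id: str
    size: str  # provider-specific size name


class CloudDriver(Protocol):
    """Hypothetical minimal interface that every provider adapter satisfies."""

    name: str

    def launch(self, size: str) -> Instance: ...


@dataclass
class FakeDriver:
    """In-memory stand-in for a real provider adapter (assumption: no real API calls)."""

    name: str
    size_map: Dict[str, str]  # neutral size name -> provider-specific size name
    launched: List[Instance] = field(default_factory=list)

    def launch(self, size: str) -> Instance:
        native_size = self.size_map[size]  # translate the neutral request
        instance_id = f"{self.name}-{len(self.launched) + 1}"
        instance = Instance(self.name, instance_id, native_size)
        self.launched.append(instance)
        return instance


def launch_everywhere(drivers: List[CloudDriver], size: str) -> List[Instance]:
    """Issue the same neutral request against every configured provider."""
    return [driver.launch(size) for driver in drivers]
```

Calling `launch_everywhere([FakeDriver("ec2", {"small": "m1.small"}), FakeDriver("rackspace", {"small": "512MB"})], "small")` returns one `Instance` per provider, each carrying that provider's own size name. The calling code never touches a provider-specific API, so adding a cloud means writing one more adapter.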
Now that you’re hardware independent, you can run as many instances of your service’s components as you can afford, and each of them has to be configured. The main solutions here are Chef and Puppet, both open source. We use Capistrano for scripting automation.
Then you need to configure the topology of the different subsystems you’ve built. Here things get interesting. EC2 does not support multicasting across your default virtual network; this was tough for us and would be for anyone relying on clustering. VPN-Cubed from CohesiveFT let us build a private network within our EC2 environment and do the multicasting we needed.
Once your network is up and you can push software, it’s just the same as having your own private datacenter. You can connect from anywhere, manage instances, and get alerts and reports.
3. Manage:
That brings us to management. We use Nagios for monitoring our virtual boxes. We learned that we needed to have a separate machine outside of EC2 as a “monitor monitor”: a Nagios instance that monitors the health and responsiveness of the Nagios box in the cloud environment. We use RightScale for managing all of our accounts and instance creation. With this setup we’ve had zero downtime since our launch in late August of last year.
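The “monitor monitor” idea can be sketched in a few lines of Python. This is not a Nagios configuration; it only illustrates the logic of an outside watcher that probes the primary monitor and raises one alert after several consecutive failures. The class name, the `probe` and `alert` callables, and the default threshold of three are all assumptions made for illustration.

```python
from typing import Callable


class MonitorMonitor:
    """Watches the primary monitor from a machine outside its cloud."""

    def __init__(
        self,
        probe: Callable[[], bool],      # returns True if the primary monitor responds
        alert: Callable[[str], None],   # delivers an alert (email, pager, ...)
        max_failures: int = 3,          # assumed threshold, tune to taste
    ) -> None:
        self.probe = probe
        self.alert = alert
        self.max_failures = max_failures
        self.failures = 0

    def check_once(self) -> bool:
        """Run one probe; alert exactly once when the failure threshold is reached."""
        try:
            healthy = self.probe()
        except Exception:
            # A probe that raises (timeout, refused connection) counts as a failure.
            healthy = False

        if healthy:
            self.failures = 0
            return True

        self.failures += 1
        if self.failures == self.max_failures:
            self.alert(
                f"primary monitor unreachable for {self.failures} consecutive checks"
            )
        return False
```

In practice `check_once` would run on a schedule, and `probe` would be something like an HTTP or TCP check against the Nagios host. Requiring several consecutive failures keeps a single dropped packet from paging anyone.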
But the evils of cloud computing were present as well as the good. EC2 does not guarantee the availability of an instance, only the availability of a zone. As a result we found that the latency of our service had a high degree of jitter (between 5 and 15 ms), which was acceptable but not ideal. The lack of control in this environment means that we’ve been buying instances ahead of our need in order to guarantee not just availability but performance. This is one of the headaches that cloud computing is supposed to transcend.
In a nutshell: “it’s elastic but you have to manage it.”
So in order to manage the network performance issues (achieve constant performance AND availability) we realized that we needed to go multi-cloud. We also realized that our core service principle – we’re a cloud service gateway and active proxy for people's API traffic – meant that we had to have a “strongest link” architecture so that no set of failures at a single cloud provider could take down our service.
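One way to picture the “strongest link” principle is as a selection rule: traffic goes to any provider that is still healthy, so the service is down only when every provider is down. The Python sketch below is an illustration under that assumption, not our production logic; `ProviderStatus` and `pick_provider` are hypothetical names, and preferring the lowest-latency healthy provider is just one reasonable policy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProviderStatus:
    name: str
    healthy: bool
    latency_ms: float  # most recent measured latency


def pick_provider(statuses: List[ProviderStatus]) -> Optional[str]:
    """Return the healthy provider with the lowest latency, or None if all are down."""
    candidates = [s for s in statuses if s.healthy]
    if not candidates:
        return None  # only a failure at every provider takes the service down
    return min(candidates, key=lambda s: s.latency_ms).name
```

With both providers healthy the faster one wins; if one fails, the other takes all traffic regardless of its latency.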
We’re now building on what we’d anticipated and developing a new instance of our service at Rackspace. The big changes here are the level of control we have out of the gate for network topology, process isolation, CPU performance… and price, which is higher.
The second cloud: Rackspace
1. Build:
Architecturally, the big differences are database replication and cross-provider load balancing. This places really specific requirements on your networking design and technology as well as your database design.
One of the things our service does is store all of our customers’ cloud API traffic for their later use in analytics. Thinking about data modularly helps with replication. In a replicated world we need to break out types of datasets, such as customer information and service configuration, into smaller chunks that can meet higher-speed replication requirements cost-effectively, and break them away from all the historical traffic data. Even the traffic data needs to be handled differently in this world.
We are now sharding the database into circular tables, where incoming data is always written to a write-only area and the write target rotates to the next area every five minutes. For our user base a 5-minute delay on analytics is more than acceptable (compare this with the SLA for Google Analytics), and the working set of data used for traffic management is handled separately in realtime. All of this means that we can have either a hot standby or live-live dual-cloud configuration without breaking our customer promise that they can tweak their service at any time at all, and that their analytics are consistently available. This will also let us evolve both sides of the service as it grows.
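The circular-table rotation can be sketched as follows. This is a simplified in-memory model, not our actual schema: the class name, the choice of 12 slots, and the decision to discard old rows when a slot is reused are assumptions, and writes are assumed to arrive roughly in time order. Only the five-minute rotation comes from the description above.

```python
from typing import Any, List, Optional

ROTATION_SECONDS = 300  # five minutes, as described in the text
NUM_SLOTS = 12          # assumption: one hour of traffic held in the ring


class CircularTables:
    """Ring of tables; exactly one slot accepts writes at any moment."""

    def __init__(self, num_slots: int = NUM_SLOTS) -> None:
        self.num_slots = num_slots
        self.rows: List[List[Any]] = [[] for _ in range(num_slots)]
        # Which five-minute window each slot currently holds (None = never used).
        self.window: List[Optional[int]] = [None] * num_slots

    def write(self, epoch_seconds: int, record: Any) -> int:
        """Append a record to the slot that is active at the given time."""
        window = epoch_seconds // ROTATION_SECONDS
        slot = window % self.num_slots
        if self.window[slot] != window:
            # The ring has wrapped: this slot is reused for a new window.
            # A real system would archive the old rows before clearing them.
            self.rows[slot] = []
            self.window[slot] = window
        self.rows[slot].append(record)
        return slot

    def closed_rows(self, epoch_seconds: int) -> List[Any]:
        """Rows in every slot except the one currently being written.

        These slots are no longer changing, so they are safe to replicate
        or to query for analytics. Rows are returned in slot order.
        """
        current = (epoch_seconds // ROTATION_SECONDS) % self.num_slots
        result: List[Any] = []
        for slot in range(self.num_slots):
            if slot != current:
                result.extend(self.rows[slot])
        return result
```

Because readers and replication only ever touch closed slots, they never contend with the writer, and the five-minute analytics delay falls directly out of the rotation interval.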
2. Deploy:
Deployment tooling stays the same: our old friends RightScale and Capistrano are used to spin up and configure instances.
On the networking side, obviously you need to connect your clouds securely in order to replicate between them as well as to exchange performance data which can be used for load balancing. We found again that VPN-Cubed helps us establish a trusted connection between our heterogeneous cloud environments.
3. Manage:
Since we are using standard monitoring and management tools (Nagios, RightScale, and Capistrano), these all work in both environments, and our approach of using a “monitor monitor” doesn’t change, although now we need to monitor the monitors in each cloud.
Is there an easier way?
For an infrastructure play like Apigee, we don’t think so. Given our customer promise of near-zero and predictable latency we need as much control as possible. For an application-level service play though, we think some parts can be easier. We’re built on Sonoa technology that manages all of our cloud API traffic processing, as is ING, a financial services company that’s moving to cloud. Their challenge is elasticity in financial modeling, specifically the Monte Carlo simulation workload, which is compute-intensive and highly intermittent in its use of resources. When you’re running the simulation, you need all the compute resource you can get. When you’re not running a simulation, you need almost none.
Cloud infrastructure like EC2 and Rackspace takes care of the racking & stacking problem associated with scaling up for Monte Carlo. You still need to manage that with a tool like RightScale or libcloud plus your configuration and deployment tool of choice. But at the higher level where you’re load-balancing between clouds you don’t necessarily need a VPN, as there’s no data replication requirement. At this layer they’ve implemented a secure API which is called by internal clients, and then this API request is load-balanced by Sonoa’s API gateway. The gateway then calls the right cloud based on policies set by the monitoring and scheduling software. So in this situation you are monitoring your cloud instances and letting the API gateway handle the dirty work of dispatching and securing the calls.
10 Lessons Learned From Building to Multiple Clouds:
1. Get everyone comfortable with virtualization fundamentals, from developers to admins.
2. Limit your dependency on provider-specific APIs by using third-party tools that manage this for you.
3. There may be SLAs on your cloud instances but there are no SLAs on the APIs your cloud providers give you.
4. Refuse to use services that have no equivalent in other clouds. It will cost you more in rearchitecture than you gain by using it.
5. Understand the cost trade-offs for your business of the different clouds’ strengths, especially in the dimensions of availability, price, and performance.
6. Anticipate your needs for data replication and design your databases accordingly.
7. Pay attention to your networking requirements and network topology.
8. Consider the granularity of the requests that you need to load balance: is it at the service or API layer, or is it finer-grained than that?
9. You’ll still buy more than you need but the waste ratio is much less in the cloud.
10. Monitor the monitors!