Reflections on moving to the cloud

May 18, 2020

At $work we recently completed a migration to the cloud. I was a big proponent of doing this. However, it wasn't without hiccups. One of the last things we moved made me worry whether the project had been a mistake.

Thinking about this put things in perspective. While not everything is perfect, we gained a lot. I believe the project was the right move.

Here are the pros and cons as I see them.

Con: Performance

This is my major complaint and one I didn't appreciate enough up front.

Going from bare metal to the cloud is a tough pill to swallow in terms of performance. IO in particular is problematic, as there is no good way to get disks as fast as bare metal. On GCP, the best option for fast disks is local SSDs, but they're ephemeral. If you have a workload that can't easily be split horizontally and for which ephemeral storage isn't a good fit, such as a database server, this is a problem. You're left with hard options and you bring yourself closer to an IO wall.

The platforms have an opportunity to improve here, though I understand why it's difficult.

Con: Loss of control

If you integrate with one of the platform's services and it has downtime, there's nothing you can do. That isn't a good feeling when you have customers relying on you.

However, cloud or not, you're reliant on others. Even with bare metal you don't control everything.

I don't see this as a negative with the cloud itself. Rather it is a consideration as you integrate with its services to capitalise on its benefits.

Pro: Security

This is the main benefit. I wrote about this topic before, but to list some ways it's a security win:

Encrypted disks and networking without your effort.
Observe network traffic without doing something like running switches yourself.
Enabling isolation. On bare metal it is not feasible to run separate servers for each service/program. On the cloud you can run small VMs or things like Cloud Functions to do this.
Platform services are a tool for isolation. Your systems can communicate through them as a trusted layer and not have access to each other. Things like Cloud Functions and App Engine enable secure designs.
Platform services save you from running things. If you're not running it, you're not securing it.
Hardware and physical security. Running on bare metal means you're responsible for things like firmware security.
Associating resources with access. For example, what a VM has access to can be tied to its service account, so it is easy to see its access.
Ephemeral resources. Provisioning bare metal servers is harder than provisioning VMs. Reprovisioning frequently means attackers have less chance of retaining access.
Artifacts/immutability. There are opportunities to deploy services from verified artifacts. This gives trust to what's deployed as well as paving the way to have immutable systems.
Security investment. The provider has a strong incentive to invest in security, and at their scale many investments make sense that wouldn't for a small team.
Key management. In addition to letting the platform rotate keys, you can set up your services such that they don't have access to the private key material. If an attacker compromises a key, regular rotation means there's a good chance they'll lose access.
Infrastructure as code. More can be code reviewed, catching malicious actions and security impacting mistakes.
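
To make the key management point above concrete, here is a minimal sketch in plain Python. It is a toy model, not how any platform's key service is implemented: the `KeyRing` class, its `keep_versions` parameter, and the HMAC-based signing are all assumptions made for illustration. It shows the two ideas from the list: callers ask for signatures rather than holding key bytes, and a key that leaked before a rotation stops being useful after it.

```python
import hashlib
import hmac
import secrets


class KeyRing:
    """Toy signing service (hypothetical): callers get signatures, not key bytes."""

    def __init__(self, keep_versions=1):
        self._keys = {}            # version -> secret bytes, private to the ring
        self._current = 0
        self._keep = keep_versions
        self.rotate()              # create the first key version

    def rotate(self):
        """Create a new key version and drop versions outside the window."""
        self._current += 1
        self._keys[self._current] = secrets.token_bytes(32)
        for version in list(self._keys):
            if version <= self._current - self._keep:
                del self._keys[version]

    def sign(self, message):
        """Sign with the current version; return (version, hex signature)."""
        mac = hmac.new(self._keys[self._current], message, hashlib.sha256)
        return self._current, mac.hexdigest()

    def verify(self, message, version, signature):
        """Accept only signatures made with a version that is still retained."""
        key = self._keys.get(version)
        if key is None:
            return False
        expected = hmac.new(key, message, hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, signature)


ring = KeyRing()
# Simulate a compromise: the attacker somehow copies version 1's key bytes.
stolen = ring._keys[1]
forged = hmac.new(stolen, b"transfer", hashlib.sha256).hexdigest()
accepted_before = ring.verify(b"transfer", 1, forged)

ring.rotate()
accepted_after = ring.verify(b"transfer", 1, forged)
```

In this sketch the forged signature is accepted before the rotation and rejected after it, because version 1 has been dropped. A real setup would also need a grace period for legitimate signatures made just before a rotation, which is what a larger `keep_versions` stands in for here.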

Yes, "the cloud is other people's servers" and there is nothing magical about it. All of these can be achieved without the cloud, but the question is whether they are feasible.

Pro: Platform services

While I mentioned these as a security tool, platform services are also a benefit in building systems. They solve infrastructure needs that you'd otherwise be building, running, and maintaining systems for yourself. A few examples:

Cloud Storage: Instead of hosting and dealing with bandwidth yourself.
Pub/Sub: Instead of building a queuing system yourself.
Cloud Functions: Run isolated code.

I look at such services as high quality, reliable tools that let us focus on the necessarily bespoke parts of our services. Make something the provider's problem instead of yours.

Pro: Infrastructure as code

Changes can be reviewed. There's less chance of mistakes and it adds opportunities for collaboration. This goes beyond security.
It makes adding and removing resources easy, potentially even automatic. It opens up options and workflows as well as opportunities to save costs.
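
As a rough illustration of why describing infrastructure as data makes changes reviewable and automatable, here is a minimal sketch in Python. It does not reflect any particular tool; the `plan` function and the resource names are hypothetical. The idea is that the desired state lives in version control, and a diff against what is actually running yields the steps to apply.

```python
def plan(desired, actual):
    """Return the steps that would turn `actual` into `desired`.

    Both arguments map a resource name to its configuration.
    """
    steps = []
    # Resources declared in code but not yet running.
    for name in sorted(desired.keys() - actual.keys()):
        steps.append(("create", name))
    # Resources that exist but whose configuration has drifted.
    for name in sorted(desired.keys() & actual.keys()):
        if desired[name] != actual[name]:
            steps.append(("update", name))
    # Resources running but no longer declared.
    for name in sorted(actual.keys() - desired.keys()):
        steps.append(("delete", name))
    return steps


# Hypothetical example: the reviewed change adds web-2, resizes web-1,
# and removes worker-1.
desired = {"web-1": {"size": "small"}, "web-2": {"size": "small"}}
actual = {"web-1": {"size": "large"}, "worker-1": {"size": "small"}}
steps = plan(desired, actual)
# steps == [("create", "web-2"), ("update", "web-1"), ("delete", "worker-1")]
```

The plan itself is something a reviewer can read before anything changes, and running it on a schedule is one way resources could be added or removed automatically.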

Concluding thoughts

An interesting question is whether I would advocate for this project again given what I know today.

The main points against are that it was a major investment and that it accelerated hitting difficult performance questions. However, the security gains are significant, and security was the primary reason to do it. It's probably the single biggest security gain we could see, and if we hadn't done it, I would still see the benefit of doing it. I believe I would push for it again.

The One and the Many