Ideas from Building Secure & Reliable Systems (book)
I recently read Building Secure & Reliable Systems, a book written by a team from Google. There are many interesting ideas inside. As you might imagine, many of the approaches are applicable only if you have significant scale and resources like Google does.
There are plenty of concepts useful for any team though. Here are a few that jumped out at me.
Understandability and simplicity are crucial to security
If something is simple and understandable:
- It's less likely a change will lead to a defect.
- It makes responding to incidents easier.
- It increases confidence in the system's security, because a simple system is easier to reason about.
No surprise, but it's nice to see it front and centre.
Load shedding
Consider a system made up of different instances. If one instance receives a lot of traffic and can't keep up, how should it behave?
In the naive case, it keeps accepting traffic until it falls over. When that happens, the traffic shifts to another instance, which likewise falls over. You end up with a cascading, uncontrolled failure.
A better way is to stabilise at a load where the instance won't crash: once an instance reaches a certain load, stop sending it more traffic, for example by having it reject requests with errors.
The trick is knowing where to set that limit. In many cases you won't know exactly where your system will have issues; for example, not every request triggers the same behaviour or costs the same to serve.
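To make the mechanism concrete, here's a minimal Go sketch of one way to shed load: cap the number of in-flight requests and reject anything beyond the cap with an error, rather than queueing work until the process falls over. The book doesn't prescribe an implementation, and the limit of 100 is an arbitrary placeholder you'd tune with load testing.

```go
package main

import "net/http"

// shed wraps a handler and rejects requests once the number of
// in-flight requests hits a fixed limit, instead of accepting work
// until the instance falls over.
func shed(limit int, next http.Handler) http.Handler {
	inflight := make(chan struct{}, limit) // semaphore sized to the limit
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case inflight <- struct{}{}: // a slot is free: serve the request
			defer func() { <-inflight }()
			next.ServeHTTP(w, r)
		default: // at capacity: shed this request with an error
			http.Error(w, "overloaded, retry later", http.StatusServiceUnavailable)
		}
	})
}

func main() {
	hello := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", shed(100, hello))
}
```

Counting in-flight requests is the crudest possible signal; CPU, memory, or queue latency may track the real bottleneck better, which is exactly the "you won't know exactly where" problem above.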
Compartmentalise across different dimensions
The idea is to limit blast radius. The book talks about compartmentalising along these dimensions:
- By role: For example, a service account has access only to what it needs.
- By location: For example, a service in one data centre can only access things in that location.
- By time: For example, by rotating credentials. If compromised credentials are later rotated, the stolen copies become invalid (see the sketch after this list).
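The time dimension is the easiest to sketch. Here's a hypothetical Go example of a short-lived credential, with names I made up for illustration: once it expires (or is rotated), a stolen copy stops working.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// credential is a hypothetical short-lived secret. Because it expires,
// a stolen copy is only useful until the next rotation.
type credential struct {
	secret    string
	expiresAt time.Time
}

var errExpired = errors.New("credential expired; fetch a fresh one")

// check rejects the credential once its lifetime has passed.
func (c credential) check(now time.Time) error {
	if now.After(c.expiresAt) {
		return errExpired
	}
	return nil
}

func main() {
	cred := credential{secret: "s3cret", expiresAt: time.Now().Add(15 * time.Minute)}
	fmt.Println(cred.check(time.Now()))                // <nil>: still within its lifetime
	fmt.Println(cred.check(time.Now().Add(time.Hour))) // rejected: expired
}
```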
Degrade instead of fail
The idea is to have multiple versions of system components, each less featureful but more robust. They are fallbacks you switch to instead of accepting downtime.
- High capacity: This is your normal system. It may do things like talk to external databases and have features that are nice but not critical.
- High availability: This is your normal system configured with fewer dependencies. For example, it might operate on cached data rather than depending on an external database.
- Low dependency: This is a separate implementation, one that is simplified and with fewer features.
Each level down degrades service further, but avoids an outage. As with load shedding, the hard question is when to switch over.
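Here's a rough Go sketch of the tiers for a price lookup: a live database (high capacity), a local cache (high availability), and a hard-coded default (low dependency). All names are made up, and the database call always fails here to simulate an outage.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	cache        = map[string]int{"book": 20} // possibly stale local data
	defaultPrice = 25                         // simplified, dependency-free fallback
)

// queryDatabase stands in for a call to an external database; it
// always fails here to simulate an outage.
func queryDatabase(id string) (int, error) {
	return 0, errors.New("database unreachable")
}

func lookupPrice(id string) int {
	if price, err := queryDatabase(id); err == nil {
		return price // high capacity: fresh data from the external database
	}
	if price, ok := cache[id]; ok {
		return price // high availability: cached data, possibly stale
	}
	return defaultPrice // low dependency: degraded answer, but not an outage
}

func main() {
	fmt.Println(lookupPrice("book")) // 20, served from cache during the "outage"
}
```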
Limit dependency on time
Specifically, limit dependency on an external notion of time. Such a dependency can cause unexpected behaviour, so be careful about it.
Consider a system that verifies signatures during deploys, where verification depends on time. If the system is in an unexpected state, its notion of time may be broken. It may then fail to verify a signature that is otherwise valid, leaving you unable to deploy.
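A small Go sketch of that failure mode, assuming the signature carries a notBefore/notAfter validity window: the check is only as good as the verifier's clock.

```go
package main

import (
	"fmt"
	"time"
)

// validAt reports whether a signature's validity window covers the
// given time. The window is hypothetical; the point is that the
// result depends entirely on the verifier's clock being right.
func validAt(notBefore, notAfter, now time.Time) bool {
	return !now.Before(notBefore) && !now.After(notAfter)
}

func main() {
	notBefore := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	notAfter := notBefore.Add(30 * 24 * time.Hour)

	// A machine with a correct clock accepts the signature.
	fmt.Println(validAt(notBefore, notAfter, notBefore.Add(24*time.Hour))) // true
	// A machine whose clock reset to the epoch rejects the same,
	// otherwise-valid signature, and the deploy is blocked.
	fmt.Println(validAt(notBefore, notAfter, time.Unix(0, 0).UTC())) // false
}
```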
Monitor all files
Google's servers are deployed with a listing of all files on them and the checksums each should have. The servers monitor their files to verify this continues to be the case. If one changes unexpectedly, it's reported and repaired.
It sounds attractive, though I wonder how it can be performant. Presumably there are many exceptions, such as for data files that are expected to change during normal operation.
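In spirit the check itself is simple, something like this Go sketch that compares each file against an expected SHA-256 digest from a manifest (the path and digest below are placeholders). Doing it continuously across a fleet, and repairing mismatches, is the hard part.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
)

// manifest maps each monitored path to its expected SHA-256 digest.
// These entries are placeholders; a real manifest would be generated
// at deploy time.
var manifest = map[string]string{
	"/usr/local/bin/server": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
}

// verify recomputes a file's digest and reports a mismatch, which
// would indicate an unexpected change.
func verify(path, want string) error {
	data, err := os.ReadFile(path)
	if err != nil {
		return err
	}
	got := sha256.Sum256(data)
	if hex.EncodeToString(got[:]) != want {
		return fmt.Errorf("%s: checksum mismatch, file changed unexpectedly", path)
	}
	return nil
}

func main() {
	for path, want := range manifest {
		if err := verify(path, want); err != nil {
			fmt.Println("alert:", err) // report; Google's setup also repairs the file
		}
	}
}
```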
Use strong types
And avoid primitive types like strings. Distinct types prevent passing conceptually invalid parameters. They also argue for strongly typed languages. No argument from me!
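The classic illustration in Go (mine, not the book's): wrap primitive strings in distinct types, and the compiler rejects a call with swapped arguments that a plain (string, string) signature would happily accept.

```go
package main

import "fmt"

// Distinct types for distinct concepts.
type UserID string
type DocumentID string

func grantAccess(user UserID, doc DocumentID) {
	fmt.Printf("granting %s access to %s\n", user, doc)
}

func main() {
	user := UserID("alice")
	doc := DocumentID("design-doc-42")

	grantAccess(user, doc)
	// grantAccess(doc, user) // compile error: cannot use doc (DocumentID) as UserID
}
```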
Supply chain security
All code and artifacts to be deployed should be signed and verified for provenance, and systems should reject deploys that lack either. This prevents deploying malicious code, as well as mistakes such as building a binary on your laptop with debug flags enabled and deploying it.
An interesting aspect of this is that you have to sufficiently lock down the build system for it to be effective. If anyone can modify the build system, then it's possible to subvert the process.
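Here's a stripped-down Go sketch of just the verify step, using an ed25519 signature over the artifact's digest. A real pipeline involves much more (key management, provenance metadata, and the locked-down build system above); this only shows the reject-on-failure shape.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
)

func main() {
	// The build system holds the private key; deployers trust the public key.
	buildPub, buildPriv, _ := ed25519.GenerateKey(rand.Reader)

	artifact := []byte("compiled binary bytes")
	digest := sha256.Sum256(artifact)

	// Build system: sign the artifact's digest at build time.
	sig := ed25519.Sign(buildPriv, digest[:])

	// Deploy system: verify before deploying, and reject on failure.
	if !ed25519.Verify(buildPub, digest[:], sig) {
		fmt.Println("rejecting deploy: unsigned or tampered artifact")
		return
	}
	fmt.Println("provenance verified, deploying")
}
```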
Minimize the impact of any single bug
Since it's impossible to prevent all security bugs, design systems such that the impact of any one bug is limited. For example, have multiple layers of defense, or limit how much components trust each other by drawing well-defined boundaries.
What goes along with this is that you should plan for protections to fail. This means being able to detect when that happens and recover from it. Apparently many compromises go undetected for long periods of time. Scary.
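As a small, hypothetical illustration of layering: have both the API boundary and the storage layer check authorisation independently, so a bug that lets a request slip past one check is still caught at the other.

```go
package main

import (
	"errors"
	"fmt"
)

// Layer 1: the API boundary checks authorisation.
func handleDelete(user, record string) error {
	if !isAdmin(user) {
		return errors.New("api: permission denied")
	}
	return storageDelete(user, record)
}

// Layer 2: the storage layer does not trust that its callers checked.
func storageDelete(user, record string) error {
	if !isAdmin(user) {
		return errors.New("storage: permission denied")
	}
	fmt.Println("deleted", record)
	return nil
}

func isAdmin(user string) bool { return user == "admin" }

func main() {
	fmt.Println(handleDelete("mallory", "users.db"))  // denied at the boundary
	fmt.Println(storageDelete("mallory", "users.db")) // denied again, even if the API layer had a bug
}
```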
Conclusion
This is a worthwhile read. It gets a lot of things right and shows how far you can take security given a lot of resources. It makes me feel more confident relying on Google for things like GCP and Chrome (there's a chapter in the book about the Chrome team).