content/posts/2023-02-05-book-site-reliability-engineering.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19

---
title: "Book - Site Reliability Engineering"
date: 2023-02-05T15:53:43Z
draft: false
params:
  comments: true
---

I've read [Site Reliability Engineering](https://sre.google/sre-book/table-of-contents/) (SRE) from Google/O'Reilly. It's an interesting insight into how Google scales their operations work.

A core theme of the book is ensuring that 'operations' work i.e. managing servers, computers, networks, hardware and applications scales [sub-linearly](https://stackoverflow.com/questions/32311924/what-are-sublinear-algorithms) with both the number of users of a service, and the number of services the company provides. The book is really a series of shorter articles about how Google accomplishes this through technology, business processes and personal interactions.

A lot of the guidance in the book seems more applicable at large scales (100s of engineers) rather than smaller organizations. For example; configuring extensive monitoring to check services are meeting their 'service level objectives' (SLOs) and alert when they're not can be lots of work, especially if the objectives are not extremely well defined to begin with. It can be hard to justify this work alongside delivering the actual minimum product which will satisfy the customer demand. That's not to say monitoring should be ignored completely until later, but getting monitoring done 'right' to the standard shown in the book is likely out of reach for organizations without dedicated SRE team.

Some advice seems useful regardless of scale, for instance holding meaningful post-mortems on incidents, and having at least some basic incident response plans.

The real take-away messsage for me is: outsourcing as much as possible. When SRE isn't your core capability, use hosted or fully managed services wherever possible and leave the operations work to companies that specialize in it. This might be public cloud services like Amazon Web Services or Google Cloud Platform, however in my experience those platforms still end up requiring dedicated teams to manage them; for example managing Identity & Access Management (IAM) can get complex very quickly. Using 'infrastructure as code' (IAC) tools like terraform can help to keep the complexity under control, but these tools bring their own cognitive overhead as well.

Services which offer to handle _all_ the infrastructure concerns, like [darklang](https://darklang.com) or [shuttle.rs](https;//shuttle.rs), or [webapp.io](https://webapp.io/) are very attractive for this reason. See my previous post on ['serverless'](https://mfashby.net/posts/2022-09-09-serverless/) for some thoughts about those! If I was to have a great idea for a web-based SAAS and I built it, I would likely choose to use one of these services; probably shuttle.rs.