From b1308157f821fd5f864f553cb2c5f26e589cf377 Mon Sep 17 00:00:00 2001
From: Martin Ashby
Date: Sun, 5 Feb 2023 16:37:15 +0000
Subject: SRE book

---
 .../2022-12-27-book-site-reliability-engineering.md | 11 -----------
 .../2023-02-05-book-site-reliability-engineering.md | 17 +++++++++++++++++
 2 files changed, 17 insertions(+), 11 deletions(-)
 delete mode 100644 content/posts/2022-12-27-book-site-reliability-engineering.md
 create mode 100644 content/posts/2023-02-05-book-site-reliability-engineering.md

(limited to 'content')

diff --git a/content/posts/2022-12-27-book-site-reliability-engineering.md b/content/posts/2022-12-27-book-site-reliability-engineering.md
deleted file mode 100644
index 84996e7..0000000
--- a/content/posts/2022-12-27-book-site-reliability-engineering.md
+++ /dev/null
@@ -1,11 +0,0 @@
----
-title: "Book - Site Reliability Engineering"
-date: 2022-12-27T16:16:43Z
-draft: true
----
-
-I've been reading [Site Reliability Engineering](https://sre.google/sre-book/table-of-contents/) from Google/O'Reilly. It's an interesting insight into how Google scales their operations work. So far I'm about 1/3 of the way through.
-
-I'm reading this looking for tips to apply at my current job. It's fairly plain that most of the advice and stories are relevant to a huge organization with sprawling complexity, but also enormous resources to manage it. It's easy to see how some advice like holding meaningful postmortems for incidents, and having and maintaining incident response plans, and having extensive monitoring is possible and useful at Google, but less clear which pieces could be applied at a smaller organization.
-
-A secondary take-away is outsourcing as much as possible: when SRE isn't your core capability, and you aren't big enough to need it, use hosted / fully managed services wherever possible; taking away as much of the maintenance burden as possible.
diff --git a/content/posts/2023-02-05-book-site-reliability-engineering.md b/content/posts/2023-02-05-book-site-reliability-engineering.md
new file mode 100644
index 0000000..496d7b8
--- /dev/null
+++ b/content/posts/2023-02-05-book-site-reliability-engineering.md
@@ -0,0 +1,17 @@
+---
+title: "Book - Site Reliability Engineering"
+date: 2023-02-05T15:53:43Z
+draft: false
+---
+
+I've read [Site Reliability Engineering](https://sre.google/sre-book/table-of-contents/) (SRE) from Google/O'Reilly. It's an interesting insight into how Google scales their operations work.
+
+A core theme of the book is ensuring that 'operations' work, i.e. managing servers, computers, networks, hardware and applications, scales [sub-linearly](https://stackoverflow.com/questions/32311924/what-are-sublinear-algorithms) with both the number of users of a service and the number of services the company provides. The book is really a series of shorter articles about how Google accomplishes this through technology, business processes and personal interactions.
+
+A lot of the guidance in the book seems more applicable at large scales (100s of engineers) than at smaller organizations. For example, configuring extensive monitoring to check that services are meeting their 'service level objectives' (SLOs), and to alert when they're not, can be lots of work, especially if the objectives are not extremely well defined to begin with. It can be hard to justify this work alongside delivering the actual minimum product which will satisfy the customer demand. That's not to say monitoring should be ignored completely until later, but getting monitoring done 'right' to the standard shown in the book is likely out of reach for organizations without a dedicated SRE team.
+
+Some advice seems useful regardless of scale, for instance holding meaningful post-mortems on incidents, and having at least some basic incident response plans.
+
+The real take-away message for me is: outsource as much as possible. When SRE isn't your core capability, use hosted or fully managed services wherever possible and leave the operations work to companies that specialize in it. This might be public cloud services like Amazon Web Services or Google Cloud Platform; however, in my experience those platforms still end up requiring dedicated teams to manage them. For example, managing Identity & Access Management (IAM) can get complex very quickly. Using 'infrastructure as code' (IaC) tools like Terraform can help to keep the complexity under control, but these tools bring their own cognitive overhead as well.
+
+Services which offer to handle _all_ the infrastructure concerns, like [darklang](https://darklang.com), [shuttle.rs](https://shuttle.rs) or [webapp.io](https://webapp.io/), are very attractive for this reason. See my previous post on ['serverless'](https://mfashby.net/posts/2022-09-09-serverless/) for some thoughts about those! If I were to have a great idea for a web-based SaaS and I built it, I would likely choose to use one of these services; probably shuttle.rs.
\ No newline at end of file
-- 
cgit v1.2.3-ZIG