Lessons Learned From Twenty Years of Site Reliability Engineering - Nick the Sick's blog. Writings, projects and ideas.

# Lessons Learned From Twenty Years of Site Reliability Engineering - sre.google

Synced: [[2023_11_30]] 6:03 AM
Last Highlighted: [[2023_10_28]]
Tags: [[Explainer]] [[Software]]

![rw-book-cover](https://readwise-assets.s3.amazonaws.com/static/images/article3.5c705a01b476.png)

## Highlights

[[2023_10_28]] [View Highlight](https://read.readwise.io/read/01hdtk2tzedbvgczv2vnwbbad0)
> We learned the hard way that during an incident, we should monitor and evaluate the severity of the situation and choose a mitigation path whose riskiness is appropriate for that severity. In a best case scenario, a risky mitigation resolves an outage. In a worst case scenario, the risky mitigation misfires and the outage is prolonged by something that was intended to fix it. Additionally, if everything is broken, you can make an informed decision to bypass standard procedures.

[[2023_10_28]] [View Highlight](https://read.readwise.io/read/01hdtk3yegerdtkp500gxhydss)
> Had we [canaried those global changes](https://sre.google/workbook/canarying-releases/) with a progressive rollout strategy, this outage could have been curbed before it had global impact.
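
To make the canarying highlight concrete, here is a minimal sketch of a progressive rollout loop. It is not taken from the article: the function name `progressive_rollout`, the callbacks `apply_change`, `is_healthy`, and `rollback`, and the stage fractions are all hypothetical, chosen only to illustrate the idea of widening exposure step by step and stopping at the first sign of trouble.

```python
from typing import Callable, Sequence


def progressive_rollout(
    apply_change: Callable[[float], None],
    is_healthy: Callable[[], bool],
    rollback: Callable[[], None],
    stages: Sequence[float] = (0.01, 0.05, 0.25, 1.0),  # assumed fractions, not from the article
) -> float:
    """Apply a change stage by stage and roll back on the first failed health check.

    Returns the fraction of the fleet left running the change:
    the last stage on success, 0.0 after a rollback.
    """
    if not stages:
        raise ValueError("at least one rollout stage is required")

    for fraction in stages:
        apply_change(fraction)  # expose the change to this fraction of the fleet
        if not is_healthy():    # hypothetical signal, e.g. error rate under a threshold
            rollback()          # revert before the change reaches a wider audience
            return 0.0
    return stages[-1]


if __name__ == "__main__":
    applied: list[float] = []
    # Simulate a change that only misbehaves once a quarter of the fleet has it.
    final = progressive_rollout(
        apply_change=applied.append,
        is_healthy=lambda: applied[-1] < 0.25,
        rollback=lambda: print("rolling back"),
    )
    print(applied, final)
```

In this simulated run the change is reverted at the 25% stage and never reaches the full fleet, which is the "curbed before it had global impact" outcome the highlight describes. A real rollout would also wait between stages so that slow-burning problems have time to surface; that delay is left out here to keep the sketch short.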