This week’s outage at Cloudflare took down a significant percentage of the internet. Cloudflare has posted a placeholder post-mortem that will hopefully be fleshed out with more detail on what went wrong and why.
I think there’s a lot to learn from studying specific instances of fragility. At my shop we host blameless post-incident retros that use the 5 Whys technique for each outage we have. While nobody hopes for an incident, I look forward to our retros: to learn, to gauge how mature my shop is at running retros and learning from them, and to spark conversations with my fellow technologists about fragility and ways to harden against it or mitigate its risk.