Making software and software development teams. Mostly the people parts.

Large-scale Incident I'm Learning From

The Rogers Commission Report.

We’ve had a period of fragility at work. It’s nothing out of the ordinary for a fast-diverging tech shop, but it’s still something we’ve put strategic focus and operational effort into fixing. So I’ve been consuming content about how large-scale failures are handled.

The Challenger disaster and the subsequent Rogers Commission Report are a large-scale example of incident and post-incident response. The story continues to the Columbia disaster, which reveals how few of the Rogers Commission’s recommendations were implemented. What interested me here, though, is that it got me thinking about a few things: Who plays the role of Richard Feynman in my shop? Does every incident need a Feynman, or are some, or even many, incidents straightforward enough not to require one? What is the cost of someone going full-on Feynman, not just in time but in morale? And so on.