Roll Back or Fail Forward

Updated: April 6th, 2024

How to decide between rolling back and failing forward (or fixing in place) as a policy for a given experience in a given environment.

Rolling back is generally the more mature habit. It lets everyone take a breath, then identify and fix problems (ideally with test coverage) in a low-pressure environment. Failing forward, on the other hand, is cowboy coding. But it's fast, and when it works it generally lets teams move on to the next problem much more quickly and gets value into the hands of users sooner.

Often I've found myself in situations where an engineering squad is making an argument for why they should fail forward in a given environment.

As an engineering leader, I use the following framework:

If users and stakeholders of the experience would be fine with ten times (an order of magnitude more than) the downtime of a typical incident, then failing forward is fine.

Example 1: A production retail site. A typical incident might be 12 minutes long. 12 × 10 = 120 minutes. During that time the business would lose an estimated $50k in sales. That's not acceptable, so the policy should be to roll back failed changes.

Example 2: A non-prod environment is used by a vertical squad (PM, design, and engineers) to demo changes to stakeholders once a week. The typical incident time is the same, but here 120 minutes of downtime is fine unless it happens to overlap with the weekly demo. Failing forward is an acceptable policy in this case, trusting the team to manage around that demo. (The policy can be written so that the team gets the autonomy to choose between rolling back and fixing in place by accepting accountability for managing around its own schedule. This is probably a separate post.)
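The 10x rule from the two examples can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a tool I actually use: the names `Environment`, `worst_case_minutes`, and `recommend_policy` are made up for this post, and the tolerable-downtime figures (15 and 240 minutes) are assumptions chosen to match the two examples, since the real number comes from asking users and stakeholders.

```python
from dataclasses import dataclass

# The framework's multiplier: plan for ten times the typical incident length.
TOLERANCE_MULTIPLIER = 10


@dataclass
class Environment:
    """Hypothetical description of an environment, for illustration only."""
    name: str
    typical_incident_minutes: float
    # Assumed input: how much downtime stakeholders say they can live with.
    max_tolerable_downtime_minutes: float


def worst_case_minutes(env: Environment) -> float:
    """Typical incident length scaled by an order of magnitude."""
    return env.typical_incident_minutes * TOLERANCE_MULTIPLIER


def recommend_policy(env: Environment) -> str:
    """Fail forward only if 10x the typical downtime would still be acceptable."""
    if worst_case_minutes(env) <= env.max_tolerable_downtime_minutes:
        return "fail forward"
    return "roll back"


# Example 1: production retail site, where 120 minutes of lost sales is not acceptable.
# The 15-minute tolerance is an assumed figure.
retail = Environment("prod retail site", 12, 15)

# Example 2: weekly demo environment, where 120 minutes is fine outside the demo slot.
# The 240-minute tolerance is an assumed figure.
demo = Environment("non-prod demo env", 12, 240)

for env in (retail, demo):
    print(f"{env.name}: {worst_case_minutes(env):.0f} min worst case -> {recommend_policy(env)}")
```

The sketch deliberately leaves out the judgment calls, such as the weekly demo window in Example 2, which is exactly the part the team has to own.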

A counterpoint is when you have a team that's too aggressive for your current state. A while back I was on a team that had been acquired, and many of us were still acting like it was a startup even after revenue in some streams had 100xed. That team needed a strong counterpoint to its previous behaviors, and so "never fail forward in any environment" became our tacit policy.

The tacitness there is important. As always, the best policy is often no policy. A team that expects to be told what to do in every situation needs more capability, more confidence, or both. A team that requires it is functionally useless other than as a feature factory. First principles and a strong values-based decision-making culture, on the other hand, obviate the need for explicit policy in every space.

Final thought: If you do create policy, enforce it. Will Larson has the final word on this.
