How Incidents Affect Infrastructure Priorities

On Friday night we had an issue at Braintree that took our website offline for a brief period (it was braintreepayments.com, not the payment gateway). After an incident like this, it’s good for the team to get together to review the postmortem. What went wrong? What went right? What can we do better in the future? Inevitably, the discussion generates a list of projects the team could take on to improve reliability and reduce time to recovery.

One important question for the team to ask after generating the list is: where do these projects fit in with other priorities?

Teams usually want to push the new projects to the top of the queue. It’s understandable: you don’t want the same thing to happen again. It’s worth taking a step back, though, and figuring out whether the new projects really are the top priority for the team.

Prioritizing Infrastructure Projects

When prioritizing infrastructure projects that are designed to address risk, it’s helpful to think about the chance of something going wrong and the cost to the business if it does. It’s the standard risk management formula:

probability of incident * impact of it happening = cost of risk

Of course, it’s impossible to determine precise values to plug into the formula. I wouldn’t use this dogmatically, but it’s a helpful guideline.
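
As a rough illustration of the formula, here is a minimal Python sketch that ranks a few risks by their estimated cost. The `Risk` class, the `rank_by_cost` helper, the risk names, and all of the numbers are hypothetical, made up for the example rather than taken from the incident described above.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    probability: float  # rough guess at the chance of an incident in a year (0.0 to 1.0)
    impact: float       # rough guess at the cost to the business if it happens

    @property
    def cost(self) -> float:
        # probability of incident * impact of it happening = cost of risk
        return self.probability * self.impact


def rank_by_cost(risks):
    """Return the risks ordered from highest to lowest estimated cost."""
    return sorted(risks, key=lambda r: r.cost, reverse=True)


# Hypothetical estimates, for illustration only.
risks = [
    Risk("marketing site outage", probability=0.5, impact=10_000),
    Risk("database failover fails", probability=0.05, impact=500_000),
    Risk("expired TLS certificate", probability=0.2, impact=40_000),
]

ranked = rank_by_cost(risks)
for risk in ranked:
    print(f"{risk.name}: {risk.cost:,.0f}")
```

With these made-up inputs, the unlikely but expensive failover problem ranks above the likelier but cheaper website outage. The estimates are guesses, so the ordering matters more than the exact values.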

How an Incident Affects the Risk Formula

When deciding if an incident should shift priorities, you need to look at the probability and impact. It’s common for the impact to go up after having an incident: customers or stakeholders may be forgiving of something going wrong once, but if it happens again, they’re likely to be more concerned.

For the probability, start by asking whether the things that went wrong were known risks or unknown risks. If they were unknown, then the probability is higher than previously thought, because the earlier estimate didn’t account for them. Ultimately, you need to figure out whether the previously determined probability of an incident is still correct, or if the value needs to change.
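
To make the re-evaluation concrete, here is a small Python sketch that re-scores a risk after an incident and checks whether it now outranks the work already in the queue. The function names, the before and after estimates, and the queued costs are all assumptions invented for the example.

```python
def cost_of_risk(probability, impact):
    """probability of incident * impact of it happening = cost of risk"""
    return probability * impact


def should_jump_queue(revised_cost, queued_costs):
    """True only if the revised cost exceeds every project already queued."""
    return all(revised_cost > cost for cost in queued_costs)


# Hypothetical estimate made before the incident.
before = cost_of_risk(probability=0.10, impact=20_000)

# Hypothetical estimate after the incident: the failure was an unknown risk,
# so the probability is revised upward, and a repeat would cost more goodwill,
# so the impact is revised upward too.
after = cost_of_risk(probability=0.30, impact=50_000)

# Hypothetical costs of the risks that the already-queued projects address.
queued_costs = [25_000.0, 8_000.0]

print(f"before: {before:,.0f}  after: {after:,.0f}")
print("jump to top of queue:", should_jump_queue(after, queued_costs))
```

In this made-up case the incident raises the estimated cost considerably, yet the new project still sits below one item already in the queue. The point of the sketch is the comparison step, not the specific numbers.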

Result

You’re probably not going to go through this exercise and end up with a magical number that tells you exactly what to do. However, following the postmortem discussion, take a methodical approach to changing priorities before you let projects jump to the top of the queue.