When bad things happen, we love to look for the explanation.
"The launch didn't meet our expectations because of X."
"The outage happened because of Y."
"We made that bad hire because of Z."
Having a single simple explanation is satisfying to the explainer and listener for a couple of big reasons.
- Blame is avoided with singular explanations. Even if your company is understanding of mistakes, no one enjoys the feeling of admitting a mistake. If you point at a single simple explanation, chances are that most people are then blameless. Our ego appreciates it when we dodge responsibility.
- Simple is easier. It can be mentally exhausting to understand a big, complex project. Imagine the complexity of unraveling exactly where things went south. But you're saying that it failed because we didn't do user testing? Ok, that makes sense! Big mistake! We'll do user testing next time. Let's move on.
However, that single simple explanation is almost always a fiction. It doesn't cover everything that happened. It's a convenient explanation that avoids the complex (but valuable) work of identifying everywhere that things could have gone better.
I'll walk through a fairly simple situation that happened while I was at Amazon, and show how the single simple explanation turned out to be far from accurate.
Bad things happened.
BEEP! BEEP! BEEP!
I rolled over and pushed the silence button on my pager (Yes, Amazon still used physical pagers, at least at this time) as quickly as I could to avoid it waking up my wife. I looked at the time.
I got out of bed, and pulled up my phone. I needed to see what was going on. False alarm, or a real event?
My emails suggested that this was a real event. Multiple alarms were going off: alarms for our system, customer behavior alarms (less customer activity being recorded, a good sign that something's gone haywire), and various other health alarms. Yeah, something was going wrong.
I got my laptop booted up, and started reading the correspondence. As a senior manager, I didn't get paged into every event. This one had lasted long enough without being closed that it had escalated to me.
The on-call had been debugging the issue for 30+ minutes. They had already woken up a few engineers from our team to help. It looked like our service was having latency problems, memory problems, and throwing all sorts of errors. It was a strange kaleidoscope of things blowing up in the middle of the night.
As I finished catching up, one of them mentioned something which caught my attention.
Engineer - "Banana service pushed a change right at the start of our outage. Is there any chance at all that Banana service is causing this event?"
(For context, Banana service was not its real name. But it was a critical service that our team's service relied on. Amazon is turtles all the way down with its APIs.)
It felt like a bit too much of a coincidence to me. An Occam's razor scenario.
Before I could speak up, another senior engineer made me happy by voicing my opinion first.
Senior engineer - "No idea if it could cause an issue, but page in the Banana service team. Let's ask them to roll their change back immediately."
That's always the right answer operationally. Investigate later, fix now.
Around 10 minutes later, the Banana service changes were rolled back. Our outage immediately stopped. Whew!
We agreed with the Banana service team to figure out the next day what had broken. I wrote some brief updates to our leadership team, essentially saying "We'll figure it out tomorrow," and closed the event.