The 5 Whys in Action: How to Turn Human Mistakes Into Lasting Improvements
When humans inevitably slip, the 5 Whys uncover the real fixes hiding beneath the surface. The 5 Whys expose our weak systems, not our weak people.
Welcome to the Scarlet Ink newsletter. I’m Dave Anderson, an ex-Amazon Tech Director and GM. Each week I write a newsletter article on tech industry careers and tactical leadership advice.
Let’s go back in time a bit. I was sitting in an Amazon conference room, drinking perhaps my fourth coffee. I used to drink a lot more coffee before I had my caffeine overdose situation.
As a slight digression, drinking too much coffee might make it difficult to get a good night’s sleep. Drinking far too much coffee, like 10+ cups in a day, might lead to you feeling like you’re being electrocuted for 6 hours straight. Not cool, dude, not cool. Ever since, I’ve carefully limited myself to two coffees a day.
Anyway, we were in the middle of some boring meeting, discussing project status or something similar. Then my pager went off. Back in my early days at Amazon, we carried physical pagers. Those pagers were ancient tech even then, but reliable. I didn’t particularly want to be in that conference room, so I happily used my pager as an excuse to leave the room.
However, since I used that excuse, I felt like I should go figure out what was going on. My team could usually handle emergencies, but I decided to check on things.
When I got to our team area, the engineers on my team looked a little frantic. What’s funny in these types of emergencies is that you rarely have all your team members at their own computers. When they’re really confused and frantic, what you end up with is one engineer typing as fast as they can, while three other engineers hover over their shoulder, pointing at things on the screen. In normal situations, you don’t need three helpers to read a computer screen and point at things.
I asked what was up. I’m always tempted to say, “What’s up, Doc?” - but the engineers on my team are too young to know Bugs Bunny.
“The site's down. We can't figure it out. We were swapping out some hosts. Some metrics went haywire, and the alarms started going off.”
I’ll talk later about the actual root cause and how those metrics tricked the engineers. But the more important point is the approach we took to investigate the outage and to prevent similar issues from ever happening again.
One of the things I think Amazon does really well is that it gives engineers and managers deep hands-on experience with operations. Instead of relying on a full-time operations team, our builders also operate their own software. This gives us insight into not just how systems work, but how they fail.
One fascinating observation is that almost every operational disaster I can think of began with a similar first step. A human made a mistake.
Someone wanted to push code to production, but chose the wrong branch, or didn’t push the configs, or pushed the wrong configs.
Someone wanted to restart servers, but they restarted the wrong ones, or did it too fast, or used the wrong script.
Someone wanted to migrate data, but their query missed a subset of customers, or didn’t filter out the test customers, or deleted the data instead of migrating it.
Humans make mistakes. Now, if I wanted to name one solid cultural advantage Amazon has over many companies, it’s that Amazon is root-cause obsessed. Employees religiously hunt for the root cause of complex situations, and catalog the actions necessary to fix those root causes.
The 5 Whys is the name of a core process used at Amazon to repeatedly ask, “But why?!” when something goes wrong. Amusingly, it’s also how 2-year-olds act, probably for the same reason. To understand.
Humans make mistakes.
A central part of Amazon’s advantage over other companies is recognizing this as a fundamental truth you can’t avoid.
One team I managed was drastically understaffed when I came in as the manager. We had around 20 employees. Five months later, we had more than 60 employees.
We were running a slightly fragile live service. Can you imagine how many mistakes get made when you’ve hired two-thirds of your team within a few months? It’s inevitable. Most of the team was new, and they fumbled.
New employees don’t know or remember the proper process. They’re not practiced in exactly how to avoid mistakes. They don’t know where the hidden dangers are in the systems they’re touching.
Now, this isn’t just about being new. We can all be tired, forget a step, use the wrong script, become impatient, or not realize that a co-worker changed something that required changing our normal processes.
Now when the internet notices an operational event has happened, you’ll always see the Reddit thread where people discuss blame.
"Haha, I bet someone got fired over this one!"
Except, at least at Amazon, we know one critical rule. Humans always make mistakes. It’s our systems (software, processes, infrastructure, etc.) that can either prevent or allow damage from those mistakes. Our blame (and our solutions) can’t be about the humans making mistakes. They have to be about the systems we’ve built. Because humans will always make mistakes. It’s how you handle those mistakes that defines your results.
And the 5 Whys is how you improve those systems.