I've joined with Ethan Evans' Discord community to help build a great place for people to come and talk about their career, interviewing, leadership and more. For a little background and my motivation behind this, you can read more here on LinkedIn.
I'm always interested in hearing more from my readers about the topics they'd like to see covered. If you're interested in a topic, please reply to this newsletter and let me know. The 5 whys was a frequent reader request, for example.
Finally, I'd like to thank the hundreds of readers who have become paid members. Your support is powerful motivation for me to continue writing this newsletter. I really appreciate it! If you'd like to support this newsletter, please click here. Thanks!
I was sitting there with my coffee, discussing project status, when my pager went off. Back in those days, we used fashionable pagers to notify us of emergencies. If I remember right, I didn't particularly want to be discussing project status, so I happily used my pager as an excuse to leave the room.
I got back to our team area, and the engineers on my team were looking a little frantic. My definition of frantic is one engineer typing as fast as they can, while three other engineers hover behind their shoulder, pointing at things on the screen. You don't usually need three helpers to read a computer screen. I asked what was up.
“The site's down. We can't figure it out. We were swapping out some hosts. Some metrics went haywire, and the alarms started going off.”
I'll later cover the root cause and how the metrics led the engineers astray. For what it's worth, the root cause itself doesn't matter for the whys. I'm just going to include some details for the technical folks in the audience.
Every operational disaster I can think of begins with a similar first step. Someone made a mistake. Pushing code to production, they pushed the wrong branch. Restarting servers, they restarted the wrong ones. Migrating data, their query missed a subset of customers.
If I had to name a single advantage Amazon has over many companies, it might be that Amazon is root cause obsessed. Employees at Amazon religiously hunt for the root causes of complex situations, and the actions necessary to fix those root causes.
The 5 whys is the name for the process of repeatedly asking "But why!?" You may not have realized it, but acting like a 2-year-old is a great way to build robust systems.
Fundamental assumption — Humans make mistakes
Amazon has an amusing advantage over other companies in recognizing this truth. I remember one team I managed was drastically understaffed when I joined the team. They had around 20 employees, but I grew the team to around 60 employees in a 5-month period.
Can you imagine how unreliable your team would be if you had hired two-thirds of it within the last few months?
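The two-thirds figure is quick to check. Here is a minimal sketch using the approximate headcounts from the story above; the function name is made up for illustration, and it assumes nobody left the team during the growth period.

```python
def fraction_new_hires(starting_headcount: int, ending_headcount: int) -> float:
    """Share of the current team hired during the growth period.

    Assumes no attrition: everyone counted at the start is still on the team.
    """
    if ending_headcount <= 0:
        raise ValueError("ending_headcount must be positive")
    new_hires = ending_headcount - starting_headcount
    return new_hires / ending_headcount


# Roughly 20 people when I joined, roughly 60 five months later.
share = fraction_new_hires(20, 60)
print(f"{share:.0%} of the team is new")
```

With 40 of 60 people being recent hires, most of the team has only a few months of experience with the system.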
This truth exists across Amazon. The number of employees in organizations consistently grows by 20-30% year after year, which means there are many teams which grow faster than this.
New employees won't know the proper processes to follow. New employees will not realize the hidden dangers of your system.
When a large operational event happens, I'll see discussions on Reddit, wondering about blame.
"Haha, I bet someone got fired over this one!"
Amazon leaders have learned one vital rule.
Humans make mistakes. It's our systems which allow the mistakes to cause damage. Therefore, our blame is not on the humans making the mistakes, but the systems we've built.
Cause and effect
The cause and effect chain begins with the obvious. In the above situation, you might start with "The site went down."