Amazon's websites and services rarely fail because of heavy customer load or the inability of Amazon's servers to scale. Rather, the majority of major outages are caused by manual human error.
I remember that after one major incident, our SVP gathered a number of us to discuss the event.
VP Sheri: "The outage was caused by a support engineer running a script which was supposed to shut down a small list of servers, but it mistakenly shut down every server -except- those servers. We're talking to that team. Someone should have caught that error."
SVP Clara: "The outage was caused because it was possible for someone to mistakenly shut down the entire fleet. People will always make mistakes. I want to hear how we're going to prevent the inevitable human errors from causing an impact this large."
We know that humans make errors. Yet our natural reaction is often to try to figure out ways to make humans not make errors. Instead, we should be working to improve outcomes when the inevitable errors occur.
The COE Process
There's a process called a COE (Correction of Error) used frequently at Amazon. In that process, you look at an error. You attempt to identify the root causes of the error, and then come up with changes to fix them.
When you write a COE document about a negative event, one key technique to follow is called the 5-Whys. Technically, you don't ask exactly 5 of them. It's about asking as many whys as you can before running out of ideas. The why questions are aimed at finding the root cause of an event, rather than the first obvious cause.
Why did the site go down?
All the servers were turned off.
Why were the servers turned off?
A script was run which shut them off.
Why was a script able to turn off so many servers?
There are no permission differences between turning off 1 server vs many.
Why did the team not notice immediately what was happening?
There were no alarms on the number of servers shutting down in a time period.
Why was someone manually shutting down servers?
Because the hardware for those servers was being replaced, and there was no UI for the hardware team to use when replacing hardware.
While forming this list of questions, you're generating a list of actions / tasks that someone can perform to improve the outcome in the future. This is the key benefit of the COE process. It focuses not on human error, but on mechanisms which improve outcomes.
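To make that concrete, here is a minimal sketch of two mechanisms the questions above point toward: a limit on how many servers a single request may shut down, and an alarm on the number of shutdowns in a time period. This is an illustration only, not Amazon's actual tooling. The names check_shutdown_request, BlastRadiusError, and ShutdownRateAlarm are hypothetical, and the 5 percent limit and the window sizes are assumed values.

```python
from collections import deque


class BlastRadiusError(Exception):
    """Raised when a request would shut down too much of the fleet."""


def check_shutdown_request(targets, fleet, max_percent=5):
    """Return the validated target set, or raise if the request is too large.

    max_percent is an assumed policy value, not a real Amazon setting.
    """
    targets = set(targets)
    unknown = targets - set(fleet)
    if unknown:
        raise BlastRadiusError(f"unknown servers: {sorted(unknown)}")
    # Always allow at least one server, even in a very small fleet.
    limit = max(1, len(fleet) * max_percent // 100)
    if len(targets) > limit:
        raise BlastRadiusError(
            f"refusing to shut down {len(targets)} servers; limit is {limit}"
        )
    return targets


class ShutdownRateAlarm:
    """Fires when too many shutdowns occur within a sliding time window."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = deque()

    def record(self, timestamp):
        """Record one shutdown; return True if the alarm should fire."""
        self._events.append(timestamp)
        # Drop shutdowns that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events) > self.threshold
```

With a guard like this in front of the script, the inverted server list from the story (every server except the intended few) would be rejected as too large before anything was turned off, and the alarm would give the team a signal if shutdowns still began to pile up. Neither mechanism depends on the engineer getting the command right.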
So when you have other events at work or at home, you should be asking yourself the 5-Whys. Why did my sales pitch not work? Why did my kid stay up too late? Why did my grant pitch fail? The first answer you choose ("Because my kid always stays up too late") is usually not a helpful one. Think harder. What is behind that answer? And what's behind that?
An outcome is a measurement of a moment in time. It's the output of the processes involved, random luck, people's actions, and time passing. There is an outcome of your website going down. There is an outcome of your kid cleaning up their dishes. There is an outcome of your book getting published.
Having a negative outcome doesn't mean that anything was faulty. You could have raised your kid perfectly, yet that little human might still write with crayon on the walls. You may have written the perfect marketing plan, but your product still might not be purchased.
Having a positive outcome doesn't mean that you've done everything perfectly either. You could have a terrible project plan, yet launch on time. You could write some horrible code, which miraculously never fails.
We put too much focus on outcomes. We look at our financial accounts too often, watching the market take them up and down. The market goes up, you feel happy. The market goes down, you feel gloomy. I gain a few subscribers, I'm thrilled. I lose a few subscribers and I'm sad.
When something goes wrong, we want to change the outcome. "I wish I had more money in the bank", "I wish my boss had promoted me". Our natural reaction is to feel negative about the outcome, which is something we can't change.
Rather than being upset over the outcomes not being what we intended, we need to focus attention on the inputs to the outcome. The mechanisms behind the outcomes. In what ways can I improve the outcomes next time?