Amazon's websites and services rarely fail because of heavy customer load or an inability of Amazon's servers to scale. Rather, most major outages are caused by manual human error.
I remember that after one major incident, our SVP got a number of us together to discuss the event.
VP Sheri: "The outage was caused by a support engineer running a script which was supposed to shut down a small list of servers, but it mistakenly shut down every server -except- those servers. We're talking to that team; someone should have caught that error."
SVP Clara: "The outage was caused because it was possible for someone to mistakenly shut down the entire fleet. People will always make mistakes. I want to hear how we're going to prevent the inevitable human errors from causing this large of an impact."
We know that humans make errors. Yet our natural reaction is often to try to figure out ways to make humans not make errors. Instead, we should be working to improve outcomes when the inevitable errors occur.
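As one illustration of what "improving outcomes" could look like for the shutdown script in the story, here is a minimal sketch of a blast-radius guard. It is not Amazon's actual tooling; the names `plan_shutdown`, `BlastRadiusError`, and `max_fraction`, and the 10% default, are hypothetical and chosen only for this example. The idea is that the tool refuses any request touching more than a small fraction of the fleet, so an inverted server list fails loudly instead of taking everything down.

```python
class BlastRadiusError(Exception):
    """Raised when a shutdown request would affect too much of the fleet."""


def plan_shutdown(fleet, targets, max_fraction=0.1):
    """Return the sorted servers to shut down, or raise if the request is too large.

    fleet:        every server the tool knows about (hypothetical inventory)
    targets:      servers the operator asked to shut down
    max_fraction: largest share of the fleet one request may touch (assumed 10%)
    """
    fleet = set(fleet)
    targets = set(targets)

    # Reject names that are not in the inventory; a typo should not pass silently.
    unknown = targets - fleet
    if unknown:
        raise ValueError(f"unknown servers: {sorted(unknown)}")

    # The guardrail itself: cap how much of the fleet a single request can affect.
    limit = int(len(fleet) * max_fraction)
    if len(targets) > limit:
        raise BlastRadiusError(
            f"refusing to shut down {len(targets)} of {len(fleet)} servers "
            f"(limit is {limit})"
        )

    return sorted(targets)


if __name__ == "__main__":
    fleet = [f"host{i}" for i in range(100)]
    small_list = ["host1", "host2"]

    # The intended request goes through.
    print(plan_shutdown(fleet, small_list))

    # The mistake from the story: everything -except- the small list.
    inverted = set(fleet) - set(small_list)
    try:
        plan_shutdown(fleet, inverted)
    except BlastRadiusError as err:
        print(f"blocked: {err}")
```

The engineer can still make the same mistake, but the mistake now costs a rejected command rather than the whole fleet. A real tool would likely also need a deliberate, audited override for the rare cases where a large shutdown is intended.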
The COE Process
There's a process called a COE (Correction of Error) used frequently at Amazon. In that process, you look at an error. You attempt to identify the root causes of the error, and then come up with changes to fix them.