Inside a High-Severity Outage: A Late Night Sev-2 at Amazon
A detailed narrative of how we communicated about, worked on, and resolved a technical mess we created.
Welcome to the Scarlet Ink newsletter. I'm Dave Anderson, an ex-Amazon Tech Director and GM. Each week I write a newsletter article on tech industry careers, and specific leadership advice.
Free members can read a portion of each article, while paid members can read the full article. For some, part of the article is plenty! But if you’d like to read more, I’d love for you to consider becoming a paid member!
I recently did a podcast with Gergely Orosz, the creator and author of The Pragmatic Engineer newsletter. During the podcast, one topic we repeatedly touched on was operational events. As a side note, it’s worth watching that podcast. It was fun to do!
I’ve interviewed more than a thousand people from various companies for Amazon, and perhaps the most consistent skills gap for incoming engineers and engineering managers was their depth of experience in managing operations.
Most software people know how to build features, write requirements, and at least lightly manage their project plans.
But in my experience, many engineers and engineering managers lack real-world experience with operations.
When I refer to operations, I mean operating software. At Amazon, almost all teams were complete owners of their software, including deployments, log rotation, host replacements, package updates — the works.
But it turns out that many companies don’t operate that way. (Pun intended)
I figured it would be interesting to share the experience of a relatively typical team at Amazon experiencing an operational event. While this is not a description of an exact outage event (since I wouldn’t want to leak Amazon details), this article comes from aspects of various operational events I was a part of over the years.
This article is more technical than my average article. It’s a little unavoidable. Apologies if I lose anyone. And I purposefully wrote about a more typical team (not an AWS outage), because I think it’s more relevant to the average engineer.
Thursday 5pm: T-minus ~5 hours
Hung stopped by my office at 5pm. He had his backpack on, and was clearly on his way out the door. Hung was a senior software engineer on my team. We ran the Seller Central website, and some related Amazon Marketplace services which supported Amazon’s 3rd party seller business. (As a side note, I did actually run the Seller Central website, which is why it makes for a good example. Its operational events were a big deal, and I had plenty of examples to think through for this article.)
“Hey Dave! I’m heading — Ooh, you got new mints!” he said excitedly, grabbing a handful of my wintergreen mints from my desk.
I was well known for the mints on my desk, but I’d run out a few days prior. I had finally restocked that afternoon. Over the last few days, it was funny how many people had poked their heads into my office, only to look disappointed and wander away awkwardly.
“Yeah, it turns out Amazon does deliver to our building. Fancy that,” I said.
Since it was 5pm, it was finally the end of my eight straight hours of meetings, and it was time for me to get some personal work done. I had just started reviewing someone’s project proposal document. But I liked that my team felt like chatting with me, so the interruptions didn’t bother me.
Hung popped a mint into his mouth, and then talked around it. “Just wanted to let you know that Jose is doing our code push tonight. It’s his first. I think he’ll do fine, but he has my number in case.”
Our team’s standard code deployment process was to have the on-call push any code changes out at around 10pm, which was our compromise between relatively low customer traffic and a reasonable time for an engineer to do some work before bed.
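To make that concrete, here’s a minimal sketch of what a pre-push window check could look like. This is purely illustrative: the window boundaries and the script itself are hypothetical, not the team’s actual tooling.

```python
from datetime import datetime, time
from typing import Optional

# Hypothetical low-traffic deployment window (illustrative only).
WINDOW_START = time(21, 30)  # 9:30pm local
WINDOW_END = time(23, 30)    # 11:30pm local

def in_deployment_window(now: Optional[datetime] = None) -> bool:
    """Return True if the current local time falls inside the push window."""
    current = (now or datetime.now()).time()
    return WINDOW_START <= current <= WINDOW_END

if __name__ == "__main__":
    if not in_deployment_window():
        raise SystemExit("Outside the low-traffic window; hold the code push.")
    print("Inside the window; safe to start the code push.")
```

The exact hours matter less than the principle: push when customer traffic is low, but early enough that the on-call can watch the alarms and still get to bed.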
Jose was a junior engineer with about a year of experience on the team. He had finally joined our on-call rotation, which meant he was now also responsible for doing said deployment.
“Sounds good,” I said. I wasn’t overly concerned. Our code promotion process was well documented, and we didn’t exactly run medical software. The worst-case scenario was that some merchants would get annoyed at us. It’s a good philosophy to have in life. I like to keep in mind that so many people have much more serious jobs than us tech workers.
Hung waved goodbye, and headed out the door. I went back to my document review.
Thursday 10:32pm: Event + 32 minutes
I was lying in bed reading The Name of the Wind when my phone rang. My heart sank. That’s never good. The only phone calls you get at night are work or personal emergencies.
“Yes?” I said into my phone.
“Hey, it’s Hung,” the voice on the phone said, sounding dejected. “So the code push went bad, and we have a partial outage. I’m trying to get the site fully back up, but no luck so far. I thought I’d better call you.”
“Ok, thanks Hung, I’ll join the call shortly,” I said, and then hung up.
For context, I knew there would be a conference call open for the event because that’s just what you do. So instead of wasting anyone’s time asking more questions, I just wanted to get my laptop running and get on the call.
Thursday 10:33pm: Event + 33 minutes
A few moments later, my laptop was booting on my kitchen table, I had water heating on the stove for some tea, and I was dialing into the conference call. I’d found the call information in my email inbox.
The phone made a boop noise, and I was on the call.
“It’s me,” I said into the phone.
“Hey Dave,” Hung said. “We also have Bertie and Jose on the call.”
Jose, as I already mentioned, was a junior engineer on our team. Bertie was a senior engineer from a peer team of ours. Bertie, like many other good software engineers, had attached herself to the high severity ticket paging system. This meant she got an email for any high severity event in our broader organization, which allowed her to join in when she felt interested. It wasn’t uncommon for unrelated engineers or managers to join in on high severity events to help out.
“Hey everyone,” I said. “My laptop just booted up, so let me skim the ticket.”
I read through the ticket. It essentially said that Jose had done the code promotion, and we had immediately received a Sev-2 (high severity) page as parts of our site had begun to see outages.
Jose had stopped the deployment and executed the rollback steps, but the alarms kept tripping. Thankfully, he’d called Hung at that point. Hung had verified that the rollback appeared to have been successful, but he couldn’t see why the site continued to malfunction. He’d tried a few minor things, then noted in the ticket that he was calling me.
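For anyone who hasn’t done this kind of verification, here’s a rough sketch of the sort of check Hung might have run to confirm the rollback actually took: ask each host what version it’s running and whether it’s healthy. Everything in it is hypothetical, including the host names and the /version and /health endpoints; it isn’t a description of Amazon’s real tooling.

```python
import urllib.request

# Hypothetical values for illustration; a real team would pull these from
# its deployment tooling rather than hard-coding them.
LAST_GOOD_VERSION = "1.42.0"
HOSTS = ["app-host-1.example.internal", "app-host-2.example.internal"]

def fetch(url: str) -> str:
    """Return the body of a small GET request as stripped text."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode().strip()

def verify_rollback() -> bool:
    """Check that every host reports the last good version and a healthy status."""
    ok = True
    for host in HOSTS:
        version = fetch(f"http://{host}/version")  # hypothetical endpoint
        health = fetch(f"http://{host}/health")    # hypothetical endpoint
        if version != LAST_GOOD_VERSION or health != "OK":
            print(f"{host}: version={version}, health={health} -> not rolled back")
            ok = False
        else:
            print(f"{host}: rolled back and healthy")
    return ok

if __name__ == "__main__":
    if verify_rollback():
        print("Rollback looks good, but keep watching the alarms.")
```

The catch, as Hung was discovering, is that a check like this can pass while the site is still broken.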
“Ok, I see that we’re getting errors,” I said. “Do we have a clear understanding of exactly what is broken?”
“Not really,” said Hung.
“Our team’s services are up, but your website seems intermittently down,” Bertie said. “Our services are working, but getting less traffic.”
“Did anyone send out an outage notification yet?” I asked.
“Not yet,” said Hung.
“Ok,” I said, ready to start things moving. “I’ll send out the outage notification. Hung, keep trying to figure out how to get the site working. Jose, please page in the rest of our team. And Bertie, if you could take a look around, it’d be appreciated. Everyone, please keep putting regular updates into the ticket.”
A chorus of acknowledgements, and we all got to work. In general, most high severity calls have a single leader calling the shots. Usually, that person tries not to do too much investigating personally, but instead coordinates to make certain all necessary work is taking place.
And I’ll point out that no one was assigned to find the bug in our latest code release. It’s a part of proper operational handling. You roll back your changes, and try to figure out what went wrong later. The order of operations is always fix first, investigate later.