The 5 Impacts of the Owner-Operator Model Regarding On-call, Operations, and Incentives.

By owning software end-to-end, at the expense of occasional annoying operational tasks, engineers are encouraged to build and maintain software the right way.

Norway. Photo credit: Me

Welcome to my newsletter! I'm Dave Anderson, an ex-Amazon Tech Director and GM. I write Scarlet Ink, a weekly newsletter on tech industry careers and tactical leadership advice.

Free members can read some amount of each article, while paid members can read the full article. For some, part of the article is plenty! But if you'd like to read more, I'd love you to consider becoming a paid member. All of my articles are intended to be evergreen (readable and valid forever). Some weeks I have fresh content; other weeks I'll update/rewrite something from 4+ years ago because I want to keep the quality of all articles high.

There are parts of Amazon that didn't sit well with me. For example, it was culturally appropriate to look the other way when leaders were casually cruel to employees with stupid ideas. Because "we should be nice" wasn't a leadership principle, but "are right a lot" was.

That being said, I loved many other aspects of how Amazon operated. A big one was how we fully bought into the owner-operator model for software teams. With rare exceptions, if you built software, you also supported it.

If you work in software, I'd certainly say it's important to know how your company handles support. Neither model is objectively better or worse, but it's a major factor in what your work life will look like.

The two main options for software support.

While there is a long list of variations, there are two major models of how companies deal with their software support.

Owner-operators.

This is Amazon's model. We frequently called it end-to-end ownership. An engineering team was responsible not just for requirements, design, build, and deploy, but also for support. This meant you weren't just building features. You would also decide what hardware your software would run on, what alarms to set, how frequently your health checks should run, what constituted a big enough emergency to wake up your VP, how to message customers in the event of an outage, etc.

Operations teams.

This is the model taken by many large tech companies. In this case, there are employees who are dedicated to handling software support. They're usually called the operations team.

At a high level (with significant variation company by company), the feature teams build their software, and then push it in the general direction of production. Then the operations team takes over, supporting the software in production. How that handoff works (and whether it happens before or after deployment) varies.

What do you mean by software support and operations? Terminology time!

Yeah, let me briefly cover what I mean in case there's any confusion. And because it's fun.

When software runs on the internet, it's operating on the internet. This is why supporting software is called operational support, or "ops" for short, as in "Ops sucked last week, I had to wake up like 9 times."

When your software is operating, many things can go wrong. When something goes wrong, it's called an operational event.

An operational event can mean that your website is going slow, or returning the wrong values, or is absolutely unavailable. These events may impact customers directly (as in the case of a website disappearing), or may impact customers indirectly (as in the case of a billing system not billing customers on the right date).

In the case of owner-operator teams, the main support work is usually a rotating job. This rotating job is called the on-call. The on-call position usually passes to a different engineer (or engineers) every week. In the case of Amazon, you would typically be responsible for being on-call 24/7 for one week every 7-10 weeks. But this varied greatly based on the team, and other companies handle the owner-operator model differently.
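To make the rotation concrete, here is a minimal sketch of a weekly round-robin schedule. The team names, start date, and the `weekly_rotation` helper are all made up for illustration; real teams usually rely on a paging tool and handle swaps, vacations, and secondary on-calls, none of which is modeled here.

```python
from datetime import date, timedelta


def weekly_rotation(engineers, start, weeks):
    """Return (week_start, engineer) pairs, one engineer per week, round-robin."""
    return [
        (start + timedelta(weeks=i), engineers[i % len(engineers)])
        for i in range(weeks)
    ]


# Hypothetical 7-person team: each engineer is on-call one week in every seven.
team = ["ana", "bo", "chen", "dee", "eli", "fay", "gus"]
schedule = weekly_rotation(team, date(2024, 1, 1), 9)

for week_start, engineer in schedule:
    print(week_start.isoformat(), engineer)
```

With seven people, the first engineer comes back up in week eight, which lands at the short end of the 7-10 week range mentioned above. A bigger team stretches the gap between shifts.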

The on-call would be responsible for triaging emergencies and handling small to medium problems on their own. They would need to know enough to diagnose an issue and determine if more people needed to be involved. The expectation was that they would at least apply a short-term patch to remove the negative impact on customers.

The general goal of the on-call position is to isolate the random operational interruptions and keep things working while the rest of the team can focus on long-term building.

When not dealing with emergencies, on-calls also tend to do required operational maintenance. This might mean rotating security keys, or upgrading a version of Python in production, or updating all the alarms to meet a new system SLA.

That all said, let's walk through the impacts (good and bad) of using the owner-operator model.

1. Natural consequences and incentives.

This is my absolute favorite reason for using this model.

Every Q4 at Amazon (while I was there), we had some form of a code lockdown. Essentially, as we approached peak days (like Black Friday), there would be a preemptive code freeze to avoid issues.

And this wasn't just a matter of precaution. Amazon-wide, our operational events fell by an order of magnitude during the freeze. Large organizations that had 40 or 50 major events a week suddenly had only 2 or 3 major events in those peak weeks.

This is because the vast majority of operational events are not random. They're related to a code change. If you stop making changes, problems stop happening.

What this means is that the majority of operational issues came from an engineer writing buggy code. Why would they do this? Are they incompetent?

Nah, they're not bad people. But they might have been rushed. Or lazy. Or were new to the team and had a mistaken assumption. Or they didn't realize how their system scaled.

Imagine that you're in charge of making a large organization create fewer operational issues (without simply locking down code changes).

A lecture titled "Stop building bad software!" probably won't work. How can you create a culture that encourages caution?

I distinctly remember a college hire sending one of their first code reviews to an experienced engineer on the team. The experienced engineer asked them to fix an edge case in their code.

The college hire responded, "Oh, that's such a weird edge case. That will probably never happen."

The experienced engineer sighed. "First, at our scale, edge cases always happen. Trust me, we've seen it all. Second, when it breaks, it will happen at 3am and wake up the on-call. Fix it."

The consequences of poor software directly hit the people responsible for writing the code. Any value in a lecture about code quality from some senior leader pales in comparison to the thought of waking up your peer in the middle of the night because you were lazy.

As a comparison, when I worked in the AWS organization, one of our customers had an operations team model at their company. In AWS's internal aggregate utilization reports, we'd noticed a very strange pattern of spiky utilization. It had been going on for years, but we eventually decided to mention it to this customer. We figured something didn't look right.

We helped them debug, and eventually figured out what was going on. They had a massive memory leak in the software of their main product. The engineers had no idea it existed, although it had been there for years. Why? Because their operations team had learned that restarting their AWS instances every 3 hours resolved the problem. So they'd long since created automated restart scripts, which had been running for years.

And yes, while this did "solve" the problem for them, it made their hosting costs higher than they needed to be because continually restarting instances isn't efficient.
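If it helps to picture why the utilization looked spiky, here is a toy simulation of a leaking process with and without a scheduled restart. Every number in it (baseline, leak rate, hourly sampling) is invented for illustration, and `simulate_memory` is a hypothetical helper, not anything the customer or AWS actually ran.

```python
def simulate_memory(hours, baseline_mb, leak_mb_per_hour, restart_every_hours=None):
    """Return hourly memory usage (MB) for a process that leaks at a steady rate.

    If restart_every_hours is set, the process restarts on that interval and
    its memory drops back to the baseline.
    """
    usage = []
    uptime = 0
    for _ in range(hours):
        if restart_every_hours is not None and uptime >= restart_every_hours:
            uptime = 0  # scheduled restart: leaked memory is released
        usage.append(baseline_mb + leak_mb_per_hour * uptime)
        uptime += 1
    return usage


# Assumed values: 500 MB baseline, 200 MB leaked per hour, restart every 3 hours.
with_restarts = simulate_memory(9, 500, 200, restart_every_hours=3)
without_restarts = simulate_memory(9, 500, 200)

print(with_restarts)     # sawtooth: climbs, resets, climbs again
print(without_restarts)  # climbs until something falls over
```

The restart version never crashes, so nobody on the feature team ever gets paged. The leak just shows up as a sawtooth on a graph that only the operations team (or, in this case, the hosting provider) looks at.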

When team members know that they and their close team members are personally maintaining the code they write, there is a natural pressure to ensure a higher quality product. No lecture is needed.