The on-call for a software product is the person or team responsible for responding to operational events.
Operational events are events which impact the operations of a system. The system may slow down, start returning the wrong values, or become unavailable. This may impact customers directly, as in the case of a website. It may impact customers indirectly, as in the case of a backend billing system.
The on-call is responsible for knowing enough to get the system working properly in the short term. They patch short-term problems to remove the immediate negative impact to systems.
The general goal for the on-call is to keep things working while an engineer implements the long-term solutions.
The primary exception to my text above is that on-calls often do required maintenance as well. For example, you may need to upgrade the version of Python on all your hosts. In this case, the on-call is performing the long-term solution by upgrading Python. I'm going to ignore this use case for my discussion below.
The two main models of on-call
While there is a long list of variations, there are two major models of on-call.
Owner-operators. This is sometimes called DevOps, but I've found that word to be overloaded. This is the model where the same engineering team writes and operates their code. At Amazon, we'd call it end-to-end ownership.
Operations teams. This is the model where there are dedicated staff members who maintain systems. These employees are hired to deal with operational problems full time.
As a side note, before you join a company as a software engineer, you will want to ensure that you have a good understanding of how they handle on-call responsibilities.
Amazon uses an owner-operator model for most teams and functions. I'll walk through the impacts of this model, for better and for worse. Likely due to my time at Amazon, my personal bias is for the owner-operator model. While acknowledging the sometimes painful consequences, I appreciate the end results.
Impact of owner-operator — Natural consequences
This is my favorite reason for using this model. Most issues in production come from code changes. We would be reminded of this every Q4 at Amazon. We would have a code lockdown, and mysteriously, our operational events would fall by an order of magnitude.
There are many reasons someone might write buggy code. They might be rushed, lazy, poorly skilled, or unfamiliar with the language. They might have made poor assumptions about a dependency, or not considered scaling.
I distinctly remember a college hire sending one of their first code reviews to an experienced engineer on the team. The experienced engineer asked them to fix an edge case in their code.
College hire: "Oh, that's an edge case, that will probably never happen."
Experienced engineer: "First, at our scale, edge cases always happen. Second, when it breaks, it will happen at 3am and wake me up. Fix it."
As a contrast, I was once talking to a customer of AWS about their system. We were observing some strange patterns in their host utilization. This pattern had been going on for a couple of years.
We eventually figured out what was going on. They had a massive memory leak in their software. Their engineers didn't know about it, although it had existed for years. Their operations team had simply learned to reboot all hosts every 3 hours to resolve the issue. Crazy.
When team members know that they and their close team members are personally maintaining the code they write, there is a natural pressure to ensure a higher quality product.
Impact of owner-operator — Deep operational understanding
I've talked to engineers who had the support of an operations team. They often had a mind-boggling lack of understanding of how their systems were used in production.
How much were their monthly hardware costs? They didn't know.
Were they CPU bound, memory bound, IO bound? No idea. The other team handled that.
If they wanted to cut their hardware costs, what would they need to improve? Impossible to say. They didn't know how much hardware they had, or what factor required them to scale more.
I believe you can only write high quality, complex software if you fundamentally understand how your software operates in production at scale.
A positive natural consequence of personally operating and scaling your software is that you understand how it works. There are other ways to understand your software, but this mechanism is simple.
We had a multi-hundred thousand dollar monthly hardware cost on one of my teams. As our customer base increased, the engineers on the team would regularly scale our fleet to avoid causing customer impact. While the product could easily justify it, the cost felt excessive for what the software did.
As the engineers themselves were responsible for scaling the fleet, they became curious about why the software was so greedy. They added a task to the next sprint to investigate.
The engineers knew that the fleet had very low memory and CPU utilization. The issue was that the machines were IO bound. Being IO (input / output) bound means that the machines would max out their ability to transmit data in and out of the host. The hosts were mostly idle, but were transmitting huge amounts of data.
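To make the "which resource are we bound by?" question concrete, here is a minimal sketch in Python. It is not the tooling my team used. The HostUtilization type, the binding_resource function, the 0.85 saturation threshold, and the sample numbers are all hypothetical, chosen only to illustrate the idea of finding the one resource that forces you to add hosts.

```python
from dataclasses import dataclass


@dataclass
class HostUtilization:
    """Peak utilization of one host, each as a fraction from 0.0 to 1.0."""
    cpu: float
    memory: float
    io: float  # fraction of the host's data-transfer capacity in use


def binding_resource(host: HostUtilization, saturation: float = 0.85) -> str:
    """Name the busiest resource if it is saturated, otherwise 'none'.

    The 0.85 saturation threshold is an illustrative assumption.
    """
    usage = {"cpu": host.cpu, "memory": host.memory, "io": host.io}
    name, value = max(usage.items(), key=lambda item: item[1])
    return name if value >= saturation else "none"


# Hypothetical numbers: a mostly idle host that is saturating its IO.
host = HostUtilization(cpu=0.10, memory=0.15, io=0.95)
print(binding_resource(host))  # prints "io"
```

If the answer is "io" while CPU and memory sit near idle, adding hosts only buys more network capacity, which is a hint that the amount of data being moved deserves a closer look.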
When you're working with microservices (small web services), you often have dozens of dependencies, thousands of objects being serialized/deserialized, and small details can be missed. What they discovered in their investigation was a simple mistake.
One object we were using (and sending around to various services) was not customized for our use case. Rather than having the minimal set of data necessary for the operation, our software was sending around the entire object. That entire object included some binary data (related images and sometimes video).
Essentially, we would say, "What permissions do I need to access this content?", and the service would say, "Oh! I can help you with that! Here's the history of everyone that ever edited this content, several neat images, a fantastic video, and the permissions."
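The shape of that mistake can be sketched in a few lines of Python. This is an illustration under my own assumptions, not the actual service code: the Content class, its field names, and the two response functions are hypothetical, and a long string stands in for the embedded images and video.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Content:
    """Hypothetical stand-in for the shared object described above."""
    content_id: str
    permissions: List[str]
    edit_history: List[str] = field(default_factory=list)
    media_base64: str = ""  # placeholder for embedded images and video


def full_response(content: Content) -> str:
    """The mistake: serialize the entire object for every request."""
    return json.dumps(asdict(content))


def permissions_response(content: Content) -> str:
    """The fix: serialize only what the caller asked for."""
    return json.dumps(
        {"content_id": content.content_id, "permissions": content.permissions}
    )


content = Content(
    content_id="doc-1",
    permissions=["read"],
    edit_history=[f"editor-{i}" for i in range(100)],
    media_base64="A" * 1_000_000,  # roughly 1 MB of fake media
)

full = full_response(content)
minimal = permissions_response(content)
print(f"full: {len(full)} bytes, minimal: {len(minimal)} bytes")
```

Both functions answer the permissions question correctly, which is why a code reader would not flag the first one. The difference only shows up in how many bytes cross the wire on every call.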
It was subtle but clumsy coding. It was hidden in layers of abstraction, which made it difficult to recognize the error. Someone reading the code wouldn't immediately notice the issue. Someone operating the system might not recognize that something was fundamentally wrong. By combining those two jobs, the team was able to save millions every year.