If Your System Never Fails, That’s the Real Failure
Investments in operations should be at a minimal level that maintains the systems in an acceptable state forever. Poor operations are clearly bad news. But perfect operations are a waste of money.
Welcome to the Scarlet Ink newsletter. I’m Dave Anderson, an ex-Amazon Tech Director and GM. Each week I write a newsletter article on tech industry careers and tactical leadership advice.
I feel like I’m on a bit of an operations kick in my writing. I’ll make sure to pivot next week :)
I started at Amazon in 2007 in the Global Payments organization. You could argue that it’s difficult to find a more important organization than the one in charge of payments.
During our first team meeting, the engineers complained that we weren't allocating enough time to improve our older systems. We were too focused on building new features, expanding in new regions, etc.
After a couple of years, I moved to the Retail Marketplace team to run Seller Central. This was the main website Sellers used to run their businesses. The profits from Marketplace sustained the rest of Amazon (until AWS started to take that job over). So this website was absolutely a key piece of infrastructure.
The engineers there were quick to point out that our systems needed some serious investments to maintain them. They consistently complained that we were neglecting those old systems.
I moved to Facebook after a few years. I learned a lot about what was different, and what was the same. At Amazon, engineers are assigned to teams and projects. At Facebook, engineers generally picked what they wanted to work on. Interestingly, cleaning up code wasn't something many people picked.
Even when engineers could choose what to work on, the most common complaint from engineers was that someone needed to spend more time cleaning up the code. Their systems required more investment in operational excellence.
Funnily enough, under two almost opposite engineer allocation models, the engineers at both companies insisted that we should spend more time making our systems operationally sound.
When I joined AWS, I heard the same thing. Devices? The same thing. Games? The same thing. I’ve literally never spoken to an engineering team that was happy with the quality of their finished product. They have always said that we should spend more time fixing their system. (Just not necessarily that specific engineer, because that work sounded awfully boring.)
I began my career as a software engineer, and I spent a significant amount of time as an engineering leader advocating for investing more in improving the operations of the systems I owned.
I’m now going to explain why it's healthy to be unhappy with the quality of your operations.
Why should we care about operational excellence at all?
When I say operational excellence (or the quality of operations), I’m generally referring to how well the software systems that support customers run. Operational excellence is about making your systems operate well.
There are two major categories of value you get from investing in operational excellence.
Customer experience: Customers expect your systems to operate properly. Slow systems or broken systems are bad for customers. How bad? Depends on your type of customer.
Engineering investments: Fragile systems are expensive to maintain. Responding to operational events (like AWS’s most recent event) is incredibly costly.
Considering the downside of having bad operations, the general arguments for investing in improving operations are clear. You have customers, and they have expectations. Those expectations include systems that continue to work, orders that continue to be fulfilled, and mobile apps that open quickly. Engineering teams prefer to spend their time building new software. If they're stuck restarting machines all day, they won't be building new software. They'll also probably quit, because that job is pretty terrible.
I've had engineers ask about the discrepancy between the obvious value of operational improvements and where the organization actually invests every year.
"Why is leadership so stupid? Can't they see we should invest more time in our operations?"
My usual response is that there are two options. You can believe that everyone senior in the company is unintelligent and doesn’t understand how to run software.
Or you can believe that you work with intelligent people (our leadership teams at Amazon in particular were pretty brilliant), and that if you don't understand their choices, you are likely missing something. That’s the choice I always went with.
And then I would explain why spending less is frequently a good thing. Even if it doesn’t feel that way.
Why you should minimize your investment in operational excellence
When you run a company, some of your investments are meant to grow your business. Other investments are made to protect your existing business.
When you spend a million dollars on advertising, you hopefully gain some amount of new business. If you invest another million dollars, you intend to gain yet more business. As long as the math makes sense and you know your expected ROI, you want to spend as much as possible, to grow as much as possible.
Operational investments are different. If your systems are poorly built and are down 20% of the time, you’ll lose all your customers. Clearly, this is bad for business. But if your systems are down only 0.1% of the time, your customer reaction greatly depends on what type of customer you have. If you’re running an AWS system like DynamoDB, you may need to reduce your downtime to 0.001% of the time before your customers stop being upset.
On the other hand, if you’re running a backend marketing email-sending platform, you can probably regularly break without anyone knowing or caring. Because your customer expectations and requirements are different.
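To make those downtime percentages concrete, here is a minimal sketch that converts them into time per year. The function name and the 365-day year are my own assumptions for illustration, not anything Amazon uses.

```python
# Illustrative only: convert a downtime percentage into hours per year.
# Assumes a 365-day (non-leap) year.
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(downtime_percent: float) -> float:
    """Return how many hours per year a system is down at the given percentage."""
    return HOURS_PER_YEAR * downtime_percent / 100


if __name__ == "__main__":
    # The three percentages mentioned in the text above.
    for pct in (20, 0.1, 0.001):
        hours = downtime_hours_per_year(pct)
        print(f"{pct}% downtime is about {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

Under that assumption, 20% works out to roughly 73 days a year, 0.1% to just under 9 hours, and 0.001% to a little over 5 minutes. Whether any of those numbers is acceptable depends entirely on who your customer is.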
With a few exceptions, you don’t gain customers by having better operations. For the most part, your systems work well enough for customers, or they don’t. For almost any business, customers award no prize for having the least downtime or the best codebase.
That can feel like a bit of a tragedy to engineers, who generally prefer perfection. If a single customer keeps getting weird messages in their mobile app when they drive out of cell service, it’s likely not worth the engineer’s effort to investigate and fix that problem.
What I’m saying is that below a certain point, your system is not good enough to support your business. Above that line, any improvements are not worth the investment.
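One way to picture that line is as a step function. The sketch below is a deliberate oversimplification with made-up numbers (the availability bar and the revenue figure are hypothetical), but it shows why the same size of improvement can be worth everything or nothing depending on which side of the line you start from.

```python
# Toy model, not real data: the business keeps its revenue only if
# availability meets the bar its customers require.
def business_value(availability: float, required: float, revenue: float) -> float:
    """Return the revenue retained at a given availability percentage."""
    return revenue if availability >= required else 0.0


def marginal_value(current: float, improved: float, required: float, revenue: float) -> float:
    """Return the extra revenue gained by moving from `current` to `improved`."""
    return business_value(improved, required, revenue) - business_value(current, required, revenue)


if __name__ == "__main__":
    REQUIRED = 99.9      # hypothetical bar set by customer expectations
    REVENUE = 1_000_000  # hypothetical annual revenue at stake

    # Crossing the line is worth the whole business.
    print(marginal_value(99.0, 99.9, REQUIRED, REVENUE))
    # Improving once you are already above the line is worth nothing in this model.
    print(marginal_value(99.9, 99.99, REQUIRED, REVENUE))
```

Real businesses are not this binary, of course. The point of the toy model is only that the payoff curve for operations flattens out, while the payoff curve for something like advertising keeps climbing as long as the ROI holds.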
This is counterintuitive, particularly to engineers who look at quality as a virtue. And I understand that feeling. High quality is largely a good thing. But it’s one that doesn’t give a business return on investment after a certain point.
The ROI of your operational investments comes down to understanding your customers and the cost of operational issues. Good enough operations are necessary. Great operations may or may not be any better. And perfect operations? The business rarely cares.



