Moving from a GM/Director position at Amazon to writing my own newsletter has prompted a lot of questions. One of the most common is why I chose to do this, and whether I expect the newsletter to grow to exceed my compensation as a Director. In other words, is this going to be a profitable decision for Dave?

In short, senior leaders at Amazon make a lot of money. I don't project any future where my compensation for writing remotely approaches the pay for working for a large corporation. That's ok! I made this as a lifestyle choice. I cooked breakfast for my kids this morning, and I'm usually available when they come home from school. As soon as I post this article, I plan to spend an hour in the garage exercising. Saying 'enough' is hard when my peers continue to earn gigabucks, get promotions, and take exciting new roles. Yet we all need to reach 'enough' someday.

On the topic of making money, I've had hundreds of people sign up for paid memberships. It's awesome to have my work appreciated by so many people. Big thanks to you all! If you want to support my work and find these articles helpful, please consider signing up for the yearly membership rather than monthly. The transaction fees on monthly memberships end up sending a higher percentage to Stripe than to my bank account. I discounted the annual membership to account for my preference. Thanks all!

It's a challenge to monitor the behavior of hundreds of millions of customers. Amazon has become a metrics-driven company by necessity. Each leader learns how to craft their metrics dashboards to look at their customer base from many angles.

If you're looking at the latency (page load time) metrics, you might look at the 10% fastest, 10% slowest, average, latency segmented by region, latency for logged in vs not logged in customers, and the list continues.
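As a sketch of that kind of segmentation, here's one way to slice latency samples by region and compute the fastest 10%, slowest 10%, and average. The data and field names are hypothetical, not Amazon's actual tooling:

```python
from statistics import mean

# Hypothetical latency samples: (region, logged_in, latency_ms)
samples = [
    ("us-east", True, 120), ("us-east", False, 340),
    ("eu-west", True, 95),  ("eu-west", False, 410),
    ("us-east", True, 2100), ("eu-west", False, 180),
]

def summarize(latencies):
    """Average of the fastest 10%, slowest 10%, and overall for one slice."""
    ordered = sorted(latencies)
    cut = max(1, len(ordered) // 10)   # size of each 10% tail
    return {
        "p10_fast_avg": mean(ordered[:cut]),
        "p10_slow_avg": mean(ordered[-cut:]),
        "avg": mean(ordered),
    }

# Segment by region; the same loop could key on logged-in status instead.
by_region = {}
for region, logged_in, ms in samples:
    by_region.setdefault(region, []).append(ms)

for region, ms_list in sorted(by_region.items()):
    print(region, summarize(ms_list))
```

The point is less the arithmetic than the habit: the same raw samples support many different slices, and each slice can surface a problem the others hide.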

With the right list of dashboards and metrics, you can discover many subtle problems.

Yet there are individual issues which metrics will never display. The customers experiencing these issues fall below the radar of broad scans. You can see a problem impacting a million customers, but not one impacting only a hundred.

Taken by themselves, they may not appear on the surface to be a large enough issue to warrant attention. Yet they are often a signal of deeper problems, or an area of inadequate monitoring.

The Marketplace availability program

I ran the Marketplace Seller Central availability program for years. This means I was responsible for reporting on (among other things) the availability of tools for Sellers using Seller Central.

I learned early on to not rely exclusively on our metrics to identify issues. We had a tool available which listed every single error message Sellers received on the website. Considering the volume of Sellers and website usage, we couldn't have a human look at every Seller error. However, I made a habit of personally investigating a few random errors every week when I wrote my weekly operations report.

It was an excellent learning process, particularly as a manager, to investigate individual Seller issues. I would click on the error, and view the error the Seller received. I would then use our internal tools to trace backwards through what specifically went wrong.

In one example I remember, I saw an error on one specific Seller tool. Let's pretend it was the “view recent shipments” tool. I saw a fairly generic error message, but I also saw that the page had taken exactly 30 seconds to load. Thirty seconds is a suspicious amount of time, because round numbers often mean human involvement. Thirty seconds indicated a timeout value.

I was suspicious, so I searched through the logs for that page, and found a long list of Sellers with 30 second page load times on that specific tool. Yet more suspicion.
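A minimal sketch of that kind of log search, assuming hypothetical error-log records with a page name and load time in milliseconds (none of these field names come from Amazon's actual tooling):

```python
from collections import Counter

# Hypothetical error-log records: (seller_id, page, load_time_ms)
error_log = [
    ("S-1001", "view-recent-shipments", 30000),
    ("S-1002", "view-recent-shipments", 30012),
    ("S-2001", "order-reports", 1240),
    ("S-1003", "view-recent-shipments", 29998),
]

TIMEOUT_MS = 30_000
TOLERANCE_MS = 50  # load times this close to the timeout are suspicious

def suspected_timeouts(records):
    """Count, per page, how many requests landed right at the timeout value."""
    hits = Counter()
    for seller_id, page, ms in records:
        if abs(ms - TIMEOUT_MS) <= TOLERANCE_MS:
            hits[page] += 1
    return hits

print(suspected_timeouts(error_log))
```

A page where many requests cluster right at 30 seconds is almost certainly hitting a timeout rather than failing randomly, which is exactly the pattern that justified digging further.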

I looked at the list of Sellers who encountered the problem. They were all fairly large Sellers, larger than the average Seller. This, combined with hitting a timeout value, almost certainly meant that this was an issue related to volume. Something wasn't scaling.

I contacted the team responsible for the tool, and explained the situation. With the transaction IDs I could provide and the patterns I explained, it took them only a few minutes to realize what was happening.

This specific tool could not handle Sellers with thousands of recent shipments. It was not built to handle large scale. This led us to question why we didn't see the scaling issue in our metrics, since it should have been breaking more often. We looked at the usage trends, and talked to a few Sellers about it. It turns out that Sellers had realized a while ago that for large businesses, the tool didn't work. There were alternative ways of viewing shipments, so large Sellers had built a habit of not using the tool.

Our customers had adapted to our problems, but because of that adaptation, we never saw the behavior in our metrics. Only by looking at anecdotes were we able to find a relatively serious issue, and resolve it.
