We're frequently encouraged to trust our co-workers. Rodolfo says that his program hit its revenue targets, and it's socially expected that we congratulate Rodolfo. Saying, "Prove it, Rodolfo!" could certainly rub people the wrong way.
However, that's what I'm going to suggest you do. Not specifically to Rodolfo, my invented Spanish product manager who loves paella and Real Madrid, and not as a rude and aggressive declaration. But I do want you to model the skepticism behind that statement. You should be skeptical. Specifically, I'm going to cover why you should question metrics, processes, and assumptions.
Why is it important to be skeptical? Because we build on top of those unquestioned things. We prioritize projects because our metrics say it's a good idea. We spend weeks executing detailed process steps, with the belief that the added time and expense is essential. We change our entire organization's trajectory based on some critical assumptions.
Our job as leaders is to keep a healthy skeptical mind. If we see that we're building upon something we haven't recently verified, then it's our duty to question it.
Be skeptical of metrics
We drive many aspects of our companies on metrics. We watch error rates to determine if there's a customer-impacting problem. We watch order rates to observe if our sales are working. We watch activity metrics to understand customer behavior.
However, given the power and influence of these metrics, putting too much trust in them can be dangerous.
At one point at Amazon, I was responsible for the operations of a wide variety of features for Marketplace Sellers: everything from managing their orders, to updating inventory levels, to customer service.
While creating a pretty report for our leadership team to explain the error rates of various features, I stack-ranked all major features by their error rates. Two major functions stood out. One had an extremely high error rate, around 5x the normal error rate of features. The other had an almost zero error rate, which was impressive for such a high-volume (and complex) feature.
What was particularly interesting was that the latter feature had the highest error rate in my last report. Somehow they'd dropped like a rock! Quite impressive. I knew they'd get quite a bit of credit for it once I released my report.
But I was skeptical. You know what I'm super skeptical of? Outsized success. It seemed too good to be true. I know it's a character flaw, but it's served me well over the years.
So, I looked at the data. How did I approach data verification in this case?
- I looked at specific errors which showed up in my report, reading the actual error log. That enabled me to recognize what was reported as errors.
- I looked at the latency of page loads because very fast or very slow page loads often tell an interesting story about the customer journey.
- I looked at page load logs, to see what normal successful page loads look like.
- I used some web browser tricks to load components of the feature one at a time.
What was the result?
Well, it turns out that the last time I sent out my report, their manager had been chided pretty harshly by an executive. That manager had asked an engineer on their team to quickly get the error rate under control before my next report was sent out.
The engineer, with or without their manager's knowledge (I'm not sure), decided to achieve their manager's request in the fastest way possible. When any error happened on their pages, they identified the error, quietly recorded it in a non-error log, and then moved on. They stopped reporting errors as "feature errors" because they wanted to avoid being called out on those errors. Essentially, they just hid their errors from other teams.
What's worse is that because their perceived error rate had dropped so dramatically, there was zero institutional pressure to resolve the real problems. As a result, they no longer had any engineers working on their real customer issues.
It's the technical version of cooking the books. It was somewhere deep into dishonest and dangerous territory. And it was only identified because I didn't blindly trust their metrics.
How did I deal with it? I quietly went to the manager and explained that their solution wasn't acceptable on many levels, and that they needed to quickly revert their change. I'd re-run our report, and send the honest report to our leadership. They'd have to explain away the high error rates, but wouldn't need to explain their dishonesty. Getting someone in big trouble (or fired) over a bad judgement call didn't feel quite right, unless it came up again (which it didn't).
That other feature, the one with a really high error rate? Since I was already investigating the data, I looked into their error rates too. It turns out that they were being hit by an internal availability script (checking if their feature was up) from many machines every second of the day. Every one of those queries threw an error. The actual customer-facing error rates were much lower. We fixed their reporting (and those scripts), and they disappeared from the top of that report.
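To make that second problem concrete, here is a minimal sketch of how synthetic availability checks can inflate an error rate. It is not the reporting code we actually used. The `Request` record, its `source` tag, and the numbers are all made up for illustration, and it assumes your logs let you tell probe traffic apart from customer traffic.

```python
from dataclasses import dataclass


@dataclass
class Request:
    source: str     # hypothetical tag: "customer" or "healthcheck"
    is_error: bool  # whether the request was logged as an error


def error_rate(requests):
    """Fraction of requests that errored; 0.0 for an empty list."""
    if not requests:
        return 0.0
    return sum(r.is_error for r in requests) / len(requests)


def customer_error_rate(requests):
    """Error rate after dropping synthetic availability checks."""
    return error_rate([r for r in requests if r.source == "customer"])


# Made-up traffic: 900 customer requests (9 of them errors),
# plus 100 availability probes that all throw an error.
requests = (
    [Request("customer", False)] * 891
    + [Request("customer", True)] * 9
    + [Request("healthcheck", True)] * 100
)

raw = error_rate(requests)                # every logged request
customer = customer_error_rate(requests)  # customer traffic only
print(f"raw: {raw:.1%}, customer-facing: {customer:.1%}")
```

If the two numbers are far apart, the report is measuring your monitoring rather than your customers.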
Turns out that multiple aspects of that data and report were flawed.
To question metrics in general, here are a few specific approaches.
Look for random anecdotes - You have errors? Look at 10 instances of those errors (stack traces, error logs). See if those errors are what you expected. It's amazing how often this identifies something unexpected. Only 10 anecdotes is very often enough to identify something strange. You have strange customer behavior? Look at a sample of those customers.
Look for your outliers - You have high latency? Look at a log of your 10 slowest page loads. See what's taking all the time, and you'll often learn something. What are your fastest page loads? Look at those as well. Questioning the best results often leads to learning.
Look for corroborating metrics - Your revenue shows an increase of 20% over the last week? How much did the loads of your checkout page grow? Your new feature metrics say that they drove an incremental 10% revenue over the last month? What was your monthly growth in comparison to previous months? Did you see that incremental revenue? Try to think of how various metrics should interact, and see if they're all telling the same story. When they don't line up (and they often don't), find the explanation for the mismatch. It may be a very important story to learn.
I once did that exact research after an advertising campaign confidently proclaimed that they drove a sizeable increase in revenue. Too many leaders happily accepted their claim as gospel, and approved an increase in the advertising budget. However, I was able to bring the discussion back to the table once I pointed out that there didn't appear to be any correlation between our sales / revenue and our advertising spend. We learned quite a bit about how we were tracking ad success on that day.
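If you'd like to try these three approaches on your own data, here is a rough sketch in plain Python. The helper names (`sample_anecdotes`, `outliers`, `pearson`) and all of the data are hypothetical, and I'm assuming you can export your logs or metrics as simple lists. Treat it as a starting point for asking questions, not as a finished analysis. In particular, a weak correlation doesn't prove the ads did nothing; it's just a strong hint that the story deserves a closer look.

```python
import random
from math import sqrt


def sample_anecdotes(records, k=10, seed=None):
    """Pick up to k random records to read by hand."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def outliers(records, key, k=10):
    """Return (k smallest, k largest) records by key, most extreme first."""
    ordered = sorted(records, key=key)
    return ordered[:k], ordered[-k:][::-1]


def pearson(xs, ys):
    """Pearson correlation between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of the same length (at least 2)")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Made-up page load log (milliseconds).
page_loads = [
    {"page": f"/p{i}", "ms": ms}
    for i, ms in enumerate([120, 95, 4000, 110, 3, 130])
]
to_read = sample_anecdotes(page_loads, k=3, seed=1)
fastest, slowest = outliers(page_loads, key=lambda r: r["ms"], k=2)

# Made-up weekly numbers: ad spend climbs, revenue barely moves.
ad_spend = [10, 20, 30, 40]
revenue = [100, 98, 103, 99]
r = pearson(ad_spend, revenue)
print(f"spend vs. revenue correlation: {r:.2f}")
```

In this invented data, the 3 ms page load deserves as much attention as the 4000 ms one: a page that loads that fast may not be loading at all.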