Metrics Circularity
Measurement needs to be kept at arm’s length from the system being measured
I spent a good portion of my time at Google working in Search Quality, which included working on the search ranking algorithm. I learned early on at Google that the evaluation of ranking algorithms has a long, storied history in the field of Information Retrieval that dates back to at least the 1960s.
Having been steeped in Search Quality culture for so long, I was surprised that many experienced data scientists and engineers who weren’t exposed to ranking evaluation either ignored or were unaware of a critical insight: a good evaluation metric’s definition and implementation must be kept at arm’s length from the system being measured. This is to combat a problem we called “metrics circularity.”
To illustrate, suppose that you work on an email product, and begin to suspect that spam might be a problem within your product. This creates a simultaneous pair of questions:
product question: what needs to be done to fix the problem?
measurement question: how do you know you’ve succeeded (or made progress) in fixing it?
At first, the product and measurement questions might seem completely independent, but upon reflection it’s easy to see that they are deeply interrelated. Having a spam metric means you have a spam classifier (the measurement question). Having a spam classifier is exactly what you want as an engineering solution to identify and eliminate spam (the product question). But if you use the same classifier both in your spam-fighting system and to measure the prevalence of spam, then your spam metric is completely blind to any form of spam that the classifier doesn’t recognize. Put another way, you have no way of detecting the classifier’s false negatives (missed spam) or false positives (legitimate mail flagged as spam).
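The circularity can be made concrete with a toy sketch (hypothetical data and a deliberately naive keyword classifier, purely for illustration): when the same classifier both filters spam and measures it, the measured spam rate collapses to zero even though real spam survives the filter.

```python
# Toy illustration of metrics circularity.
# One keyword rule serves as both the spam filter and the spam metric,
# so the post-filter metric is blind to everything the rule misses.

def classifier(msg):
    """The product's spam classifier: flags messages containing 'viagra'."""
    return "viagra" in msg

inbox = [
    "team lunch at noon",           # ham
    "cheap viagra now",             # spam the classifier catches
    "you won a prize, click here",  # spam the classifier misses (false negative)
]
true_spam = {1, 2}  # ground-truth labels, unknown to the system

# Product: remove everything the classifier flags.
filtered = [m for m in inbox if not classifier(m)]

# Circular metric: measure remaining spam with the SAME classifier.
circular_rate = sum(classifier(m) for m in filtered) / len(filtered)

# Arm's-length metric: independent labels on the filtered mail.
true_rate = sum(inbox.index(m) in true_spam for m in filtered) / len(filtered)

print(circular_rate)  # 0.0 -- the metric sees no spam at all
print(true_rate)      # 0.5 -- half the surviving mail is actually spam
```

The circular metric reports victory by construction; only the independently labeled measurement reveals the residual spam.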
It’s generally a best practice for an organization to have separate product and evaluation teams, so that the product team isn’t “grading their own homework.” This is where metrics circularity becomes awkward. If you improve spam measurement, then it’s almost always the right decision to implement that improvement into the system, to realize it as a product win that’s good for the user and good for the business. Thus, we’ve seen that any good quality metric eventually becomes used as a quality signal.
The metrics circularity issue is, in some sense, the opposite of the incentives-driven metric gaming described by Goodhart’s Law. Here the metric isn’t gamed; its value is transferred into the product, which nonetheless still reduces the efficacy of the metric. Yet another reason why metrics can have short shelf lives.
A similar phenomenon is in the news today as the AI slop apocalypse (AI slopocalypse?), in which LLMs trained on LLM-generated content cause whatever truly valuable training data remains to become vanishingly obscured. For this reason, many of us were extremely vigilant about precisely defining LLM-based metrics, and about ensuring there were separation mechanisms between LLMs-as-metrics and LLMs-in-systems. Otherwise, you run the risk of creating “unit test metrics” that only measure whether a product was launched, not whether it does anything of value. That will be the topic of the next post.
P.S. My experience has been that product-level problems corresponding to north star metrics tend to have relatively independent answers to the product and measurement questions. In retrospect, this might be why many data scientists, even very senior ones who operated only at the strategic level and not the tactical one, weren’t as exposed to metrics circularity.


What is your suggestion about tackling metrics circularity?
Would it work if an independent DS team works on metrics, and the moment the metrics get absorbed by the product team into product changes, the DS team simply moves on to the next generation of metrics? Even then, it’s hard to guarantee that the next-gen metrics are separated enough from the earlier-gen metrics, because we seldom create metrics from scratch; they’re almost always built on a good earlier metric.