Nate Meyvis

Another reason we can't measure our productivity with AI

Serious poker players want to know their results, and they want to know how meaningful those results are. So they start tracking their winrates, and they compute how many hands they need for those rates to be statistically significant. That tends to be a big number, so they play for a long time, until they've exceeded it, and then they look at their winrate.
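(To see why the number is big: the standard error of a winrate shrinks only with the square root of the number of hands. Here's a minimal sketch of that computation; the 90 bb/100 standard deviation and the ±5 bb/100 margin are illustrative assumptions, not figures from any particular game.)

```python
import math

def hands_needed(std_dev_bb_100: float, margin_bb_100: float, z: float = 1.96) -> int:
    """Hands needed to pin a winrate to +/- margin (roughly a 95% CI when z = 1.96).

    Winrate and standard deviation are in big blinds per 100 hands,
    the usual poker convention.
    """
    # The standard error over n hands is std_dev / sqrt(n / 100), so solve
    # z * std_dev / sqrt(n / 100) <= margin for n.
    blocks_of_100 = (z * std_dev_bb_100 / margin_bb_100) ** 2
    return math.ceil(blocks_of_100 * 100)

# With an assumed (but typical) std dev of 90 bb/100, pinning a winrate
# down to +/- 5 bb/100 takes on the order of 125,000 hands.
print(hands_needed(90, 5))  # 124468
```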

The problem is that by the time they think their numbers are statistically significant, things have changed: they've hopefully improved (it's usually serious, studious players who bother to track these numbers carefully), and even if they haven't, the game conditions have been shifting underneath them. There are some exceptional cases, but usually, by the time you can measure something, that thing isn't your winrate anymore. The rate is useful data, but properly assessing your play always requires other kinds of analysis.

As a programmer, I can adopt new AI tools and try to measure their effects on my output, by whatever metrics might be available. But even putting aside the fact that metrics never fully describe software quality, tools are changing so fast that we're in the poker player's situation. By the time we could gather those metrics, conditions will have changed. Models and their behavior change, not only between releases but also within a release.[1] We change, especially because we're all so new to agentic engineering. As these tools get more powerful, even what we want to measure changes.

This is another reason that, as Zvi says, benchmarks have never been less useful. More practically, though, it means that assessing ourselves will have more to do with introspection and case study, and less to do with measuring output. The latter is still important: if you find yourself with lots more bugs, don't ignore that. When we're evaluating our work, however, we've never been at more risk of measuring something that no longer exists by the time we've measured it.


  [1] Major providers sometimes have temporarily degraded performance, and the services we use undergo continuous maintenance and improvement, so even if we're using the same model name and interface to it, our conditions are often changing.

#generative AI #psychology of software #software