When the AI is usually good enough
Here are some things I've read and noticed this week:
- Simon Willison reports that AI is working "just fine" in "any existing codebase." (I agree.)
- The models are getting better and better.
- I've been using Gemini Flash 3 a lot: it's fast and usually does the job well.
The cumulative effect is that I worry less about whether the model can do X (if I'm asking, it usually can!) and more about other things: speed, cost, convenience, style, or whatever else. This should change how we work:
- The time we spend researching tools should be spent less on studying benchmarks: many of the scores are good enough, and (as Zvi argues) the important differences are likelier and likelier to consist in things that cannot be measured¹ and must instead be experienced and considered.
- We should think about speed and ergonomics more when we're choosing our tools.
- We should, I think, compare several models' approaches to the same problem even more often: if I'm not sure whether (say) Claude got it right, the relevant question is less likely to be "can I squeeze a bit more quality out of this version of Opus?" and more likely to be "am I in a blind spot that Codex or Gemini will see immediately?" (A minimal fan-out sketch follows this list.)
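To make that last bullet concrete, here is a rough sketch of the fan-out in Python: the same prompt goes to two providers in parallel and the answers come back side by side. It assumes the `openai` and `anthropic` SDKs with API keys in the environment; the prompt and model IDs are placeholders, not recommendations, and adding a third provider is just one more function and one more entry in the list.

```python
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI
from anthropic import Anthropic

# Placeholder prompt -- in practice, paste in the actual problem.
PROMPT = "Here is the function and the failing test; what did I miss?"


def ask_openai(model: str) -> str:
    # Standard OpenAI chat completions call.
    resp = OpenAI().chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    )
    return resp.choices[0].message.content


def ask_anthropic(model: str) -> str:
    # Standard Anthropic Messages API call.
    resp = Anthropic().messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": PROMPT}],
    )
    return resp.content[0].text


# (label, callable) pairs; the model IDs are illustrative placeholders.
CALLS = [
    ("gpt-4.1", lambda: ask_openai("gpt-4.1")),
    ("claude-sonnet", lambda: ask_anthropic("claude-sonnet-4-20250514")),
]

if __name__ == "__main__":
    # Fan the same prompt out in parallel, then read the answers side by side.
    with ThreadPoolExecutor() as pool:
        futures = {label: pool.submit(fn) for label, fn in CALLS}
    for label, future in futures.items():
        print(f"=== {label} ===\n{future.result()}\n")
```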
I'm sure there are many other good adjustments to be made. Perhaps, also, we will adjust by giving the AI harder and harder problems, but this isn't obvious to me. We might just get better and better at feeding AI-tractable problems to the AI. But that's another post.
¹ ...or, at least, cannot yet be measured.