Coverage tooling for AI
Some, but not all, traditional software engineering tools will have new, AI-optimized analogues. Consider coverage tooling: not test coverage, where you measure how much of your code your tests touch,[1] but production coverage, where you measure how often different parts of your code actually run in production.[2]
It's much easier to build a simple version of this for AI coding than it is to build the analogous traditional tools. Claude supports hooks, and one useful "coverage" metric is: did Claude read a given Markdown file?
It took me approximately two minutes[3] to discuss and implement this tool with Claude. The (very simple) structure I'm trying is:
- Making a little SQLite table of Markdown-file-opening events that's visible systemwide;
- Adding a Claude hook (again, systemwide) that inserts a row whenever Claude reads a Markdown file (sketched just below);
- Adding some convenience tools to pull and summarize the data (a sample query follows the notes further down).
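A minimal sketch of the hook half, not my actual implementation: it assumes Claude Code's PostToolUse hook event with a matcher on the Read tool, and a JSON payload on stdin whose tool_input includes a file_path; the database location (~/.claude/md_coverage.db) is just a placeholder, so check the hook documentation for the exact event and field names before registering it systemwide (e.g., in ~/.claude/settings.json).

```python
#!/usr/bin/env python3
# Sketch of a hook script that logs Markdown reads to SQLite.
# Assumes the PostToolUse payload arrives as JSON on stdin with a
# tool_input.file_path field; verify against the current hook docs.
import json
import sqlite3
import sys
import time
from pathlib import Path

DB_PATH = Path.home() / ".claude" / "md_coverage.db"  # placeholder location


def main() -> None:
    payload = json.load(sys.stdin)
    file_path = (payload.get("tool_input") or {}).get("file_path", "")
    if not file_path.endswith(".md"):
        return  # only track Markdown files

    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS md_reads ("
        "  path TEXT NOT NULL,"
        "  read_at INTEGER NOT NULL"  # unix timestamp
        ")"
    )
    conn.execute(
        "INSERT INTO md_reads (path, read_at) VALUES (?, ?)",
        (file_path, int(time.time())),
    )
    conn.commit()
    conn.close()


if __name__ == "__main__":
    main()
```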
Now I'm just letting it run for a while. It's already given me some useful pointers to "hot" parts of the system, where a "hot" Markdown file is one that's being read often. A few notes on the process:
- I'm working to factor skills and other Markdown files into more granular pieces,[4] and this tool encourages that, since smaller files yield more fine-grained coverage information.
- This is useful in part because I have written some, but not most, of these Markdown files by hand. Almost all have been edited by AI, many were made by AI, and some I've never read all the way through.
- A few obvious responses to seeing that a Markdown file is "hot" are (i) to read it carefully and make sure it's as good as possible, (ii) to check that its being hot doesn't indicate a system design failure, and (iii) to consider splitting it up if it is long. We'll see which of these are most useful.
- It's easy to think of ways to extend this (e.g., with more sophisticated scoping/namespacing, or by occasionally asking the AI how useful a file was). This points to a meta-skill of using AI: keeping such improvements in mind while also just doing things.
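As for the "pull and summarize" side, a minimal sketch, assuming the md_reads table from the hook sketch above, might be no more than a sorted count; the top of the list is the "hot" set.

```python
#!/usr/bin/env python3
# Sketch of a "hot files" summary over the md_reads table above.
import sqlite3
from pathlib import Path

DB_PATH = Path.home() / ".claude" / "md_coverage.db"  # same placeholder location

conn = sqlite3.connect(DB_PATH)
rows = conn.execute(
    "SELECT path, COUNT(*) AS reads"
    " FROM md_reads"
    " GROUP BY path"
    " ORDER BY reads DESC"
    " LIMIT 20"
).fetchall()
conn.close()

# Print the most-read Markdown files first.
for path, reads in rows:
    print(f"{reads:5d}  {path}")
```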
I'm hoping for, and expecting, a lot of cross-pollination between the worlds of traditional software tooling and AI-centric software tooling. This project, however primitive, feels like a step in that direction.
1. Most commonly, you just measure how many lines of code were exercised while tests were run. This tends to produce a very crude picture of how well you're actually testing your software, but estimating coverage and (especially) identifying coverage gaps can be legitimately useful.
2. This is a bit of an oversimplification, but it's hard to say something more or different: the details vary too much by language and deployment environment.
3. (Literally.)