Measuring AI Model Quality by Intervention Frequency
Knowledge on this page was mainly distilled from The Problem With “Open Models Are Last Year’s Frontier”.
Count the Interruptions, Not the Capabilities
When comparing AI models, the instinct is to ask what each model can do. A more revealing question is how often you have to intervene: restating instructions, correcting tone, catching hallucinations, or patching generated code. That intervention count is the real cost of using a model, and it rarely appears in benchmark tables.
A model that requires frequent babysitting pulls your attention out of the problem and into managing the tool. The result is broken focus and lost momentum, costs that do not show up in token pricing but show up directly in time-to-ship.
The Disappearing Tool Test
When a model is good enough for a task, you forget it exists. You stay in the problem. When it is not, you start repeating constraints, rephrasing prompts three ways, and reading every output with suspicion. That shift from thinking about the work to thinking about the tool is the clearest signal that the model is below your threshold.
Q&A
What is intervention frequency as a model quality metric?
It is the number of times you need to restate an instruction, correct the output, catch a hallucination, or manually fix generated code during a working session. Unlike benchmark scores, it captures the lived friction of using a model. A model that scores well on benchmarks but requires constant correction may cost more in practice than a higher-priced model that gets things right the first time.
Why is intervention frequency better than benchmarks for practical comparison?
Benchmarks measure capability on fixed tasks, but working sessions involve a stream of small moments where a model can fail in subtle, varied ways. Intervention frequency captures the cumulative friction of those small failures. Two models can score similarly on a benchmark yet feel very different because one distributes its errors in ways that break your flow more often.
How do you track intervention frequency in practice?
A simple method is to keep a tally during a focused work block: each time you rephrase a prompt, manually correct output, or re-read something with suspicion, mark it. After a session, compare tallies across models on similar tasks. Even rough counts reveal patterns that token costs and benchmark tables miss.
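The tally described above can be kept on paper, but a small script makes it easier to compare sessions. The following is a minimal sketch, not a prescribed tool: the class name `SessionTally`, the category labels, and the per-task average are all assumptions chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical category labels mirroring the three events named above.
CATEGORIES = ("rephrase", "manual_fix", "suspicious_reread")


@dataclass
class SessionTally:
    """Counts interventions for one model during one focused work block."""
    model: str
    tasks: int = 0
    counts: Counter = field(default_factory=Counter)

    def mark(self, category: str) -> None:
        """Record one intervention of the given category."""
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        self.counts[category] += 1

    def finish_task(self) -> None:
        """Record that one task was completed in this session."""
        self.tasks += 1

    def per_task(self) -> float:
        """Average interventions per finished task (0.0 if none finished)."""
        if self.tasks == 0:
            return 0.0
        return sum(self.counts.values()) / self.tasks


# Example session: three interventions spread over two finished tasks.
tally = SessionTally(model="model-a")
tally.mark("rephrase")
tally.mark("manual_fix")
tally.finish_task()
tally.mark("rephrase")
tally.finish_task()
print(tally.model, dict(tally.counts), tally.per_task())
```

Dividing by finished tasks is one way to make sessions of different lengths roughly comparable; a per-hour rate would work as well if your tasks vary widely in size.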
What is the 'disappearing tool' signal?
When a model is well-matched to your task, you stop noticing it and stay focused on the problem itself. When it is not, your attention keeps getting pulled toward managing the tool. This shift from problem-focus to tool-focus is the clearest experiential indicator that a model falls below your working threshold, regardless of its published capabilities.
How does intervention frequency relate to cost comparisons?
Dollars per million tokens is easy to compare; minutes per shipped feature is harder to measure but more honest. A cheaper model that requires ten micro-corrections per task may cost more in real time and cognitive load than a pricier model that nails the task on the first attempt. That second number, minutes per shipped feature, decides whether the tool actually pays for itself.
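To make the comparison concrete, the arithmetic can be sketched as follows. Every number here is invented for illustration (token spend, minutes per correction, hourly rate), and the function `cost_per_feature` is a hypothetical helper, not a standard formula. It ignores cognitive load, which is harder to price.

```python
def cost_per_feature(
    token_cost_usd: float,
    base_minutes: float,
    corrections: int,
    minutes_per_correction: float,
    hourly_rate_usd: float,
) -> float:
    """Token spend plus the cost of the time spent, per shipped feature."""
    total_minutes = base_minutes + corrections * minutes_per_correction
    return token_cost_usd + (total_minutes / 60) * hourly_rate_usd


# Made-up inputs: same task, same base time, same hourly rate.
cheap = cost_per_feature(
    token_cost_usd=0.10,       # assumed token spend for the cheaper model
    base_minutes=20,
    corrections=10,            # ten micro-corrections, as in the text
    minutes_per_correction=3,  # assumed
    hourly_rate_usd=90,        # assumed
)
pricey = cost_per_feature(
    token_cost_usd=1.00,       # assumed token spend for the pricier model
    base_minutes=20,
    corrections=0,             # gets it right on the first attempt
    minutes_per_correction=3,
    hourly_rate_usd=90,
)
print(f"cheaper model: ${cheap:.2f} per feature")
print(f"pricier model: ${pricey:.2f} per feature")
```

With these assumed inputs, the correction time dominates the token price, so the cheaper model ends up costing more per feature. Different inputs can reverse the result, which is the reason to measure your own intervention counts before comparing.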