I keep hearing a comforting claim in AI circles: today’s open-weight models are basically the frontier models from six to twelve months ago.
That might even be true in some benchmark sense, or on some task mixes. But it feels false in the way that matters most, because it ignores a simple human trait. Once I taste “much better”, it’s hard to go back.
The actionable trick I keep coming back to is simple: when I compare models, I don’t start by asking what they can do. I start by counting how often I have to intervene. How many times do I need to restate the instruction, correct the tone, catch a hallucination, or patch the code it just wrote? That number is the real price.
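A minimal sketch of that counting habit, assuming I log each task by hand as I finish it. The class name `InterventionLog` and the four category labels are hypothetical, lifted from the questions above rather than from any real tool.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical intervention categories, taken from the questions above:
# restating an instruction, correcting tone, catching a hallucination, patching code.
KINDS = ("restate", "tone", "hallucination", "patch")


@dataclass
class InterventionLog:
    """Tally of manual interventions for one model across a set of tasks."""

    model: str
    tasks: int = 0
    counts: Counter = field(default_factory=Counter)

    def record_task(self, *interventions: str) -> None:
        """Log one finished task and the interventions it needed (possibly none)."""
        for kind in interventions:
            if kind not in KINDS:
                raise ValueError(f"unknown intervention kind: {kind}")
        self.tasks += 1
        self.counts.update(interventions)

    def per_task(self) -> float:
        """Average number of interventions per task; 0.0 if nothing is logged yet."""
        if self.tasks == 0:
            return 0.0
        return sum(self.counts.values()) / self.tasks


# Made-up session: three tasks, three interventions in total.
log = InterventionLog("model-a")
log.record_task()                    # clean run, no intervention
log.record_task("restate", "patch")  # two interventions on one task
log.record_task("hallucination")
print(log.model, log.per_task())     # model-a 1.0
```

Running the same tasks through two models and comparing `per_task()` gives a rough, personal number. It says nothing about capability, only about how often I had to step in.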
On paper, “equivalent to last year’s frontier” is a claim about capability. It’s a statement like “this camera from 2017 takes photos as sharp as that camera from 2016.” It’s tempting, because it sounds measurable.
But the lived experience of using a model isn’t a single score. It’s a stream of tiny moments. A model doesn’t fail once. It fails in little ways, with a certain frequency and a certain style.
One model misses an instruction and apologizes. Another misses it and doubles down with confidence. One writes code that runs but looks like it was assembled from five Stack Overflow tabs. Another writes code that feels like it was written by someone who has opinions and taste.
Those differences are hard to compress into a number. They show up as friction in the middle of thinking. They break rhythm.
In music, the gap between a decent speaker and a great speaker isn’t “it plays the notes.” Both play the notes. The gap is that with the great one, I stop noticing the equipment and start noticing the song.
That’s what better models do. They disappear more often.
It’s tempting to think of models the way we think of libraries. If a library has a function, it has it. The rest is mostly syntactic sugar.
But a model is closer to an interface, the kind of interface that reshapes the person using it. It becomes part of my attention loop. It changes what feels normal.
After a few days with a much better model, my internal standard moves. The older one starts to feel less like “slightly weaker” and more like “why am I babysitting this?”
This is the part that makes the “six to twelve months behind” framing misleading. The distance is measured in time, but the pain is measured in taste.
Paul Graham has a line of thought I keep returning to: taste is this hard-to-explain ability to recognize what’s good, and it’s one of the main things that separates makers who do great work from makers who do adequate work.¹ When taste improves, older work starts looking worse. The work didn’t change. The eyes did.
That same taste shift is why “The Skill AI Can’t Replace” argues that judgment is getting more important, not less.
Models do the same thing to me. They train my eye for what “good” output looks like.
So why does going back feel so bad?
Part of it is simple contrast. If I stare at a sharp photo and then look at a blurry one, the blur is louder. If I only ever look at blurry photos, blur feels normal.
But there’s also a deeper mechanism: people often evaluate outcomes relative to a reference point, and losses tend to loom larger than gains.² Once my reference point becomes “the model that usually nails the instruction”, using a model that nails it only sometimes doesn’t feel like a neutral downgrade. It can feel like a loss.
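One way to make the asymmetry concrete is a reference-dependent value function. This is only an illustrative sketch: the function name `felt_value` is hypothetical, and the default parameters (a loss-aversion factor of 2.25 and a curvature of 0.88) are assumptions borrowed from commonly cited estimates in the prospect theory literature, not measurements of how anyone feels about models.

```python
def felt_value(change: float, loss_aversion: float = 2.25, curvature: float = 0.88) -> float:
    """Subjective value of a change measured relative to the current reference point.

    Positive changes (gains) are dampened by the curvature exponent.
    Negative changes (losses) get the same dampening, then are scaled up
    by the loss-aversion factor.
    """
    if change >= 0:
        return change ** curvature
    return -loss_aversion * ((-change) ** curvature)


# Same-sized step up and step down from the reference point (units are arbitrary).
upgrade = felt_value(10)
downgrade = felt_value(-10)
print(upgrade, downgrade)
```

With these assumed parameters, a downgrade of a given size weighs 2.25 times as much as an upgrade of the same size, which is the "loss" feeling described above.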
This is why the “it’s almost as good” argument rarely lands after the upgrade happens. “Almost as good” is an argument from the outside. “I feel the loss on every third prompt” is an argument from the inside.
Which means the real debate isn’t open versus closed, or frontier versus non-frontier. It’s about where my reference point sits, and how quickly it moves.
The funny thing is, the AI community already has evaluation approaches that admit this problem.
One of the more visible public signals for model quality is pairwise preference, asking people which output they prefer rather than asking whether the output matches a fixed answer.³ ⁴ That’s basically taste, scaled up. It’s not perfect, but it acknowledges that the difference between models often shows up as “I like this one better” before it shows up as “this one is categorically more capable.”
If evaluation leans on preference, then “equivalence” starts to look strange. Two models can be close on a set of tasks and still be far apart in how they feel to use, because the gap is in the distribution of annoyances.
One model might be wrong with grace. Another might be wrong with swagger. Those are not the same product.
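To show how "I like this one better" votes can become a ranking, here is a generic Elo-style update. It is a sketch under stated assumptions, not necessarily the exact method behind any public leaderboard: the starting rating of 1000, the K-factor of 32, and the function names are all illustrative choices.

```python
from typing import Dict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def record_preference(ratings: Dict[str, float], winner: str, loser: str, k: float = 32.0) -> None:
    """Update ratings in place after one pairwise vote (winner preferred over loser)."""
    rating_w = ratings.setdefault(winner, 1000.0)  # assumed starting rating
    rating_l = ratings.setdefault(loser, 1000.0)
    # The less expected the win was, the larger the shift.
    shift = k * (1.0 - expected_score(rating_w, rating_l))
    ratings[winner] = rating_w + shift
    ratings[loser] = rating_l - shift


# One made-up vote between two hypothetical models.
ratings: Dict[str, float] = {}
record_preference(ratings, winner="model-a", loser="model-b")
print(ratings)  # {'model-a': 1016.0, 'model-b': 984.0}
```

Nothing in the update asks why the voter preferred one output. A graceful error and a confident one only differ here if people vote differently on them, which is the point.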
I think this is why model debates get weird among builders.
A builder rarely needs a model that is “the best” in some abstract sense. A builder needs a model that keeps momentum alive.
Momentum is fragile. It’s not ruined by one big failure. It’s ruined by ten tiny interruptions that each demand context switching. If I have to keep re-anchoring the conversation, I’m not thinking about the product anymore. I’m thinking about the model.
This is also why cost comparisons get slippery. It’s easy to compare dollars per million tokens. It’s harder to compare minutes per shipped feature. The second number is the one that decides whether the tool paid for itself.
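A back-of-the-envelope version of that comparison, with every number made up for illustration. The function name `cost_per_feature`, the hourly rate, the token counts, and the prices are all hypothetical.

```python
def cost_per_feature(
    tokens_used: int,
    price_per_million_tokens: float,
    minutes_spent: float,
    hourly_rate: float,
) -> float:
    """Total cost of shipping one feature: token spend plus the value of my time."""
    token_cost = tokens_used / 1_000_000 * price_per_million_tokens
    time_cost = minutes_spent / 60 * hourly_rate
    return token_cost + time_cost


# Hypothetical cheap model: low token price, but more babysitting.
cheap = cost_per_feature(2_000_000, 1.0, minutes_spent=90, hourly_rate=60.0)

# Hypothetical pricier model: ten times the token price, half the time.
pricey = cost_per_feature(2_000_000, 10.0, minutes_spent=45, hourly_rate=60.0)

print(cheap, pricey)  # 92.0 65.0
```

In this invented case the model with the higher token price is the cheaper one per shipped feature, because time dominates the bill. With different numbers the result can flip, which is why I need both terms in the sum.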
That gap between “it works” and “it actually helps me ship” is the same trap I describe in “The One-Shot Illusion.”
So when someone says, “This open-weight model is basically last year’s frontier,” the question I want to ask is: last year’s frontier for what kind of work?
Writing a quick email? Probably fine. Debugging a subtle bug with four moving parts? The model that keeps the thread might be worth a lot more than its token price.
I don’t have a perfect method. I mostly notice a tell.
When a model is good enough for the work, I forget it exists. I stay in the problem.
When it’s not, I start doing this subtle, annoying kind of babysitting. I repeat constraints. I rephrase the same instruction three different ways. I ask for the same thing, but “shorter” or “more specific” or “with fewer assumptions.” I read everything with suspicion.
That babysitting is the real cost. It doesn’t show up in tokens. It shows up as broken focus.
So my practical test ends up being human: how often do I feel my attention get yanked out of the task and into managing the tool? If that happens a lot, the model isn’t “equivalent” for me, even if it looks close on paper. If it barely happens, I stop caring what tier it’s in.
The original “six to twelve months behind” claim is trying to be optimistic about open models. I get the impulse. I like the direction.
But the hidden conclusion isn’t “open models are catching up.” The hidden conclusion is “the reference point keeps moving.”
If the reference point keeps moving, then the question becomes less about the calendar and more about taste. How fast does my taste upgrade? How fast does the frontier upgrade? How often do I sample the best thing available, even briefly, and reset my baseline?
That’s why it’s hard to go back. It’s not because the older thing is unusable. It’s because I trained myself to see what’s missing.
So here’s the weird, slightly uncomfortable thought I’m left with: the real scarce resource isn’t access to models. It’s a stable definition of “good enough” that survives long enough for me to finish something.
Because every time I taste a much better model, I reset the bar. Suddenly the old bar feels insulting. Then “shipping” quietly turns into “shopping for output quality.”
The tool didn’t get worse. My taste got sharper. That’s a gift, but it also means I have to decide when I’m upgrading my standards, and when I’m just trying to get to done.
If I treated model upgrades the way I treat gear upgrades, I’d do them less often and more deliberately. Not because I don’t want better, but because I want finished.
Rabbit Hole
- Paul Graham, “Taste for Makers” (2002). This essay is one of the clearest explanations I’ve seen of taste as a real faculty, and why better taste changes what “good” looks like.
- Nobel Prize, “The 2002 Prize in Economic Sciences - Popular information” (2002). This summary states that people are more sensitive to deviations from a reference level (often the status quo), and that most people are more averse to losses relative to that reference level than they are pleased by gains of the same size. It’s a clean explanation for why “downgrades” feel sharper than “upgrades.”
- Wei-Lin Chiang et al., “Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference” (LMSYS, arXiv:2403.04132, 2024). This describes a widely used evaluation approach based on pairwise human preferences, which matters because it captures “feel” differences that fixed-answer benchmarks often miss.
- Lianmin Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena” (arXiv:2306.05685, 2023). The paper details MT-Bench and Chatbot Arena methods and reports that model-based judging correlates with human preference on their benchmarks, which helps explain why preference-based evaluation became such a strong proxy signal.