You've probably seen "verify everything" as the stock advice attached to AI answers. If you're building AI products, that line should sound less like a safety tip and more like a product confession.
If a weather app looked exactly like a calendar, you'd resent the rain for breaking a promise it never made. The forecast would still be probabilistic. The lie would live in the presentation.
You've probably had the AI version of this already. You ask a clean question, get a clean answer, and later discover the model stitched together something statistically plausible and factually false.
When you're building with AI, the stakes are higher. You are choosing how that failure feels for your users.
The answer box teaches the wrong lesson
I've started thinking we talk about hallucinations one layer too low.
We treat them as a model quality problem, which they are. Better training, better retrieval, better evals, better guardrails, all true.
But the interface is doing hidden work. A chat box, a blinking cursor, and a polished paragraph teach the user to expect the kind of reliability they learned from search, calculators, customer support, and forms. Question in, answer out. Cleanly.
A language model operates on a different contract. Underneath the glass, it is generating the next token from a probability distribution.
OpenAI made this point unusually plainly in 2025. Their argument was that standard evaluations often reward guessing over admitting uncertainty, because a lucky guess scores and "I don't know" does not.¹
So now we have a system pushed toward guessing and wrapped in an interface pushed toward certainty. That combination matters.
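The scoring incentive is easy to see with a little arithmetic. This is a minimal sketch, not OpenAI's evaluation code: `expected_score`, `should_answer`, and the `wrong_penalty` parameter are hypothetical names, and the grading scheme (1 point for a correct answer, 0 for abstaining, an optional penalty for a wrong answer) is an assumption chosen to show the effect.

```python
def expected_score(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected score for answering when the model is right with probability p_correct.

    Assumed grading scheme (illustrative only): +1 for a correct answer,
    -wrong_penalty for a wrong one, and 0 for saying "I don't know".
    """
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_answer(p_correct: float, wrong_penalty: float = 0.0) -> bool:
    """Answer only if guessing beats the 0 points earned by abstaining."""
    return expected_score(p_correct, wrong_penalty) > 0.0


# Binary grading (no penalty): even a 10% long shot beats abstaining.
print(should_answer(0.10))                     # True
# Penalize wrong answers by 1 point: the same guess now loses to "I don't know".
print(should_answer(0.10, wrong_penalty=1.0))  # False
# With that penalty, answering only pays off above 50% confidence.
print(should_answer(0.60, wrong_penalty=1.0))  # True
```

With no penalty for being wrong, there is no confidence level at which abstaining wins. That is the incentive the chat box then dresses up as certainty.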
"Verify everything" is a product confession
"Verify everything" sounds responsible until you sit with it for a minute.
If every answer needs verification, the answer box is overpromising. The product is asking the user to do the calibration work after the interface already nudged them toward trust.
Imagine a bank app showing your balance and then adding, "Verify everything." You would not call that thoughtful product guidance. You would call it a failure of the product contract.
AI gets extra grace because the technology feels magical. Magic hides bad packaging for a while.
What bothers me is that we keep treating verification like media literacy for the user. A lot of it is design debt on the builder's side.
Part of why the debt stays on the books is that the deterministic costume sells. Confident paragraphs convert. An honest interface with visible seams looks less magical in the demo and slower in the funnel. The product's magic trick is letting the cost of that tradeoff land on the user after signup.
Confidence changes behavior
This is not just a philosophical complaint.
A 2024 study put 404 people in front of medical questions and responses from a fictional LLM-infused search engine. When the system used first-person uncertainty language like "I'm not sure, but...", people trusted it less, relied on it less, and were less likely to copy its mistakes, so the answers those participants gave were more accurate overall.²
The wording changed behavior.
That result is easy to miss, and it matters a lot. It means uncertainty is not a small UX flourish. It is part of the product's steering wheel.
Researchers have found something similar on the model side. A 2024 Nature paper showed that uncertainty signals can help detect likely confabulations and improve accuracy when the system refuses shaky answers instead of bluffing through them.³
We already know enough to stop pretending the clean paragraph is the whole product.
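To make the model-side signal less abstract, here is a rough sketch of the idea behind semantic entropy: sample several answers to the same question, group the ones that mean the same thing, and measure how spread out the groups are. Everything here is an assumption for illustration. The paper groups answers by meaning using a model; the `normalize` function below is a crude string-matching placeholder so the example runs on its own, and the 1-bit threshold is arbitrary.

```python
import math
from collections import Counter


def normalize(answer: str) -> str:
    """Crude stand-in for a meaning check: lowercase and drop punctuation.

    A real system would group answers by meaning; this placeholder only
    catches differences in case, spacing, and punctuation.
    """
    kept = "".join(ch for ch in answer.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def meaning_entropy(samples: list[str]) -> float:
    """Shannon entropy (in bits) over groups of equivalent sampled answers."""
    counts = Counter(normalize(s) for s in samples)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def answer_or_abstain(samples: list[str], max_entropy_bits: float = 1.0) -> str:
    """Return the most common answer, or abstain if the samples disagree too much."""
    if not samples or meaning_entropy(samples) > max_entropy_bits:
        return "I don't know."
    groups = Counter(normalize(s) for s in samples)
    top_group, _ = groups.most_common(1)[0]
    # Return the first original phrasing that belongs to the winning group.
    return next(s for s in samples if normalize(s) == top_group)


stable = ["March 31", "march 31.", "March 31", "March 31"]
shaky = ["March 31", "April 15", "June 1", "May 30"]
print(answer_or_abstain(stable))  # March 31
print(answer_or_abstain(shaky))   # I don't know.
```

The useful part for a product team is the return type: the same function that produces the answer can also produce the refusal.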
The honest interface shows its seams
If you are building AI features, I think there is a simple test here. Call it the Seam Test.
- Show confidence. Give the user some sense of how stable the answer is, and make it grounded in something checkable, not a vibes number. "High confidence, 3 retrieved docs agree on the date" is useful. A free-floating "94%" is the same bluff in a different way.
- Show mode. Tell the user whether the system is "Quoted from source," "Synthesized across 4 sources," or "Model inference." Those are three very different kinds of answers.
- Show evidence. Make the supporting material inspectable without turning the user into a forensic analyst. Clicking a sentence should open the exact excerpt behind it. Unsupported clauses should be obvious instead of hiding inside the paragraph.
- Show abstention. Let the system say "I don't know," ask for clarification, or narrow the claim. Refusal is sometimes the most useful answer on the screen.
If all four seams are hidden, users are forced to reverse-engineer uncertainty from vibes. That is a terrible interface.
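One way to keep the four seams from staying hidden is to make them fields the response cannot ship without. The sketch below is one possible shape, not a standard schema: `Mode`, `Evidence`, `SeamedAnswer`, and `hidden_seams` are all hypothetical names, and the rule that quoted or synthesized answers must carry excerpts is my assumption.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    QUOTED = "Quoted from source"
    SYNTHESIZED = "Synthesized across sources"
    INFERENCE = "Model inference"


@dataclass
class Evidence:
    source_id: str  # hypothetical document identifier
    excerpt: str    # the exact passage the user can inspect


@dataclass
class SeamedAnswer:
    text: str
    mode: Mode                 # required, so every answer declares its kind
    confidence_reason: str     # checkable grounding, not a bare number
    evidence: list[Evidence] = field(default_factory=list)
    abstained: bool = False    # True when the system declined to answer


def hidden_seams(answer: SeamedAnswer) -> list[str]:
    """List the seams this answer fails to show."""
    hidden = []
    if not answer.confidence_reason.strip():
        hidden.append("confidence")
    # Assumed rule: quoted or synthesized answers must carry inspectable excerpts.
    needs_evidence = answer.mode is not Mode.INFERENCE and not answer.abstained
    if needs_evidence and not answer.evidence:
        hidden.append("evidence")
    return hidden


bare = SeamedAnswer(
    text="The renewal deadline is March 31.",
    mode=Mode.QUOTED,
    confidence_reason="",
)
print(hidden_seams(bare))  # ['confidence', 'evidence']
```

A check like `hidden_seams` could run before rendering, so a response that claims to quote a source but has no excerpt never reaches the user looking like a finished answer.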
[Interactive demo: "One answer. Five contracts." The same model response is shown under each of five interface contracts. The answer keeps its words while the interface changes what the reader does next. Sample model response: "The standard treatment for hypertension is a daily low-dose aspirin combined with lifestyle modifications including a reduced-sodium diet and regular exercise."]
There is a trap here worth naming. Badges and labels can turn into ritual. Cookie banners started as transparency and ended as noise. If every response ships a tiny "High confidence" sticker, the sticker becomes wallpaper and the user is reverse-engineering vibes again.
The signals have to vary or they are noise. A confidence badge that always reads high is wallpaper. A source chip that never reveals a gap is wallpaper. Seams earn trust by being honest and occasionally refusing, narrowing, or pointing at nothing solid. If every seam always says "yes, all good," you have built a new costume and called it honesty.
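The wallpaper problem is checkable from logs. As a rough sketch, assuming you record which badge each response shipped with, you can flag a badge that almost never changes. `is_wallpaper` is a hypothetical helper and the 95% cutoff is an arbitrary assumption, not a recommended value.

```python
from collections import Counter


def is_wallpaper(badges: list[str], max_share: float = 0.95) -> bool:
    """True if a single badge value accounts for more than max_share of responses."""
    if not badges:
        return False
    most_common_count = Counter(badges).most_common(1)[0][1]
    return most_common_count / len(badges) > max_share


always_high = ["high"] * 100
varied = ["high"] * 70 + ["low"] * 20 + ["abstained"] * 10
print(is_wallpaper(always_high))  # True
print(is_wallpaper(varied))       # False
```

A flag here is a prompt to look, not a verdict. A badge could be uniform because the system really is that reliable on that traffic, but that is worth confirming rather than assuming.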
This does not require some futuristic interface. A legal research tool could tag one sentence "Quoted from Smith v. Jones," the next "Summary of two cases," and the next "Inference based on both." A support copilot could flag "Low confidence, policy documents conflict" before an agent sends the reply.
In one response card, those cues could live beside the text instead of inside another disclaimer. Green marks grounded passages. Amber marks synthesis. Red marks extrapolation.
Less magic, maybe. More honesty.
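The color cues above need very little machinery once each sentence carries a label. A minimal sketch, assuming some upstream step has already tagged every sentence as grounded, synthesis, or extrapolation (that step is the hard part and is not shown); `COLOR_BY_LABEL` and `render_card` are hypothetical names, and the sentences are placeholders.

```python
# Label-to-color mapping taken from the response card described above.
COLOR_BY_LABEL = {
    "grounded": "green",
    "synthesis": "amber",
    "extrapolation": "red",
}


def render_card(sentences: list[tuple[str, str]]) -> str:
    """Render each sentence with its color cue beside it, one sentence per line."""
    lines = []
    for text, label in sentences:
        color = COLOR_BY_LABEL[label]  # unknown labels raise KeyError on purpose
        lines.append(f"[{color:5}] {text}")
    return "\n".join(lines)


card = render_card([
    ("Sentence quoted from Smith v. Jones.", "grounded"),
    ("Sentence summarizing two cases.", "synthesis"),
    ("Sentence inferred from both.", "extrapolation"),
])
print(card)
```

Failing loudly on an unknown label is a deliberate choice in this sketch: a sentence with no declared status should not quietly render as plain, trustworthy-looking text.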
Builders are still copying the wrong ancestor
Every new material imitates the old form at first. Early concrete buildings copied stone cathedrals. Early cars borrowed the logic of horse carriages. AI products are doing that with chat.
The more useful pattern is what happens next. Once a probabilistic technology matures, the interface stops pretending to be deterministic.
Weather forecasts went from "rain tomorrow" to "70% chance of rain," a shift the US Weather Bureau pushed through starting in 1965 over years of public confusion. Navigation apps paint predicted slowdowns red on roads you have not driven yet. Radiology reports hedge routinely ("likely represents," "cannot exclude") because the read is a probability, not a verdict. Autopilot throws a loud "take over now" when it loses confidence.
AI chat is sitting where those fields were at the beginning. The difference is that we do not have to invent anything. The percentages, the red roads, the hedged reads, the loud warnings: every one of those moves already works in the field. Most AI products ship without any of them.
That costume is why hallucinations feel like betrayal instead of weather. The model made a guess. The interface told you it was an answer.
[Interactive demo: "Watch AI behavior reshape a workflow." The same small workflow runs for five rounds under two AI behaviors: one version always answers, the other abstains when evidence is shaky. Sample task: extract the renewal deadline from the policy notes. Under "always answer," the model replies "The deadline is March 31." It looks decisive, so the date gets copied into the support macro.]
The next wave of AI products will probably win trust in a less glamorous way. They will stop hiding uncertainty. They will make groundedness legible. They will let users see when the system is recalling, when it is composing, and when it is reaching.
That sounds less slick than today's chat box.
It also sounds like the beginning of an honest product category.
If the uncertainty is real, the interface has to make it legible. Until then, we will keep building the most confident-sounding "maybe" machine in history, then acting surprised when people mistake it for certainty.
Rabbit Hole
If this line of thought clicks, you might enjoy "You're Building a Stone Cathedral Out of Concrete" on how new materials first get forced into old shapes: https://mvrckhckr.com/articles/youre-building-a-stone-cathedral-out-of-concrete
You might also like "Static UI Isn't Legacy. It's Institutional Memory You Can Click." on why interface structure often carries real organizational knowledge: https://mvrckhckr.com/articles/static-ui-isnt-legacy-its-institutional-memory-you-can-click
And "The One-Shot Illusion" connects from another angle, why polished AI output can hide reliability gaps until users pay the price: https://mvrckhckr.com/articles/the-one-shot-illusion
¹ OpenAI, "Why language models hallucinate," September 5, 2025. This is the clearest primary-source statement of the training and eval problem behind confident guessing. It matters here because the interface critique lands harder once the underlying system is already pushed away from saying "I don't know."
² Sunnie S. Y. Kim, Q. Vera Liao, Mihaela Vorvoreanu, Stephanie Ballard, and Jennifer Wortman Vaughan, "'I'm Not Sure, But...': Examining the Impact of Large Language Models' Uncertainty Expression on User Reliance and Trust," FAccT 2024. This is the study behind the claim that uncertainty wording changes user behavior. It matters because it shows that interface language does not just change how users feel, it changes what they do and whether they repeat the model's errors.
³ Sebastian Farquhar, Jannik Kossen, and Yarin Gal, "Detecting hallucinations in large language models using semantic entropy," Nature 630, 625-630 (2024). This paper is a model-side complement to the product argument. It matters because it shows uncertainty is not only a UX nicety, it can be used as a real signal for catching likely hallucinations.