The Seam Test: Four Signals Every AI Interface Should Expose
Knowledge on this page was mainly distilled from "AI Hallucinations Start at the Interface."
The Seam Test is a four-point checklist for evaluating whether an AI interface honestly communicates the nature of its output. When all four seams are hidden, users are forced to reverse-engineer uncertainty from vibes.
The Four Seams
- Confidence. Give the user a sense of how stable the answer is, grounded in something checkable. "High confidence, 3 retrieved docs agree on the date" is useful. A free-floating 94% is just overconfidence in a different costume.
- Mode. Tell the user whether the response is quoted from a source, synthesized across multiple sources, or generated by model inference. These are fundamentally different kinds of answers.
- Evidence. Make supporting material inspectable without turning the user into a forensic analyst. Clicking a sentence should open the exact excerpt behind it. Unsupported clauses should be visually obvious.
- Abstention. Let the system say "I don't know," ask for clarification, or narrow its claim. Refusal is sometimes the most useful answer on the screen.
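One way to make the four seams concrete is to treat them as fields on whatever object the interface renders. The sketch below is an illustration under assumptions, not a schema from the source article: the names `SeamedAnswer`, `Mode`, and `hidden_seams` are hypothetical. It checks the three seams that can be inspected on a single answer; abstention is carried as a flag, because whether a system can abstain only shows up across many answers.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Mode(Enum):
    QUOTED = "quoted"            # copied from a single source
    SYNTHESIZED = "synthesized"  # combined across several sources
    INFERRED = "inferred"        # produced by model inference


@dataclass
class SeamedAnswer:
    """Hypothetical response payload that carries the seams with the text."""
    text: str
    confidence_note: Optional[str] = None  # e.g. "3 retrieved docs agree on the date"
    mode: Optional[Mode] = None
    evidence: list[str] = field(default_factory=list)  # excerpts behind the text
    abstained: bool = False  # True when the system declined to answer


def hidden_seams(answer: SeamedAnswer) -> list[str]:
    """Return the per-answer seams this response fails to expose."""
    if answer.abstained:
        # An explicit "I don't know" needs no mode or evidence.
        return []
    missing = []
    if not answer.confidence_note:
        missing.append("confidence")
    if answer.mode is None:
        missing.append("mode")
    if not answer.evidence:
        missing.append("evidence")
    return missing
```

A bare string answer fails every per-answer check, which is the situation the Seam Test is meant to catch: `hidden_seams(SeamedAnswer(text="The contract was signed in 2019."))` returns all three seam names.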
Signals Must Vary or They Become Noise
Badges and labels can decay into ritual. If every response ships a "High confidence" sticker, the sticker becomes wallpaper. A source chip that never reveals a gap is wallpaper too. Seams earn trust by being honest, which means occasionally refusing, narrowing a claim, or admitting that nothing solid supports it.
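Whether a signal has decayed into wallpaper can be checked against logged output. The following is a minimal sketch, assuming the product logs one label per response; the function names and the 0.95 cutoff are hypothetical choices for illustration, not values from the source article.

```python
from collections import Counter


def dominant_share(labels: list[str]) -> float:
    """Fraction of responses that carry the single most common label."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def looks_like_wallpaper(labels: list[str], max_share: float = 0.95) -> bool:
    """Flag a signal whose most common label covers more than max_share of responses.

    The 0.95 default is an arbitrary illustrative threshold.
    """
    return dominant_share(labels) > max_share
```

A log of 100 responses that all read "high" is flagged. A log with 70 "high", 20 "low", and 10 "no source" is not, because the signal still says something different often enough to carry information.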
Q&A
What is the Seam Test for AI interfaces?
It is a four-point check that asks whether an AI interface exposes confidence, mode, evidence, and abstention. Each seam gives the user a different kind of information about how reliable and grounded the output is. If all four are hidden, the interface forces users to guess at reliability, which is why hallucinations feel like betrayal instead of expected uncertainty.
What does showing mode mean in practice?
Mode labeling tells users whether a given passage is directly quoted from a source, synthesized across multiple sources, or generated as model inference. A legal research tool might tag one sentence 'Quoted from Smith v. Jones,' the next 'Summary of two cases,' and the next 'Inference based on both.' These are three fundamentally different reliability levels that look identical in a plain paragraph.
How do you show confidence without faking precision?
Ground the confidence signal in something the user can verify. 'Three retrieved documents agree on this date' is meaningful. A floating percentage like 94% can imply false precision if it is not tied to a legible measurement. The goal is calibration the user can evaluate, not a number that merely looks authoritative.
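A grounded confidence note can be derived from how many retrieved documents agree, instead of from a bare score. This is a sketch under assumptions: `confidence_note` is a hypothetical helper, it expects one extracted value per retrieved document (for example, the date each document states), and the wording of the labels is illustrative.

```python
from collections import Counter


def confidence_note(extracted_values: list[str]) -> str:
    """Summarize agreement among values pulled from retrieved documents."""
    if not extracted_values:
        return "No supporting source found"
    total = len(extracted_values)
    top_value, top_count = Counter(extracted_values).most_common(1)[0]
    if top_count == total and total >= 2:
        # Every document states the same value.
        return f"High confidence, {total} retrieved docs agree on {top_value}"
    if top_count == total:
        # Only one document was retrieved, so there is nothing to agree with.
        return f"Single source, 1 retrieved doc states {top_value}"
    # At least two documents disagree.
    return f"Low confidence, {total} retrieved docs conflict ({top_count} state {top_value})"
```

Each note names a count the user can check by opening the documents, which is what separates it from a floating percentage.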
Why does abstention matter as a design signal?
A 2024 Nature paper showed that uncertainty signals can help detect likely confabulations and improve accuracy when the system refuses shaky answers instead of bluffing through them. Letting the model say 'I don't know' or narrow a claim is not a failure state. It is often the most honest and useful response the product can deliver.
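Abstention can be wired in as an ordinary return path. The sketch below is a deliberately crude stand-in and is not the method from the paper mentioned above: it assumes the system can sample several answers to the same question, and it abstains when too few of them match after simple normalization. The names `answer_or_abstain` and `render_answer`, and the 0.6 threshold, are hypothetical.

```python
from collections import Counter
from typing import Optional


def answer_or_abstain(samples: list[str], min_agreement: float = 0.6) -> Optional[str]:
    """Return the majority answer, or None to signal abstention.

    Agreement is measured by exact match after lowercasing and trimming,
    which is a simplification chosen for illustration.
    """
    if not samples:
        return None
    normalized = [s.strip().lower() for s in samples]
    answer, count = Counter(normalized).most_common(1)[0]
    if count / len(normalized) < min_agreement:
        return None
    return answer


def render_answer(samples: list[str]) -> str:
    """Show the answer, or an explicit abstention message."""
    result = answer_or_abstain(samples)
    if result is None:
        return "I don't know. Can you narrow the question?"
    return result
```

Returning `None` instead of a low-quality guess keeps abstention visible to the calling code, so the interface can render it as a first-class response and not as an error.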
How do you prevent confidence badges from becoming wallpaper?
The signals have to vary in practice. A badge that always reads 'high confidence' stops carrying information, much like cookie banners decayed from transparency into noise. The test is whether the signal occasionally says something the user does not want to hear, like 'low confidence' or 'no supporting source found.' Consistent positivity is just a new costume for the same overconfidence.
What is an example of the Seam Test applied to a real product?
A support copilot could flag 'Low confidence, policy documents conflict' before an agent sends a reply. In the same response card, green marks grounded passages, amber marks synthesis, and red marks extrapolation. No single disclaimer banner is needed because the signals live beside the text they describe.
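The response card described above can be sketched as a small rendering function. This is an illustration under assumptions: `render_card` and `MODE_COLORS` are hypothetical names, the card is rendered as plain text, and each passage arrives already labeled as grounded, synthesis, or extrapolation.

```python
# Color per passage kind, following the green/amber/red scheme in the example.
MODE_COLORS = {
    "grounded": "green",
    "synthesis": "amber",
    "extrapolation": "red",
}


def render_card(passages: list[tuple[str, str]], flag: str = "") -> str:
    """Render a plain-text card with a color tag beside each passage.

    `passages` is a list of (text, kind) pairs. `flag` is an optional
    card-level warning shown on the first line.
    """
    lines = []
    if flag:
        lines.append(f"! {flag}")
    for text, kind in passages:
        color = MODE_COLORS[kind]  # raises KeyError for an unlabeled kind
        lines.append(f"[{color}] {text}")
    return "\n".join(lines)
```

Because an unknown kind raises an error, a passage cannot reach the agent without a label, which keeps the signal beside the text it describes.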