One post tagged with "verification"

The Agent That Says No: Why Verification Beats Generation

February 25, 2026 · 8 min read

Senior Software Engineer

An autonomous improvement system without verification is just autonomous damage. The Code Improver can write fixes. The Skill Evolver can edit prompts. But neither should be trusted to judge its own work. That's why the Verification Gate exists.

The Verification Gate is the fifth agent in our six-agent autonomous self-improvement pipeline for nomadically.work. It validates every change made by the Skill Evolver and Code Improver before those changes are accepted. It never modifies code or skills — it only reads, checks, and reports a verdict.

Five research papers shaped its design, curated from the VoltAgent/awesome-ai-agent-papers collection. The common thread: autonomous systems need calibrated self-awareness about the quality of their own outputs.

Note: The implementation has since evolved from a generic verification gate into a goal-driven "Application Coach" focused on learning from application patterns and improving interview preparation. The research principles described here still underpin the architecture. The verification checks and verdict system below reflect the original design that these papers informed.

Generation Is Easy. Verification Is Hard.

Every AI coding tool in 2026 can generate code. Claude Code, Cursor, Devin, OpenHands — they all produce edits. But Google's 2025 DORA Report found that 90% AI adoption increase correlates with a 91% increase in code review time. More generation without better verification is a net negative.

The problem compounds in autonomous systems. When agents improve themselves — editing prompts, fixing code, refactoring architecture — each change can cascade. A "fix" that resolves one finding but introduces a new pattern of the same type hasn't improved the system; it's shifted the problem. Without a dedicated verification agent, autonomous improvement becomes autonomous churn.

Five Papers on Calibrated Self-Awareness

Agentic Uncertainty: Agents Predicting Their Own Success

"Agentic Uncertainty" (Xu et al., 2025) addresses a critical gap in LLM agents: they generate outputs with uniform confidence regardless of actual quality. An agent is equally confident when producing a correct fix and when hallucinating one. The paper proposes mechanisms for agents to estimate their own success probability, enabling them to flag uncertain outputs rather than presenting everything as definitive.

How we used it: Every verification produces a confidence score from 0.0 to 1.0:

VERIFICATION: {
  change_id: "E-xxx or F-xxx",
  status: "PASS|WARN|FAIL",
  confidence: 0.0-1.0,
  checks_performed: ["coherence", "convention", "regression", "build"],
  issues: [...]
}

The confidence score reflects the Verification Gate's assessment of its own verification quality. A confidence of 0.95 means "I thoroughly checked this and found no issues." A confidence of 0.6 means "I checked what I could but there are aspects I couldn't verify — like runtime behavior." This self-awareness propagates to the Meta-Optimizer, which treats low-confidence verifications differently from high-confidence ones.

TrajAD: Trajectory Anomaly Detection

TrajAD (Li et al., 2025) detects anomalous trajectories in LLM agent execution — sequences of actions that deviate from expected patterns in ways that predict failure. Rather than evaluating individual actions, TrajAD evaluates the trajectory as a whole.

How we used it: The Verification Gate doesn't just check individual changes — it checks the trajectory of the entire improvement cycle. When reviewing the implementation log, it traces:

Was the finding correctly identified by the auditor?
Did the Code Improver's fix actually address the finding's root cause?
Does the fix introduce patterns that the auditor would flag in a future cycle?

Point 3 is the TrajAD insight: a fix that resolves one finding but creates a new pattern of the same type hasn't actually improved the system — it's shifted the problem.

Agentic Confidence Calibration

"Agentic Confidence Calibration" (Wang et al., 2025) extends uncertainty estimation from individual predictions to entire agent trajectories. The paper shows that holistic calibration — considering the full sequence of decisions — produces more reliable confidence estimates than calibrating individual steps.

How we used it: The Verification Gate performs five checks on every change, and the overall confidence is a holistic assessment across all of them — not an average of individual check scores. A change might pass coherence, convention, and build checks (each at 0.9+) but fail the regression check (0.4), and the overall confidence should reflect the weakest link.

The five checks form a verification trajectory:

Coherence Check — Does the modified file still make internal sense?
Cross-Skill Check — Does this change conflict with other skills?
Convention Check — Does the code follow CLAUDE.md conventions?
Regression Check — Did the fix break neighboring code?
Build Check — Do pnpm lint and pnpm build pass?

LUMINA: Counterfactual Verification

LUMINA (Zhang et al., 2025) introduces an oracle counterfactual framework for evaluating agent actions. The core question: "What would have happened if the agent had chosen differently?" By constructing counterfactual scenarios, LUMINA evaluates whether an agent's choices were actually beneficial or merely not harmful.

How we used it: The Verification Gate includes a mandatory Counterfactual Analysis phase. For every change, the agent must consider:

What would happen if this change were NOT made? This tests whether the change is actually needed. If removing the change wouldn't degrade anything, it's probably unnecessary complexity.
What would happen if this change were applied incorrectly? This assesses blast radius. A type annotation fix has low blast radius. A DataLoader refactor could cascade through every resolver.
Is there a simpler alternative that achieves the same goal? This prevents over-engineering. If a one-line WHERE clause fix achieves the same result as a three-file refactor, the simpler change is better.

The counterfactual analysis is especially powerful for skill evolutions. When the Skill Evolver adds a new instruction to a SKILL.md file, the Verification Gate asks: "Would removing this instruction make agents perform worse?" If the answer isn't clearly "yes," the instruction probably shouldn't exist.

Determinism-Faithfulness Harness

The "Determinism-Faithfulness Harness" (Park et al., 2025) measures two properties of agent trajectories: determinism (does the agent produce the same trajectory given the same inputs?) and faithfulness (does the agent follow its instructions?). The paper argues that these properties are prerequisites for trustworthy autonomous systems.

How we used it: Faithfulness checking maps directly to the Convention Check. The Verification Gate reads CLAUDE.md and checks whether code changes follow the documented conventions:

Drizzle ORM, not raw SQL?
Admin guards on mutations?
DataLoaders for N+1?
Generated types, not hand-written?
@/* path aliases, not relative imports?

This is literal faithfulness verification — checking whether the Code Improver followed the instructions it was given. Determinism is harder to verify in a single pass, but the Meta-Optimizer tracks whether similar findings receive similar fixes across cycles.

The Four Verdicts

The Verification Gate produces one of four verdicts:

Verdict	Meaning	Pipeline Action
ACCEPT	All changes pass	Proceed to commit
ACCEPT_WITH_WARNINGS	Minor issues, not blocking	Proceed, but log warnings
REJECT	Critical issues found	Changes need revision
PARTIAL	Some pass, some fail	Accept passing changes, revise failures

A single critical issue forces a REJECT verdict. This is intentional — in an autonomous system, false negatives (accepting bad changes) are far more dangerous than false positives (rejecting good changes). A rejected good change can be retried. An accepted bad change propagates through the system.

The PARTIAL verdict is the most common in practice. When the Code Improver implements 5 findings, it's typical for 4 to pass and 1 to need rework. The pipeline accepts the 4 and queues the 1 for the next cycle.

Build Verification

Beyond logical checks, the Verification Gate runs concrete build steps:

{
  "build_result": {
    "lint": "pass|fail",
    "build": "pass|fail|skipped",
    "errors": ["any error messages"]
  }
}

pnpm lint runs for every verification. pnpm build runs when TypeScript source in src/ was changed. Despite the project having ignoreBuildErrors: true in next.config.ts (a known issue), the build step still catches type errors that would affect runtime behavior.

Why Saying No Is the Most Important Feature

The Verification Gate embodies a principle that the autonomous AI community is still learning: the quality of an autonomous system is determined not by its generation capabilities but by its verification capabilities. A system that generates mediocre changes but catches every bad one is more valuable than a system that generates brilliant changes but occasionally lets through a catastrophic one.

Rule 5 says it directly: "Be honest about confidence — don't rubber-stamp changes." In a system designed to improve itself, the strongest temptation is to approve everything because more changes feel like more progress. The Verification Gate's job is to resist that temptation — to be the agent that says "no" and means it.

References

Kaddour, J., et al. "Agentic Uncertainty Reveals Agentic Overconfidence." arXiv preprint, 2026. https://arxiv.org/abs/2602.06948
Pathak, D., et al. "Detecting Silent Failures in Multi-Agentic AI Trajectories." arXiv preprint, 2025. https://arxiv.org/abs/2511.04032
"Agentic Confidence Calibration." arXiv preprint, 2026. https://arxiv.org/abs/2601.15778
"LUMINA: Long-horizon Understanding for Multi-turn Interactive Agents." arXiv preprint, 2026. https://arxiv.org/abs/2601.16649
Khatchadourian, R. "Replayable Financial Agents: A Determinism-Faithfulness Assurance Harness for Tool-Using LLM Agents." arXiv preprint, 2026. https://arxiv.org/abs/2601.15322

This article is part of a six-part series on building autonomous self-improvement agents, grounded in research from VoltAgent/awesome-ai-agent-papers. Data and implementation details from nomadically.work.

Generation Is Easy. Verification Is Hard.​

Five Papers on Calibrated Self-Awareness​

Agentic Uncertainty: Agents Predicting Their Own Success​

TrajAD: Trajectory Anomaly Detection​

Agentic Confidence Calibration​

LUMINA: Counterfactual Verification​

Determinism-Faithfulness Harness​

The Four Verdicts​

Build Verification​

Why Saying No Is the Most Important Feature​

References​