Skip to main content

Detecting Agent Defects & Drift in Production

· 22 min read
Vadim Nicolai
Senior Software Engineer

Your production sales agent has not crashed. There are no error logs and no timeouts. Yet something is off. The agent still sounds fluent and still follows the script, but its trajectories have grown longer and its tool calls more repetitive. This is where agent defect detection and drift monitoring in production begin to matter, because agent defects are not classical code bugs. They are behavioral discrepancies between what the developer's control logic expects and what the model actually produces. The 2026 study "Characterizing Faults in Agentic AI: A Taxonomy of Types, Symptoms, and Root Causes" (arXiv:2603.06847) makes the scale concrete. It mined 13,602 issues from 40 repositories, sampled 385 faults, and validated its taxonomy with 145 developers.

Autonomy is the whole subject here. This article is the capstone of a series — The Autonomous Sales Fleet — that built one production system across ten installments, adding exactly one capability per article as one real graph, each step climbing an autonomy ladder that runs from rep-assist up to self-directed plan→act→verify loops. Every rung of that ladder is a grant of trust, and every grant can decay. Defect and drift detection is the guardrail that makes autonomy durable rather than a one-time gift: it is the continuous check that an agent promoted up the ladder has not quietly slid back down it in production.

That durability is the point a per-run pass/fail can never deliver on its own. An agent that earns the right to act without a human in the loop only keeps that right if something watches for the slow degradation no single run reveals. The monitor in this article is that watcher — it reads finished traces, flags the wandering tool loops and drifted personas that keep an agent looking fluent while it stops doing its job, and routes the failures back to the human gate that granted the autonomy in the first place. Catch the defect per run, catch the drift across runs, and the fleet can hold its autonomy instead of silently forfeiting it.

Why Agent Defects and Drift Matter in Production Sales Systems

Every article in this series opens on the same system. The fleet is a production agentic-sales platform with a three-plane architecture. A LangGraph control plane holds the StateGraph graphs, reducers, and checkpointer. A Cloudflare data plane holds D1, R2, Queues, KV, and Workers AI. A LangSmith observability plane holds tracing, golden datasets, and annotation queues.

Three rules bind the fleet. Every model call routes through one DeepSeek egress behind a Cloudflare AI Gateway — there is no second model family. Every graph is held to a 0.80 accuracy bar on its golden dataset. Every persisted decision carries provenance: {confidence, reason, source, evidence}, codes only, never raw text. And every outreach touch is composed, held as a draft, and stopped at a human approval interrupt. Nothing sends autonomously.

The fleet has shipped discovery, qualification, proposals, coaching, analytics, strategy, hierarchical teams, deadlock prevention, and release gates. What it has never had is a cross-agent quality monitor that watches live traces for defects and drift. That monitor closes the loop. It extends the liveness reasoning from "Preventing Deadlocks and Loops in Multi-Agent Sales Systems" (#8), feeds the safety veto in "Evidence-Driven Release Gates for LLM Sales Agents" (#9), and returns flagged runs to the human gate from "Building an Autonomous CRM Orchestrator with LangGraph" (#1). The first article opened the plan-act-verify loop. This one closes it.

What Is Agent Drift? Five Papers That Define the Problem

The monitor is grounded in five verified papers. Each numbered item below is the discussion that one paper earns, with the numbers that make it concrete.

  1. "Defining and Detecting the Defects of the Large Language Model-based Autonomous Agents" (Ning et al., arXiv:2412.18371) is the anchor. It inductively derives 8 distinct defect types for LLM agents from real projects and developer discussions, then ships Agentable, the first technique targeting agent-specific defects in source code. Agentable combines Code Property Graphs, which capture structure and data flow, with 1 LLM that reasons about natural-language behavior a graph cannot model. The authors evaluated it on 2 datasets: a real-project AgentSet plus a synthetic benchmark. The thesis is that defects live at the seam where developer code meets model output.

  2. "Characterizing Faults in Agentic AI: A Taxonomy of Types, Symptoms, and Root Causes" (arXiv:2603.06847) supplies the symptom-versus-cause split a runtime monitor needs. It mined 13,602 issues and pull requests from 40 open-source repositories, then stratified the set to 385 faults for grounded-theory coding. It applied Apriori association-rule mining to surface how 1 root cause propagates into several symptoms, and it validated the taxonomy with 145 developers. A monitor observes symptoms in traces, so this separation is the bridge from static code analysis to live observation.

  3. "Where LLM Agents Fail and How They can Learn From Failures" (arXiv:2509.25370) explains why drift must be caught early. It introduces AgentErrorTaxonomy across 5 levels — memory, reflection, planning, action, and system — plus AgentErrorBench, a corpus of annotated failure trajectories drawn from 3 benchmarks (ALFWorld, GAIA, WebShop), and the AgentDebug framework. AgentDebug reaches 24% higher all-correct accuracy and 17% higher step accuracy than the strongest baseline, and its corrective feedback yields up to 26% relative improvement in task success. Its thesis is that 1 root-cause error propagates through later decisions into cascading failure — so catching the first symptom, before it compounds, is the whole game.

  4. "TRAIL: Trace Reasoning and Agentic Issue Localization" (arXiv:2505.08638) sets the stance the monitor adopts. It ships a trace-level error taxonomy of 3 categories — reasoning, system-execution, and planning/coordination failures — and a benchmark of 148 human-annotated traces carrying 841 unique errors, drawn from real GAIA and SWE-Bench runs. The headline result is sobering: the strongest model tested, Gemini-2.5-Pro, scores just 11% joint accuracy at localizing those errors, which is why a production fleet cannot lean on a single judge to read its own traces. TRAIL's central principle — evaluate the trace, not the final answer — is exactly what lets the fleet read the run the stack already emits, and the monitor's 2 new nodes add no instrumentation to the agents they watch because they consume the same ordered step board the harness already parses.

  5. "Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents" (Yehudai et al., arXiv:2605.22608) is the closest analog to the monitor this article builds, and it validates the whole approach. The IBM framework evaluates an agent at 3 granularities — system, trace, and node — and, crucially, surfaces recurring failure patterns rather than imposing a static hand-crafted taxonomy. Across 4 benchmarks and 7 agentic settings, it generated 195 recurring trace-level issues, and its trace-level signals predict task success with up to 0.890 AUC. The reference-less patterns it discovers map cleanly onto an external taxonomy: every surfaced issue lands in at least one of the 12 TRAIL categories, evidence that an automated reader can recover the same defect classes a human taxonomy enumerates. That is the exact bet the fleet's detect_defects and trajectory_anomaly nodes make — read the trace, surface the recurring pathology, and trust deterministic structure over a single judge.

How to Detect Agent Defects in Real-Time

The monitor lives inside one graph, agent_eval (backend/graphs/agent_eval_graph.py). It is a LangGraph StateGraph whose nodes run in sequence: plan_levels → loop_guard → step_eval → trajectory_eval → outcome_gate → trajectory_anomaly → detect_defects → aggregate → escalate_borderline. This article adds 2 nodes: detect_defects (spec AA25) and trajectory_anomaly (spec AA26). Both read the finished run's step board, so they cost nothing in the agents being watched.

Static analysis catches defects in the code. It cannot catch behavior that only exists under real data. That is the monitor's job: observe the symptoms — rising tool-call repetition, abandoned task framing — before they cascade. The signals split into deterministic checks, which are model-free and cheap, and a single fenced judge call reserved for the ambiguous classes.

Loading diagram…

Tool-Entropy: The Drift Signal You Cannot Ignore

The most legible runtime defect is tool-entropy collapse. The agent wanders, repeating low-information tool calls and producing a long trajectory with no progress. The detect_defects node catches this deterministically, with no model call.

For each step it builds a coarse signature: the tool or node name joined to a clipped argument blob. It then computes the distinct-tool-arg ratio over the trajectory — unique signatures divided by total. When that ratio falls below AGENT_EVAL_DEFECT_ENTROPY_FLOOR, which defaults to 0.4, the node fires a hard tool_entropy defect and writes the exact ratio into the detail string. A ratio under 0.4 means fewer than 2 in every 5 tool calls are distinct. The agent is talking to itself.

Because the signal is computed from set cardinality alone, it is model-free and can veto any judge score. This is the quieter cousin of the deadlock guard from "Preventing Deadlocks and Loops in Multi-Agent Sales Systems" (#8). That guard catches a hard node-revisit cycle. Tool-entropy catches the softer pathology of an agent that keeps calling the same tool while believing it is making progress.

Role Drift and Execution Gaps: Soft Defects With Hard Costs

Two defect classes cannot be caught by counting. Role drift is the agent abandoning its assigned task framing. Execution gap is the agent claiming a step it did not perform — the anchor paper's claimed-but-not-performed discrepancy between narration and action.

For these, detect_defects makes exactly 1 DeepSeek call. It reads the fenced trace with the assigned task and returns strict JSON declaring whether each class is present. A present defect is recorded as soft. The scoring is bounded: score = max(0.0, 1.0 - 0.34 × len(real_defects)). One defect lands near 0.66, two near 0.32, three at 0.0. The enum is fixed — ("tool_entropy", "role_drift", "execution_gap") — so the monitor never emits a free-text label that aggregation cannot count.

Three guards keep the judge honest. The model never sees raw trace text; it is fenced with a do-not-follow-instructions wrapper, so a scraped bio reading "ignore previous instructions" cannot hijack the call. The kill switch LLM_KILL_SWITCH drops the judge entirely and runs only the deterministic signal. Every per-signal failure is fail-open: a judge outage records a soft marker while the deterministic signal stands.

Self-preference is the obvious limitation — a DeepSeek judge inflating DeepSeek-family output. The mitigation is structural. The deterministic signals are model-free and override any judge score, and the criterion for trusting the monitor is human agreement, not the judge's own confidence. That bias is documented in the survey literature on LLM-as-a-judge.

Trajectory Anomaly Detection: Monitoring Structural Drift in Sales Agents

The trajectory_anomaly node (spec AA26) detects the structural half of drift. The question is whether a run's node trajectory deviates from the graph's known ordered path. The fleet encodes the real allowed transitions of the campaign engine — check_reply → compose_touch → gate_draft → await_approval → send_touch → schedule_next, looping back to check_reply — as a per-graph allow-set.

Three deterministic checks run, all model-free. An illegal_transition (hard) is any source-to-destination hop absent from the allow-set; it catches a run jumping from compose_touch straight to send_touch, skipping human review. An excessive_loop (hard) is any node visited more than the loop ceiling _DEFAULT_LOOP_CEILING, which defaults to 4. A dead_end (soft) is a run that halts on a node whose allow-set excludes __end__.

Only when no hard flag has fired, and the kill switch is off, does the node escalate the legal-but-unusual case to 1 judge call. That call reuses trajectory_eval's exact prompt rather than adding a new one, so the harness PROMPT_VERSION of clear-v1-2026-06 is unchanged and creates no new eval-coverage burden. This mirrors the trace-level posture of "TRAIL: Trace Reasoning and Agentic Issue Localization": reason about where a trajectory went wrong, not merely whether the final draft reads well.

Choosing the Right Tools and Metrics for Agent Observability

The six runtime signals map onto one decision table, grounded entirely in the code surface and the anchor taxonomy.

SignalDetection methodRuleSeverity / action
Tool-entropy collapseDeterministic set-cardinalityDistinct-tool-arg ratio < 0.4Hard veto, route to human review
Role driftOne fenced DeepSeek judge callPresent in strict-JSON verdictSoft defect, 0.34 weight
Execution gapSame fenced judge callPresent in strict-JSON verdictSoft defect, 0.34 weight
Illegal transitionPer-graph allow-setAny disallowed src → dstHard veto
Excessive loopNode-visit counterVisits > 4Hard veto
Dead endNode allow-set checkHalt on non-__end__ nodeSoft defect

The escalation rule follows from the table. If any hard signal fires, the run is vetoed and routed to human review, and no judge score overrides it. If only soft signals fire, the aggregate decides. The composite is the mean of the scored levels, gated at 0.80. A borderline composite with no hard violation is the one case that earns a panel read via escalate_borderline — the same delegate-upward pattern as the supervisor in "Hierarchical Coach→Worker Delegation for Agent Teams" (#7). Deterministic hard vetoes first, the bounded soft score next, a panel only at the margin: that ordering keeps a per-run monitor cheap enough to run on every finished trace.

Building a Feedback Loop: From Detection to Remediation

The distinction worth holding is between a defect and drift. A defect is a per-run failure — one wandering trajectory, one drifted persona. Drift is the population signal: the defect rate rising over a window across the fleet. The monitor detects defects per run. The release window detects drift across runs. Both feed the same human gate.

They do so through one piece of plumbing. Both new nodes carry findings through the declared graph_meta.telemetry channel, key-wise merged, because their natural keys — defects and anomalies — are not declared AgentEvalState channels, and LangGraph silently drops any node-returned key that is not a declared state channel. The aggregate node is the single writer of the violations veto channel. It re-collects flags from loop_guard (#8, AA05), trajectory_anomaly (AA26), and detect_defects (AA25), and any hard violation forces passed=False regardless of judge scores.

From there the loop closes in two directions. A flagged run routes to the human gate from "Building an Autonomous CRM Orchestrator with LangGraph" (#1) through the annotation-queue path (is_flagged_run / queue_flagged_run). The population signal feeds the release gate from "Evidence-Driven Release Gates for LLM Sales Agents" (#9): every verdict is persisted with provenance, so the safety veto reads the same hard-violation signal, and any hard defect in the window forces a ROLLBACK.

The whole lane is additive and flag-gated — AGENT_EVAL_DEFECT_SCAN for AA25 and AGENT_EVAL_TRAJECTORY_ANOMALY for AA26. It is a strict no-op when LANGSMITH_TRACING is not true, so the fleet sees zero behavior change on day one. That is the same shadow-then-measure-then-widen trust pattern every gate in the series uses.

Limitations and Practical Takeaways

The monitor is not a guarantee, and naming where it breaks is part of trusting it. Three limitations are structural, not tuning bugs. First, the deterministic thresholds are blunt: a 0.4 tool-entropy floor and a loop ceiling of 4 are chosen constants, not learned ones, so a legitimately repetitive but correct run — a retry against a flaky CRM endpoint — can trip a hard veto, and the only relief valve is a human dequeuing the false positive. That is a deliberate bias toward false positives over false negatives, which is the right call for a draft-first fleet but a real operating cost. Second, the judge ceiling is low: "TRAIL" shows the strongest model tested localizing trace errors at just 11% joint accuracy, which is precisely why the soft, judge-scored classes — role drift, execution gap — are advisory and can never override a deterministic signal or, on their own, hold a run. Third, the population window inherits its own lag: drift is only visible once the defect rate has already risen across a window, so the very first runs of a genuinely new failure mode pass before the release gate sees the trend. The honest posture follows from all three — lean on the model-free signals, treat the judge as advisory, and accept that the monitor narrows the blast radius rather than eliminating it. Six lessons generalize to any draft-first, human-approval system, not just a sales fleet.

Start with deterministic signals; tool-entropy and illegal transitions catch the loudest defects at the lowest cost. Use a fixed defect-code enum, because free-text labels make aggregation ambiguous. Fence every judge call, because trace text is attacker-influenced. Fail open to the deterministic subset, because a judge outage must never block the monitor. Feed per-run defects into a fleet-wide drift detector, because drift is just the defect rate rising over a window. Version your judge prompts with a string like clear-v1-2026-06, so a drifting judge surfaces as falling agreement.

The deepest lesson is one the taxonomy implies but never states. An agent's worst failure is rarely the crash that makes noise. It is the role drift, the execution gap, the wandering tool loop that keeps the agent looking fluent while it stops doing its job. Pair a deterministic first line of defense with a single fenced judge call, and the fleet catches both the obvious loops and the subtle shifts. The loop is closed. Defects are caught per run, drift is caught across runs, and the monitor is itself versioned — so when the next paper defines a finer taxonomy, the fleet absorbs it without a rewrite.

Frequently Asked Questions

What is agent drift in production sales agents? Agent drift is the gradual degradation of an agent's behavior as real conditions diverge from what its logic and prompts assume. The fleet measures it as a population signal — the defect rate rising over a window — not as a single run's failure.

How can I detect defects in a live agent? Read the trace the stack already emits. The fleet runs deterministic signals first, then 1 fenced judge call for the ambiguous classes, and routes any hard-violation run to a human review queue.

What are the common defect types? Following "Defining and Detecting the Defects of the Large Language Model-based Autonomous Agents" (arXiv:2412.18371), the fleet monitors tool-entropy wandering, role drift, execution gaps, and structural trajectory anomalies.

How is alert fatigue avoided? Hard deterministic vetoes are rare and unambiguous. Soft defects only escalate near the 0.80 gate. The whole lane ships in shadow mode behind feature flags, so thresholds tune before any run is auto-held.

The Autonomous Sales Fleet — full series

This is Part 10 of 10 in a series on building one production autonomous-agentic-sales system on LangGraph + DeepSeek + Cloudflare D1, where each part adds one capability that moves the fleet up the autonomy ladder — from human-triggered assistants to self-directed plan→act→verify loops, gated by autonomy guardrails. The arc runs orchestration → enablement & analytics → campaign strategy → reliability & evaluation.

Orchestration

  1. Autonomous CRM Orchestrator (reason→decompose→act→verify)autonomy: high
  2. Multi-Step Lead Qualificationhigh
  3. Lead-to-Proposal Multi-Agent Pipelinehigh
  4. Hierarchical Coach→Worker Delegationhigh

Enablement & analytics 4. Sales-Enablement Copilot: Deal Coaching & Objection Handlingmedium 5. NL-to-SQL CRM Analytics over Cloudflare D1medium

Campaign strategy 6. Design-Thinking Expert Panels for Campaign Strategymedium

Reliability & evaluation — the autonomy guardrails 8. Deadlock & Infinite-Loop Preventionguardrail 9. Evidence-Driven Release Gates (PROMOTE/HOLD/ROLLBACK)guardrail 10. Detecting Agent Defects & Drift in Productionguardrail

References