What is agent drift in production sales agents?

Agent drift is the gradual degradation of an agent's behavior as real conditions diverge from what its logic and prompts assume. The fleet measures it as a population signal — the defect rate rising over a window — not as a single run's failure.
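To make the "population signal" idea concrete, here is a minimal sketch of a rolling defect-rate check. The class name, window size, baseline rate, and tolerance are all illustrative assumptions, not the fleet's actual values.

```python
from collections import deque


class DefectRateMonitor:
    """Tracks the defect rate over the most recent `window` runs.

    All parameters are hypothetical; tune them against real traffic.
    """

    def __init__(self, window: int, baseline_rate: float, tolerance: float):
        self.window = window
        self.baseline_rate = baseline_rate
        self.tolerance = tolerance
        # deque with maxlen drops the oldest run automatically
        self._flags = deque(maxlen=window)

    def record(self, run_is_defective: bool) -> None:
        self._flags.append(run_is_defective)

    def defect_rate(self) -> float:
        if not self._flags:
            return 0.0
        return sum(self._flags) / len(self._flags)

    def is_drifting(self) -> bool:
        # Judge only a full window, so a single early failure is not drift
        if len(self._flags) < self.window:
            return False
        return self.defect_rate() > self.baseline_rate + self.tolerance


monitor = DefectRateMonitor(window=10, baseline_rate=0.10, tolerance=0.10)
for defective in [False] * 9 + [True]:
    monitor.record(defective)
early = monitor.is_drifting()  # one defect in ten runs stays within tolerance

for defective in [True, True, True]:
    monitor.record(defective)
late = monitor.is_drifting()  # four defects in the last ten runs
```

A single failed run never trips the check on its own; only a sustained rise across the window does, which is the distinction the answer above draws.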

How can I detect defects in a live agent?

Read the trace the stack already emits. The fleet runs deterministic signals first, then makes one fenced judge call for the ambiguous classes, and routes any hard-violation run to a human review queue.
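The ordering described above (cheap deterministic checks, then at most one judge call, then routing) could be sketched as follows. The allow-list, the repeated-call rule, the 0.5 judge cut-off, and the route names are hypothetical, and the judge is an injected callable, so this sketch says nothing about how the real judge prompt is fenced.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical allow-list, for illustration only
ALLOWED_TOOLS = {"crm_lookup", "draft_email", "schedule_call"}


@dataclass
class Verdict:
    hard_violation: bool = False
    ambiguous: bool = False
    reasons: list = field(default_factory=list)
    judge_score: Optional[float] = None
    route: str = "pass"


def deterministic_signals(trace: list) -> Verdict:
    """Cheap rule-based checks that need no model call."""
    verdict = Verdict()
    tools = [step["tool"] for step in trace]
    if any(tool not in ALLOWED_TOOLS for tool in tools):
        verdict.hard_violation = True
        verdict.reasons.append("tool outside allow-list")
    # Assumed soft signal: the same tool called three times in a row
    for a, b, c in zip(tools, tools[1:], tools[2:]):
        if a == b == c:
            verdict.ambiguous = True
            verdict.reasons.append("repeated tool call")
            break
    return verdict


def evaluate_run(trace: list, judge: Callable[[list], float]) -> Verdict:
    verdict = deterministic_signals(trace)
    if verdict.hard_violation:
        verdict.route = "human_review"  # no judge call needed
        return verdict
    if verdict.ambiguous:
        verdict.judge_score = judge(trace)  # the single judge call
        # Assumed cut-off: scores below 0.5 are flagged
        verdict.route = "flag" if verdict.judge_score < 0.5 else "pass"
    return verdict


judge_calls = []


def stub_judge(trace: list) -> float:
    """Stand-in for a model-backed judge; records each invocation."""
    judge_calls.append(len(trace))
    return 0.2


hard = evaluate_run([{"tool": "wire_funds"}], stub_judge)
soft = evaluate_run([{"tool": "crm_lookup"}] * 3, stub_judge)
clean = evaluate_run([{"tool": "crm_lookup"}, {"tool": "draft_email"}], stub_judge)
```

Only the ambiguous run reaches the judge, so the model cost scales with the number of unclear traces rather than with total traffic.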

What are the common defect types?

Following "Defining and Detecting the Defects of the Large Language Model-based Autonomous Agents" (arXiv:2412.18371), the fleet monitors tool-entropy wandering, role drift, execution gaps, and structural trajectory anomalies.
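One plausible way to put a number on tool-entropy wandering is the Shannon entropy of the tool-call distribution within a run. This is an assumption made for illustration; it is not taken from the cited paper and may differ from the fleet's own signal.

```python
import math
from collections import Counter


def tool_entropy(tool_calls: list) -> float:
    """Shannon entropy, in bits, of the tool-call distribution in one run.

    Assumed reading of the signal: a run that spreads its calls thinly across
    many tools scores high, while a run that sticks to one tool scores zero.
    """
    if not tool_calls:
        return 0.0
    total = len(tool_calls)
    counts = Counter(tool_calls)
    # sum of p * log2(1/p) over the observed tools
    return sum((n / total) * math.log2(total / n) for n in counts.values())


focused = tool_entropy(["crm_lookup"] * 4)
scattered = tool_entropy(["crm_lookup", "draft_email", "schedule_call", "web_search"])
```

A raw entropy value means little in isolation. It becomes useful when compared against the typical range for the same task type, which is why it pairs naturally with a windowed baseline.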

How is alert fatigue avoided?

Hard deterministic vetoes are rare and unambiguous. Soft defects only escalate near the 0.80 gate. The whole lane ships in shadow mode behind feature flags, so thresholds tune before any run is auto-held.
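A minimal sketch of that escalation policy, under stated assumptions: the 0.80 gate comes from the text, but the "near" margin of 0.05, the action names, and the shape of the shadow-mode flag are hypothetical.

```python
from dataclasses import dataclass

GATE = 0.80         # gate value taken from the text
NEAR_MARGIN = 0.05  # hypothetical definition of "near the gate"


@dataclass(frozen=True)
class Decision:
    action: str     # "pass", "escalate", or "hold"
    enforced: bool  # False while the lane runs in shadow mode


def decide(score: float, hard_veto: bool, soft_defect: bool, shadow_mode: bool) -> Decision:
    if hard_veto:
        action = "hold"  # hard vetoes apply regardless of score
    elif soft_defect and abs(score - GATE) <= NEAR_MARGIN:
        action = "escalate"  # soft defects matter only close to the gate
    else:
        action = "pass"
    # In shadow mode the decision is recorded but never acted on
    enforced = action != "pass" and not shadow_mode
    return Decision(action, enforced)


near_gate_shadow = decide(0.82, hard_veto=False, soft_defect=True, shadow_mode=True)
far_from_gate = decide(0.95, hard_veto=False, soft_defect=True, shadow_mode=False)
veto_live = decide(0.50, hard_veto=True, soft_defect=False, shadow_mode=False)
```

Comparing the logged shadow decisions with what reviewers would have done is one way to tune the margin before any run is actually held.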

9 posts tagged with "Evaluation"

Measuring AI agent and LLM quality with structured metrics, datasets, and automated scoring instead of vibes.

The Four-Component Feedback Loop That Turns a Static Agent Into a Search Problem

July 13, 2026 · 18 min read

Vadim Nicolai

Senior Software Engineer

Most AI agents you deploy today are frozen the moment they go live. You handcraft the prompts, select the tools, wire up the memory, and hope the configuration survives contact with real users. It doesn't. Tasks drift, APIs change, user intents shift – and your agent silently degrades. The conventional fix is another round of manual reconfiguration. But there's a more principled path: treat agent design not as a one-time assembly but as a continuous search problem.

Evolving the Reasoner: How Agents Learn to Optimise Their Own Behaviour and Prompts

July 13, 2026 · 19 min read

Vadim Nicolai

Senior Software Engineer

Most self-evolving agent demonstrations—those that appear to learn by picking better tools or adjusting dialogue style—avoid modifying the core reasoning engine. Evolving the reasoner itself—the chain-of-thought architecture, the internal planning logic, the very way an agent thinks—is the hard, brittle, data-starved problem that separates parlor tricks from genuine lifelong adaptation.

Evolving the Substrate: Optimising What an Agent Remembers and Which Tools It Can Wield

July 13, 2026 · 13 min read

Vadim Nicolai

Senior Software Engineer

Most teams building self-evolving agents obsess over prompt engineering or fine-tuning the LLM. They miss the bigger lever: the substrate, meaning what the agent remembers and which tools it wields. A prompt is ephemeral; memory and tools are structural. Evolving the substrate yields compounding returns that no amount of prompt tweaking can match. The Fang et al. (2025) survey of self-evolving agents confirms this: the components that persist across sessions, memory and tools, define the agent's operational range far more than any instruction string. In this third part of the series, I'll lay out why memory and tool optimisation are the neglected backbone of lifelong agent systems, back every claim with data from the literature, and give you a decision framework you can implement today.

You Cannot Benchmark a System That Rewrites Itself

July 13, 2026 · 14 min read

Vadim Nicolai

Senior Software Engineer

The moment an agent can rewrite its own code, evaluation ceases to measure and starts to train.

Evolving the Team: Multi-Agent Topologies That Rewrite Themselves

July 13, 2026 · 14 min read

Vadim Nicolai

Senior Software Engineer

Here’s the uncomfortable truth the hype cycle doesn’t want you to hear: a single, well-prompted model often beats an entire team of specialised agents on standard reasoning benchmarks. Pan et al. (2025a) demonstrated that single large LLMs with carefully crafted prompts can match the performance of complex multi-agent discussion frameworks across multiple reasoning tasks (arXiv:2508.07407). Jwalapuram et al. (2026) push the finding further: a single-agent GPT-5 instance using chain-of-thought with self-consistency “reliably outperforms the most sophisticated GPT-4o-based MAS frameworks (e.g., ADAS or AFlow) while consuming less than half the total tokens,” and automatically generated multi-agent systems “consistently underperform CoT-SC despite being up to 10x more expensive” (arXiv:2606.13003). If you’re building an agent system and your first instinct is “let’s spin up three agents and make them debate,” you might just be burning tokens for no gain.
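For readers unfamiliar with the single-agent baseline in that comparison, chain-of-thought with self-consistency (CoT-SC) amounts to sampling several reasoning paths and taking a majority vote over the final answers. The sketch below shows only the voting step. The `sample` callable is a hypothetical stand-in for a model call and is stubbed with canned answers here.

```python
from collections import Counter
from typing import Callable


def self_consistent_answer(sample: Callable[[str], str], question: str, n: int = 5) -> str:
    """Draw n sampled answers and return the most frequent one."""
    answers = [sample(question) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]


# Stub sampler: canned final answers in place of real model calls
canned = iter(["42", "41", "42", "42", "40"])


def stub_sample(question: str) -> str:
    return next(canned)


voted = self_consistent_answer(stub_sample, "What is 6 * 7?", n=5)
```

The whole technique is one model queried several times, which is why it makes a cheap baseline to run before reaching for a multi-agent topology.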

Detecting Agent Defects & Drift in Production

June 15, 2026 · 21 min read

Vadim Nicolai

Senior Software Engineer

Your production sales agent has not crashed. There are no error logs and no timeouts. Yet something is off. The agent still sounds fluent and still follows the script, but its trajectories have grown longer and its tool calls more repetitive. This is where agent defect detection and drift monitoring in production begin to matter, because agent defects are not classical code bugs. They are behavioral discrepancies between what the developer's control logic expects and what the model actually produces. The 2026 study "Characterizing Faults in Agentic AI: A Taxonomy of Types, Symptoms, and Root Causes" (arXiv:2603.06847) makes the scale concrete. It mined 13,602 issues from 40 repositories, sampled 385 faults, and validated its taxonomy with 145 developers.

Autonomy is the whole subject here. This article is the capstone of a series — The Autonomous Sales Fleet — that built one production system across ten installments, adding exactly one capability per article as one real graph, each step climbing an autonomy ladder that runs from rep-assist up to self-directed plan→act→verify loops. Every rung of that ladder is a grant of trust, and every grant can decay. Defect and drift detection is the guardrail that makes autonomy durable rather than a one-time gift: it is the continuous check that an agent promoted up the ladder has not quietly slid back down it in production.

That durability is the point a per-run pass/fail can never deliver on its own. An agent that earns the right to act without a human in the loop only keeps that right if something watches for the slow degradation no single run reveals. The monitor in this article is that watcher — it reads finished traces, flags the wandering tool loops and drifted personas that keep an agent looking fluent while it stops doing its job, and routes the failures back to the human gate that granted the autonomy in the first place. Catch the defect per run, catch the drift across runs, and the fleet can hold its autonomy instead of silently forfeiting it.

Agentic CLEAR: Automating Multi-Level Agent Evaluation — and the Autonomy Gate It Unlocks

June 11, 2026 · 17 min read

Vadim Nicolai

Senior Software Engineer

Every team running an agent fleet has the same blind spot. Observability platforms—MLflow, Langfuse, home-grown OpenTelemetry—capture execution traces beautifully. They show you what the agent did. They say almost nothing about whether it did it well. So a developer opens the trace viewer, scrolls through a few hundred spans, and tries to eyeball a systemic failure out of thousands of runs. The research alternative is worse: hand-built error taxonomies that take weeks to annotate and go stale the moment the agent changes. What both approaches lack is automated multi-level agent evaluation—judgment of the trajectory itself, not just a record of it.

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents, by Yehudai, Eden, and Shmueli-Scheuer (2026) at IBM Research, attacks exactly this gap. It is an open-source Python package—pip install clear-eval—that reads raw agent traces and produces data-driven evaluation at three levels of granularity, surfaces recurring failure patterns without a predefined taxonomy, and renders the whole thing in an interactive dashboard. It reports up to 0.890 AUC for predicting trajectory success in a fully reference-less setting. This post walks through what the paper actually does, then shows how I wired the same multi-level shape into a 45-graph production fleet as an "autonomy gate"—the component that turns a human approval interrupt into a machine one.
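As a reminder of what an AUC figure like that measures: it is the probability that a randomly chosen successful trajectory receives a higher predicted score than a randomly chosen failed one. The pairwise sketch below illustrates the metric only. It is not how the clear-eval package computes it, and the labels and scores are made up.

```python
def roc_auc(labels: list, scores: list) -> float:
    """Pairwise ROC AUC: the share of (success, failure) pairs ranked correctly.

    Ties count as half a win. Labels are 1 for success and 0 for failure.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one success and one failure")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: two successful and two failed trajectories
example_auc = roc_auc(labels=[1, 1, 0, 0], scores=[0.9, 0.4, 0.6, 0.1])
```

The quadratic loop is fine for a few hundred trajectories. Larger evaluations usually rely on a rank-based implementation from a statistics library.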

Multi-Modal Evaluation for AI-Generated LEGO Parts: A Production DeepEval Pipeline

March 23, 2026 · 19 min read

Vadim Nicolai

Senior Software Engineer

Your AI pipeline generates a parts list for a LEGO castle MOC. It says you need 12x "Brick 2 x 4" in Light Bluish Gray, 8x "Arch 1 x 4" in Dark Tan, and 4x "Slope 45 2 x 1" in Sand Green. The text looks plausible. But does the part image next to "Arch 1 x 4" actually show an arch? Does the quantity make sense for a castle build? Would this list genuinely help someone source bricks for the build?

These are multi-modal evaluation questions — they span text accuracy, image-text coherence, and practical usefulness. Standard unit tests cannot answer them. This article walks through a production evaluation pipeline built with DeepEval that evaluates AI-generated LEGO parts lists across five axes, using image metrics that most teams haven't touched yet.

The system is real. It runs in Bricks, a LEGO MOC discovery platform built with Next.js 19, LangGraph, and Neon PostgreSQL. The evaluation judge is DeepSeek — not GPT-4o — because you don't need a frontier model to grade your outputs.

Synthetic Evaluation with DeepEval: A Production RAG Testing Framework

March 23, 2026 · 13 min read

Vadim Nicolai

Senior Software Engineer

Your RAG pipeline passes all 20 of your hand-written test questions. It retrieves the right context, generates grounded answers, and the demo looks great. Then it goes to production, and users start asking the 21st question — the one that exposes a retrieval gap, a hallucinated citation, or a context window that silently truncated the most relevant chunk. You had 20 tests for a knowledge base with 55 documents. That's 0.4% coverage. The other 99.6% was untested surface area.

This guide shows how to close that gap. We walk through a production implementation that generates 330+ synthetic test cases from 55 AI engineering lessons, evaluates a LangGraph-based RAG pipeline across 10+ metrics, and runs hyperparameter sweeps to find optimal retrieval configurations — all automated with DeepEval and pytest.