4 posts tagged with "Evaluation"

Measuring AI agent and LLM quality with structured metrics, datasets, and automated scoring instead of vibes.

Detecting Agent Defects & Drift in Production

· 21 min read
Vadim Nicolai
Senior Software Engineer

Your production sales agent has not crashed. There are no error logs and no timeouts. Yet something is off. The agent still sounds fluent and still follows the script, but its trajectories have grown longer and its tool calls more repetitive. This is where agent defect detection and drift monitoring in production begin to matter, because agent defects are not classical code bugs. They are behavioral discrepancies between what the developer's control logic expects and what the model actually produces. The 2026 study "Characterizing Faults in Agentic AI: A Taxonomy of Types, Symptoms, and Root Causes" (arXiv:2603.06847) makes the scale concrete. It mined 13,602 issues from 40 repositories, sampled 385 faults, and validated its taxonomy with 145 developers.

Autonomy is the whole subject here. This article is the capstone of a series, The Autonomous Sales Fleet, that built one production system across ten installments, adding exactly one capability per article, each as one real graph. Each step climbs an autonomy ladder that runs from rep-assist up to self-directed plan→act→verify loops. Every rung of that ladder is a grant of trust, and every grant can decay. Defect and drift detection is the guardrail that makes autonomy durable rather than a one-time gift: it is the continuous check that an agent promoted up the ladder has not quietly slid back down it in production.

That durability is something a per-run pass/fail check can never deliver on its own. An agent that earns the right to act without a human in the loop only keeps that right if something watches for the slow degradation no single run reveals. The monitor in this article is that watcher: it reads finished traces, flags the wandering tool loops and drifted personas that keep an agent looking fluent while it stops doing its job, and routes the failures back to the human gate that granted the autonomy in the first place. Catch the defect per run, catch the drift across runs, and the fleet can hold its autonomy instead of silently forfeiting it.
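To make the per-run versus cross-run distinction concrete, here is a minimal Python sketch. It is not the monitor from the article: the trace shape (a list of `(tool, args)` tuples), the function names, and every threshold are assumptions chosen for illustration.

```python
from collections import Counter
from statistics import mean


def repeated_call_ratio(tool_calls):
    """Fraction of calls that repeat an earlier (tool, args) pair in the same run."""
    if not tool_calls:
        return 0.0
    counts = Counter(tool_calls)
    repeats = sum(n - 1 for n in counts.values())
    return repeats / len(tool_calls)


def flag_run(tool_calls, max_steps=12, max_repeat_ratio=0.3):
    """Per-run defect check. Thresholds are hypothetical, not taken from the article."""
    reasons = []
    if len(tool_calls) > max_steps:
        reasons.append("trajectory_too_long")
    if repeated_call_ratio(tool_calls) > max_repeat_ratio:
        reasons.append("repetitive_tool_calls")
    return reasons


def drifted(baseline_lengths, recent_lengths, tolerance=1.5):
    """Cross-run drift check: has the mean trajectory length grown past tolerance?"""
    return mean(recent_lengths) > tolerance * mean(baseline_lengths)


# Hypothetical traces: each call is a (tool_name, serialized_args) tuple.
healthy = [("search_crm", "acme"), ("draft_email", "acme"), ("send_email", "acme")]
looping = [("search_crm", "acme")] * 5 + [("draft_email", "acme")]

print(flag_run(healthy))   # no flags
print(flag_run(looping))   # flagged as repetitive
print(drifted([3, 4, 3], [6, 7, 8]))
```

Neither run above crashes or errors, which is the point: the looping run is caught only because something inspects the shape of the finished trace, and the drift check fires only when several runs are compared against a baseline.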

Agentic CLEAR: Automating Multi-Level Agent Evaluation — and the Autonomy Gate It Unlocks

· 17 min read
Vadim Nicolai
Senior Software Engineer

Every team running an agent fleet has the same blind spot. Observability platforms—MLflow, Langfuse, home-grown OpenTelemetry—capture execution traces beautifully. They show you what the agent did. They say almost nothing about whether it did it well. So a developer opens the trace viewer, scrolls through a few hundred spans, and tries to eyeball a systemic failure out of thousands of runs. The research alternative is worse: hand-built error taxonomies that take weeks to annotate and go stale the moment the agent changes. What both approaches lack is automated multi-level agent evaluation—judgment of the trajectory itself, not just a record of it.

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents, by Yehudai, Eden, and Shmueli-Scheuer (2026) at IBM Research, attacks exactly this gap. It is an open-source Python package—pip install clear-eval—that reads raw agent traces and produces data-driven evaluation at three levels of granularity, surfaces recurring failure patterns without a predefined taxonomy, and renders the whole thing in an interactive dashboard. It reports up to 0.890 AUC for predicting trajectory success in a fully reference-less setting. This post walks through what the paper actually does, then shows how I wired the same multi-level shape into a 45-graph production fleet as an "autonomy gate"—the component that turns a human approval interrupt into a machine one.
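For readers less familiar with the AUC figure, the sketch below shows what the metric measures: the probability that a randomly chosen successful trajectory receives a higher score than a randomly chosen failed one. It is an illustration in plain Python, not the `clear-eval` API, and the scores and labels are made up.

```python
def roc_auc(scores, labels):
    """Pairwise ROC AUC: P(score of a success > score of a failure), ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one success and one failure")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical judge scores for four trajectories; 1 = task succeeded, 0 = failed.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
print(roc_auc(scores, labels))  # 0.75: three of four success/failure pairs ranked correctly
```

An AUC of 0.5 means the scores rank successes no better than chance, and 1.0 means every success outranks every failure. Reference-less here means the scores are produced without a gold answer to compare against.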

Multi-Modal Evaluation for AI-Generated LEGO Parts: A Production DeepEval Pipeline

· 19 min read
Vadim Nicolai
Senior Software Engineer

Your AI pipeline generates a parts list for a LEGO castle MOC. It says you need 12x "Brick 2 x 4" in Light Bluish Gray, 8x "Arch 1 x 4" in Dark Tan, and 4x "Slope 45 2 x 1" in Sand Green. The text looks plausible. But does the part image next to "Arch 1 x 4" actually show an arch? Does the quantity make sense for a castle build? Would this list genuinely help someone source bricks for the build?

These are multi-modal evaluation questions — they span text accuracy, image-text coherence, and practical usefulness. Standard unit tests cannot answer them. This article walks through a production evaluation pipeline built with DeepEval that evaluates AI-generated LEGO parts lists across five axes, using image metrics that most teams haven't touched yet.

The system is real. It runs in Bricks, a LEGO MOC discovery platform built with Next.js 19, LangGraph, and Neon PostgreSQL. The evaluation judge is DeepSeek — not GPT-4o — because you don't need a frontier model to grade your outputs.

Synthetic Evaluation with DeepEval: A Production RAG Testing Framework

· 13 min read
Vadim Nicolai
Senior Software Engineer

Your RAG pipeline passes all 20 of your hand-written test questions. It retrieves the right context, generates grounded answers, and the demo looks great. Then it goes to production, and users start asking the 21st question: the one that exposes a retrieval gap, a hallucinated citation, or a context window that silently truncated the most relevant chunk. You had 20 tests for a knowledge base with 55 documents. Even if every question touched a different document, that is at most 20 of 55, roughly 36% coverage. The rest was untested surface area.

This guide shows how to close that gap. We walk through a production implementation that generates 330+ synthetic test cases from 55 AI engineering lessons, evaluates a LangGraph-based RAG pipeline across 10+ metrics, and runs hyperparameter sweeps to find optimal retrieval configurations — all automated with DeepEval and pytest.
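The hyperparameter sweep mentioned above can be pictured as a plain grid search over retrieval settings. The sketch below assumes two hypothetical parameters, `chunk_size` and `top_k`, and uses a stand-in scoring function; in a real pipeline `evaluate` would run the RAG system against the synthetic test cases and return an aggregate metric.

```python
from itertools import product


def sweep(grid, evaluate):
    """Score every combination in the grid and return (score, config) pairs, best first."""
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        results.append((evaluate(config), config))
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results


def toy_evaluate(config):
    """Stand-in for a real evaluation run; peaks at chunk_size=512, top_k=5 by construction."""
    return 1.0 - abs(config["chunk_size"] - 512) / 1024 - abs(config["top_k"] - 5) / 10


# Hypothetical parameter names and values, chosen only for illustration.
grid = {"chunk_size": [256, 512, 1024], "top_k": [3, 5, 8]}
ranked = sweep(grid, toy_evaluate)
best_score, best_config = ranked[0]
print(best_config, best_score)
```

Keeping the scoring function as an argument means the same loop works whether the score comes from one metric or a weighted blend of several.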