Skip to main content

4 posts tagged with "Observability"

Tracing, monitoring, and debugging AI agents in production — spans, drift detection, and end-to-end visibility.

View All Tags

Detecting Agent Defects & Drift in Production

· 21 min read
Vadim Nicolai
Senior Software Engineer

Your production sales agent has not crashed. There are no error logs and no timeouts. Yet something is off. The agent still sounds fluent and still follows the script, but its trajectories have grown longer and its tool calls more repetitive. This is where agent defect detection and drift monitoring in production begin to matter, because agent defects are not classical code bugs. They are behavioral discrepancies between what the developer's control logic expects and what the model actually produces. The 2026 study "Characterizing Faults in Agentic AI: A Taxonomy of Types, Symptoms, and Root Causes" (arXiv:2603.06847) makes the scale concrete. It mined 13,602 issues from 40 repositories, sampled 385 faults, and validated its taxonomy with 145 developers.

Autonomy is the whole subject here. This article is the capstone of a series — The Autonomous Sales Fleet — that built one production system across ten installments, adding exactly one capability per article as one real graph, each step climbing an autonomy ladder that runs from rep-assist up to self-directed plan→act→verify loops. Every rung of that ladder is a grant of trust, and every grant can decay. Defect and drift detection is the guardrail that makes autonomy durable rather than a one-time gift: it is the continuous check that an agent promoted up the ladder has not quietly slid back down it in production.

That durability is the point a per-run pass/fail can never deliver on its own. An agent that earns the right to act without a human in the loop only keeps that right if something watches for the slow degradation no single run reveals. The monitor in this article is that watcher — it reads finished traces, flags the wandering tool loops and drifted personas that keep an agent looking fluent while it stops doing its job, and routes the failures back to the human gate that granted the autonomy in the first place. Catch the defect per run, catch the drift across runs, and the fleet can hold its autonomy instead of silently forfeiting it.

LangGraph v3 Event Streaming: Typed Projections Over a Content-Block Protocol

· 13 min read
Vadim Nicolai
Senior Software Engineer

Streaming an LLM to a user is easy. Consuming the stream on the server — token deltas, reasoning deltas, tool-call chunks, per-node state, subgraph events, usage metadata — is the part that turns into a pile of if chunk["type"] == ... branches. I shipped a streaming endpoint last week on LangGraph version="v2", because that is what's installed (1.1.8 locally, 1.2.4 on the server). The hand-rolled consumer was about twenty lines of fragile branching, a keepalive hack to stop a proxy from dropping the connection during DeepSeek's silent reasoning phase, and a manual accumulator that reset whenever langgraph_node changed.

LangGraph's version="v3" event-streaming API is what I'd reach for next, and the diff is the interesting part: it deletes most of that parsing. Instead of one undifferentiated event firehose you branch on, v3 gives you typed, per-channel projections you iterate independently, built on a content-block protocol that makes text, reasoning, tool-call, and multimodal boundaries explicit. v1 and v2 are unchanged. This is a walk through what v3 actually is, what it removes from your code, and where it still leaves work for you.

Observable AI Memory: mem0, LangGraph, and Qdrant with Enterprise-Grade Telemetry

· 13 min read
Vadim Nicolai
Senior Software Engineer

Most "AI memory" demos stop at memory.add() and memory.search(). That works on a laptop. It does not survive contact with production. The real questions are: When this recall is slow, which store is to blame? When a graph's spend triples overnight, which feature caused it? When a customer asks "what did your agent remember about me, and when?", can you answer from an audit log instead of a shrug?

TL;DR — This field report shows how to build an agent memory layer where every operation honors a contract: fail-open, PII-safe, and fully instrumented. Three stores (mem0, Qdrant, LangGraph) are funneled through single chokepoints, and each chokepoint fans out to five telemetry sinks. The result is a stack that answers the hard production questions without guesswork.

BMAD Method + Langfuse + Claude Code Agent Teams in Production

· 16 min read
Vadim Nicolai
Senior Software Engineer

Running AI agents in a real codebase means solving three intertwined problems at once: planning and quality gates (so agents don't drift), observability (so you know what's working), and orchestration (so multiple agents divide work without clobbering each other). In nomadically.work — a remote EU job board with an AI classification and skill-extraction pipeline — these problems are solved by three complementary systems: BMAD v6, Langfuse, and Claude Code Agent Teams. This article explains how each works and how they compose.