What is agent trajectory observability?

Agent trajectory observability is the practice of capturing, tracing, and analyzing the complete sequence of steps an AI agent takes to complete a task, including tool calls, LLM reasoning steps, intermediate outputs, and error states. It answers how an agent reached its answer, not just what it answered.

How is agent trajectory observability different from traditional observability?

Traditional observability focuses on system-level metrics, logs, and traces for monolithic or microservice applications, tracking request latency, error rates, and CPU usage. Agent trajectory observability instead tracks LLM reasoning chains, tool selection decisions, multi-step orchestration paths, and semantic accuracy of outputs — answering why the agent made a decision at the reasoning level.

How does a trajectory judge score an agent without credentials?

Two deterministic checks run with zero credentials: a redundancy check that flags the same tool called twice with the same query, corpus, and filters, and an appropriate-abstention check that rewards saying 'insufficient grounding' when all retrievals came back empty rather than hallucinating. An optional LLM leg refines the 0–1 score but never overrides the deterministic gates.

Why upload traces to Langfuse over raw REST instead of an SDK?

The uploader batches trace lines into Langfuse's public REST ingestion endpoint using stdlib urllib, with no SDK dependency, so the same REST shape mirrors across the Python and TypeScript codebases and both no-op cleanly when keys are unset. A per-directory ledger tracks uploaded lines, making reruns idempotent and partial uploads resumable, and uploads never block the serving path.

When should you not use trajectory evaluation?

Do not apply trajectory evals to stateless QA agents that do not make sequential decisions — answer-level evals work fine there. Trajectory evals earn their cost exactly where agency lives: agents with two or more tool calls in the typical path, agents that spawn sub-agents, or agents with failures that cannot be reproduced from the same input due to non-deterministic tool selection.

What is agent drift in production sales agents?

Agent drift is the gradual degradation of an agent's behavior as real conditions diverge from what its logic and prompts assume. The fleet measures it as a population signal — the defect rate rising over a window — not as a single run's failure.

How can I detect defects in a live agent?

Read the trace the stack already emits. The fleet runs deterministic signals first, then 1 fenced judge call for the ambiguous classes, and routes any hard-violation run to a human review queue.

What are the common defect types?

Following "Defining and Detecting the Defects of the Large Language Model-based Autonomous Agents" (arXiv:2412.18371), the fleet monitors tool-entropy wandering, role drift, execution gaps, and structural trajectory anomalies.

How is alert fatigue avoided?

Hard deterministic vetoes are rare and unambiguous. Soft defects only escalate near the 0.80 gate. The whole lane ships in shadow mode behind feature flags, so thresholds tune before any run is auto-held.

5 posts tagged with "Observability"

Tracing, monitoring, and debugging AI agents in production — spans, drift detection, and end-to-end visibility.

View All Tags

Agent Trajectory Observability: Judge the Path, Not Just the Answer

July 6, 2026 · 14 min read

Vadim Nicolai

Senior Software Engineer

Two agents answer the same user query. Both return the identical string—correct, well-formatted, cited. An answer-level eval gives them both a perfect score, identical down to the decimal.

One agent made three redundant retrieval calls (same tool, same query, same corpus) before stumbling on the right source. The other called exactly the right tool once and answered. The answer-level eval cannot tell the difference. It never could.

The keys are in the trajectory.

I built a trajectory observability lane for my agents in three small pieces: the JSONL traces every workflow already emits but nobody reads, a judge that scores the tool-call sequence instead of the answer, and a Langfuse uploader written against the raw REST ingestion API—no SDK. Publication volumes indicate this is the moment: agent-observability research jumped sharply into 2026 (the phrase barely existed before), and the first dedicated fault-detection benchmark for agent observability was published this week.

This post is the full walkthrough: what trajectory observability is, why answer-level evals miss half the story, the three-module build, and how the research on partial observability validates the approach.

Detecting Agent Defects & Drift in Production

June 15, 2026 · 21 min read

Vadim Nicolai

Senior Software Engineer

Your production sales agent has not crashed. There are no error logs and no timeouts. Yet something is off. The agent still sounds fluent and still follows the script, but its trajectories have grown longer and its tool calls more repetitive. This is where agent defect detection and drift monitoring in production begin to matter, because agent defects are not classical code bugs. They are behavioral discrepancies between what the developer's control logic expects and what the model actually produces. The 2026 study "Characterizing Faults in Agentic AI: A Taxonomy of Types, Symptoms, and Root Causes" (arXiv:2603.06847) makes the scale concrete. It mined 13,602 issues from 40 repositories, sampled 385 faults, and validated its taxonomy with 145 developers.

Autonomy is the whole subject here. This article is the capstone of a series — The Autonomous Sales Fleet — that built one production system across ten installments, adding exactly one capability per article as one real graph, each step climbing an autonomy ladder that runs from rep-assist up to self-directed plan→act→verify loops. Every rung of that ladder is a grant of trust, and every grant can decay. Defect and drift detection is the guardrail that makes autonomy durable rather than a one-time gift: it is the continuous check that an agent promoted up the ladder has not quietly slid back down it in production.

That durability is the point a per-run pass/fail can never deliver on its own. An agent that earns the right to act without a human in the loop only keeps that right if something watches for the slow degradation no single run reveals. The monitor in this article is that watcher — it reads finished traces, flags the wandering tool loops and drifted personas that keep an agent looking fluent while it stops doing its job, and routes the failures back to the human gate that granted the autonomy in the first place. Catch the defect per run, catch the drift across runs, and the fleet can hold its autonomy instead of silently forfeiting it.

LangGraph v3 Event Streaming: Typed Projections Over a Content-Block Protocol

June 3, 2026 · 13 min read

Vadim Nicolai

Senior Software Engineer

Streaming an LLM to a user is easy. Consuming the stream on the server — token deltas, reasoning deltas, tool-call chunks, per-node state, subgraph events, usage metadata — is the part that turns into a pile of if chunk["type"] == ... branches. I shipped a streaming endpoint last week on LangGraph version="v2", because that is what's installed (1.1.8 locally, 1.2.4 on the server). The hand-rolled consumer was about twenty lines of fragile branching, a keepalive hack to stop a proxy from dropping the connection during DeepSeek's silent reasoning phase, and a manual accumulator that reset whenever langgraph_node changed.

LangGraph's version="v3" event-streaming API is what I'd reach for next, and the diff is the interesting part: it deletes most of that parsing. Instead of one undifferentiated event firehose you branch on, v3 gives you typed, per-channel projections you iterate independently, built on a content-block protocol that makes text, reasoning, tool-call, and multimodal boundaries explicit. v1 and v2 are unchanged. This is a walk through what v3 actually is, what it removes from your code, and where it still leaves work for you.

Observable AI Memory: mem0, LangGraph, and Qdrant with Enterprise-Grade Telemetry

June 2, 2026 · 13 min read

Vadim Nicolai

Senior Software Engineer

Most "AI memory" demos stop at memory.add() and memory.search(). That works on a laptop. It does not survive contact with production. The real questions are: When this recall is slow, which store is to blame? When a graph's spend triples overnight, which feature caused it? When a customer asks "what did your agent remember about me, and when?", can you answer from an audit log instead of a shrug?

TL;DR — This field report shows how to build an agent memory layer where every operation honors a contract: fail-open, PII-safe, and fully instrumented. Three stores (mem0, Qdrant, LangGraph) are funneled through single chokepoints, and each chokepoint fans out to five telemetry sinks. The result is a stack that answers the hard production questions without guesswork.

BMAD Method + Langfuse + Claude Code Agent Teams in Production

February 23, 2026 · 16 min read

Vadim Nicolai

Senior Software Engineer

Running AI agents in a real codebase means solving three intertwined problems at once: planning and quality gates (so agents don't drift), observability (so you know what's working), and orchestration (so multiple agents divide work without clobbering each other). In nomadically.work — a remote EU job board with an AI classification and skill-extraction pipeline — these problems are solved by three complementary systems: BMAD v6, Langfuse, and Claude Code Agent Teams. This article explains how each works and how they compose.