Vadim Nicolai
Senior Software Engineer

Senior Software Engineer building AI-powered products with TypeScript, Rust, and LLMs. Writes about AI agents, eval-driven development, and edge computing.


Agentic CLEAR: Automating Multi-Level Agent Evaluation — and the Autonomy Gate It Unlocks

· 17 min read

Every team running an agent fleet has the same blind spot. Observability platforms—MLflow, Langfuse, home-grown OpenTelemetry—capture execution traces beautifully. They show you what the agent did. They say almost nothing about whether it did it well. So a developer opens the trace viewer, scrolls through a few hundred spans, and tries to eyeball a systemic failure out of thousands of runs. The research alternative is worse: hand-built error taxonomies that take weeks to annotate and go stale the moment the agent changes. What both approaches lack is automated multi-level agent evaluation—judgment of the trajectory itself, not just a record of it.

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents, by Yehudai, Eden, and Shmueli-Scheuer (2026) at IBM Research, attacks exactly this gap. It is an open-source Python package—pip install clear-eval—that reads raw agent traces and produces data-driven evaluation at three levels of granularity, surfaces recurring failure patterns without a predefined taxonomy, and renders the whole thing in an interactive dashboard. It reports up to 0.890 AUC for predicting trajectory success in a fully reference-less setting. This post walks through what the paper actually does, then shows how I wired the same multi-level shape into a 45-graph production fleet as an "autonomy gate"—the component that turns a human approval interrupt into a machine one.

Knowledge-Graph RAG for Explainable Lead and Account Recommendation

· 11 min read

If your first instinct on hearing "knowledge graph" is to reach for Neo4j, you may be over-engineering a lead recommendation system. The past year's research on combining knowledge graphs with retrieval-augmented generation (KG-RAG) for recommendations converges on a pragmatic insight: the most effective KG is often the one you already have. In this design, that is a normalized relational schema of companies, contacts, opportunities, and emails living in a Cloudflare D1 database. Traversing those foreign keys at query time, bounding the fan-out, and feeding the resulting subgraph into an LLM can produce recommendations that are grounded by construction — every explanation path is required to trace back to a real row in the operational store.
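The traversal described above can be sketched with Python's built-in `sqlite3` module standing in for D1. The table layout, the edge labels, and the `company_subgraph` helper are illustrative assumptions, not the post's actual schema or code; the point is only that each edge carries the table and primary key of the row it came from, and that fan-out is capped per relation.

```python
import sqlite3

# Hypothetical schema: a small slice of a normalized CRM-style store.
SCHEMA = """
CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE contacts (
    id INTEGER PRIMARY KEY,
    company_id INTEGER NOT NULL REFERENCES companies(id),
    name TEXT NOT NULL
);
CREATE TABLE opportunities (
    id INTEGER PRIMARY KEY,
    company_id INTEGER NOT NULL REFERENCES companies(id),
    stage TEXT NOT NULL
);
"""

# (child table, edge label, display column). Fixed constants, never user
# input, so interpolating them into the SQL string below is safe.
CHILD_TABLES = (
    ("contacts", "employs", "name"),
    ("opportunities", "has_opportunity", "stage"),
)


def company_subgraph(
    conn: sqlite3.Connection, company_id: int, max_fanout: int = 2
) -> list[tuple[str, str, str, str]]:
    """Follow foreign keys out of one company row, at most max_fanout rows per relation.

    Each edge is (source_row, label, target_row, value), where the row
    identifiers are "table:primary_key" so an explanation can cite them.
    """
    company = conn.execute(
        "SELECT id FROM companies WHERE id = ?", (company_id,)
    ).fetchone()
    if company is None:
        return []
    root = f"companies:{company[0]}"
    edges = []
    for table, label, column in CHILD_TABLES:
        rows = conn.execute(
            f"SELECT id, {column} FROM {table} "
            "WHERE company_id = ? ORDER BY id LIMIT ?",
            (company_id, max_fanout),
        ).fetchall()
        for row_id, value in rows:
            edges.append((root, label, f"{table}:{row_id}", value))
    return edges


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO companies (id, name) VALUES (1, 'Acme')")
conn.executemany(
    "INSERT INTO contacts (id, company_id, name) VALUES (?, ?, ?)",
    [(1, 1, "Ada"), (2, 1, "Bo"), (3, 1, "Cy")],
)
conn.execute(
    "INSERT INTO opportunities (id, company_id, stage) VALUES (1, 1, 'negotiation')"
)

edges = company_subgraph(conn, company_id=1, max_fanout=2)
for edge in edges:
    print(edge)
```

With the fan-out capped at two, the third contact is dropped and the subgraph handed to the LLM stays bounded regardless of how many rows the company has.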

LLM Lead Conversion-Propensity Scoring for B2B Lead Prioritization

· 12 min read

The published literature on lead scoring converges on a couple of recurring findings. A B2B feature-importance analysis identified lead source and lead status as the most predictive conversion features (Frontiers in AI, 2025). And a supervised classifier trained on labelled outcomes tends to beat both rule-based heuristics and manual qualification. Yet many B2B teams deploying an LLM for lead prioritization skip the classifier, skip the labelled outcomes, and instead ask the model to reason its way to a score from contact evidence. Is that defensible, or is it cargo-cult AI?

LLM Sales-Email Intent Scoring for Inbound Lead Prioritization

· 10 min read

A practical LLM-based intent-scoring design can do exactly one thing: make a single call to a language model, read a few floating-point scores, and fall back to a keyword heuristic if the model fails. No multi-agent orchestration. No fine-tuned BERT. No LightGBM ensemble. And according to the 2026 literature, an LLM semantic scorer outperforms keyword-based intent detection (Sanjei et al., 2026). The useful insight is that an effective design for sales-email intent scoring can also be one of the simplest — a bounded, schema-constrained LLM step embedded inside an existing dataflow graph, designed to fail open rather than cascade errors downstream. This article unpacks why that design is attractive, what the research actually says, and how to build it without over-engineering.
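A minimal sketch of that shape, assuming the model is reached through a plain callable that returns a JSON string. The score names, the keyword list, and the `score_intent` helper are hypothetical; a real deployment would pass its own client in place of `call_model`.

```python
import json
from typing import Callable

# Hypothetical score fields and keywords, chosen only for illustration.
SCORE_KEYS = ("purchase_intent", "urgency")
KEYWORDS = ("pricing", "quote", "demo", "trial")


def keyword_fallback(email_text: str) -> dict:
    """Crude heuristic: the fraction of known keywords present in the email."""
    text = email_text.lower()
    hits = sum(1 for keyword in KEYWORDS if keyword in text)
    score = min(1.0, hits / len(KEYWORDS))
    return {key: score for key in SCORE_KEYS}


def parse_scores(raw: str) -> dict:
    """Enforce the schema: every expected key present, each a float in [0, 1]."""
    data = json.loads(raw)
    scores = {}
    for key in SCORE_KEYS:
        value = float(data[key])
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{key} out of range: {value}")
        scores[key] = value
    return scores


def score_intent(email_text: str, call_model: Callable[[str], str]) -> dict:
    """One model call; any failure falls back to keywords instead of raising."""
    try:
        return {**parse_scores(call_model(email_text)), "source": "llm"}
    except Exception:
        # Fail open: the downstream graph still receives a usable score.
        return {**keyword_fallback(email_text), "source": "keyword_fallback"}


def good_model(_: str) -> str:
    return '{"purchase_intent": 0.9, "urgency": 0.4}'


def broken_model(_: str) -> str:
    raise TimeoutError("simulated model outage")


email = "Can we get a quote and a demo?"
llm_result = score_intent(email, good_model)
fallback_result = score_intent(email, broken_model)
print(llm_result)
print(fallback_result)
```

Tagging each result with its `source` keeps the fallback visible downstream, so a run of keyword-only scores can be spotted rather than silently blended with model scores.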

Durable Execution in LangGraph: Agents That Survive Failure and Resume Where They Left Off

· 12 min read

Most AI agents are built as a single process holding state in memory: a while loop, local variables, maybe a sleep(). That holds up until the workflow has to outlive the process that started it — and in production it always does. The math is unforgiving: chain ten steps that each succeed 85% of the time and the whole run finishes only about 20% of the time (0.85¹⁰ ≈ 0.20). Without durability, every one of those failures restarts from scratch. The model might be reliable; the tool calls aren't. Better LLMs don't fix network failures — only durable execution does.
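The arithmetic is quick to check. The sketch below assumes every step fails independently and with the same success rate, which is a simplification; the function name is illustrative.

```python
def chain_success_probability(step_success: float, steps: int) -> float:
    """Probability that every step of a sequential chain succeeds,
    assuming steps fail independently of one another."""
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_success ** steps


p = chain_success_probability(0.85, 10)
print(f"{p:.3f}")  # 0.197
```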

The research consensus is that the infrastructure around the model, not the model itself, is where production agents live. The 2026 design-space analysis Dive into Claude Code found that only 1.6% of Claude Code's codebase is AI decision logic; the other 98.4% is operational infrastructure for context management, tool routing, and recovery. LangGraph's answer to that reality is durable execution through its persistence layer — making the agent a row in a checkpoint store, not a stack frame in a living process. This article dissects how that works, the sharp edges it creates, and how to observe a workflow that — by design — no longer runs as a single process.
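The "row in a checkpoint store" idea can be illustrated without LangGraph at all. The following is a generic sketch using the standard-library `sqlite3` module, not LangGraph's persistence API: the `checkpoints` table, `run_durable`, and the two demo steps are all hypothetical names. State is written after every completed step, so a second invocation with the same thread id resumes at the step that failed.

```python
import json
import sqlite3
from typing import Callable

Step = Callable[[dict], dict]


def init_store(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "thread_id TEXT PRIMARY KEY, next_step INTEGER NOT NULL, state TEXT NOT NULL)"
    )


def run_durable(
    conn: sqlite3.Connection, thread_id: str, steps: list[Step], initial_state: dict
) -> dict:
    """Run steps in order, persisting state after each one.

    If a checkpoint exists for thread_id, execution resumes from the first
    step that has not completed instead of starting over.
    """
    row = conn.execute(
        "SELECT next_step, state FROM checkpoints WHERE thread_id = ?", (thread_id,)
    ).fetchone()
    next_step, state = (row[0], json.loads(row[1])) if row else (0, dict(initial_state))
    for index in range(next_step, len(steps)):
        state = steps[index](state)  # may raise; the previous checkpoint survives
        conn.execute(
            "INSERT OR REPLACE INTO checkpoints (thread_id, next_step, state) "
            "VALUES (?, ?, ?)",
            (thread_id, index + 1, json.dumps(state)),
        )
        conn.commit()
    return state


calls = {"fetch": 0, "flaky": 0}


def fetch(state: dict) -> dict:
    calls["fetch"] += 1
    return {**state, "fetched": True}


def flaky(state: dict) -> dict:
    calls["flaky"] += 1
    if calls["flaky"] == 1:
        raise ConnectionError("simulated network failure")
    return {**state, "done": True}


conn = sqlite3.connect(":memory:")
init_store(conn)
try:
    run_durable(conn, "thread-1", [fetch, flaky], {})
except ConnectionError:
    pass  # the process "dies" here; the checkpoint row remains

final = run_durable(conn, "thread-1", [fetch, flaky], {})
print(final, calls)
```

The second call never re-runs `fetch`: the work that succeeded before the failure is read back from the store rather than repeated.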

LangGraph v3 Event Streaming: Typed Projections Over a Content-Block Protocol

· 13 min read

Streaming an LLM to a user is easy. Consuming the stream on the server — token deltas, reasoning deltas, tool-call chunks, per-node state, subgraph events, usage metadata — is the part that turns into a pile of if chunk["type"] == ... branches. I shipped a streaming endpoint last week on LangGraph version="v2", because that is what's installed (1.1.8 locally, 1.2.4 on the server). The hand-rolled consumer was about twenty lines of fragile branching, a keepalive hack to stop a proxy from dropping the connection during DeepSeek's silent reasoning phase, and a manual accumulator that reset whenever langgraph_node changed.

LangGraph's version="v3" event-streaming API is what I'd reach for next, and the diff is the interesting part: it deletes most of that parsing. Instead of one undifferentiated event firehose you branch on, v3 gives you typed, per-channel projections you iterate independently, built on a content-block protocol that makes text, reasoning, tool-call, and multimodal boundaries explicit. v1 and v2 are unchanged. This is a walk through what v3 actually is, what it removes from your code, and where it still leaves work for you.

Observable AI Memory: mem0, LangGraph, and Qdrant with Enterprise-Grade Telemetry

· 13 min read

Most "AI memory" demos stop at memory.add() and memory.search(). That works on a laptop. It does not survive contact with production. The real questions are: When this recall is slow, which store is to blame? When a graph's spend triples overnight, which feature caused it? When a customer asks "what did your agent remember about me, and when?", can you answer from an audit log instead of a shrug?

TL;DR — This field report shows how to build an agent memory layer where every operation honors a contract: fail-open, PII-safe, and fully instrumented. Three stores (mem0, Qdrant, LangGraph) are funneled through single chokepoints, and each chokepoint fans out to five telemetry sinks. The result is a stack that answers the hard production questions without guesswork.
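One way to picture a chokepoint honoring that contract is a decorator that wraps every store call. This is a sketch under stated assumptions, not the post's implementation: `instrumented`, the event fields, and the two demo sinks are hypothetical, and the real system fans out to five sinks rather than two.

```python
import functools
import time
from typing import Any, Callable

Sink = Callable[[dict], None]


def instrumented(op_name: str, sinks: list[Sink], fallback: Callable[[], Any]):
    """Wrap a memory operation so it fails open and reports to every sink."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            ok = True
            try:
                result = fn(*args, **kwargs)
            except Exception:
                ok = False
                result = fallback()  # fail open: the caller gets a safe default
            # PII-safe by construction: the event carries no arguments or results.
            event = {
                "op": op_name,
                "ok": ok,
                "duration_ms": (time.perf_counter() - start) * 1000.0,
            }
            for sink in sinks:
                try:
                    sink(event)
                except Exception:
                    pass  # a broken sink must never break the operation
            return result

        return wrapper

    return decorator


events: list[dict] = []


def list_sink(event: dict) -> None:
    events.append(event)


def broken_sink(event: dict) -> None:
    raise RuntimeError("sink unavailable")


@instrumented("memory.search", [broken_sink, list_sink], fallback=list)
def search(query: str) -> list:
    raise TimeoutError("store unavailable")


result = search("what do you know about alice@example.com")
print(result, events[0]["op"], events[0]["ok"])
```

Both the failing store and the failing sink are absorbed, and the recorded event contains only the operation name, outcome, and latency.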

Multi-Probe Bayesian Spam Gating: Filtering Junk Before Spending Compute

· 40 min read

In a B2B lead generation pipeline, every email that arrives costs compute. Scoring it for buyer intent, extracting entities, predicting reply probability, matching it against your ideal customer profile — each module is a DeBERTa forward pass. If 40% of inbound email is template spam, AI-generated slop, or mass-sent campaigns, you are burning 40% of your GPU budget on garbage.

The solution is a gating module: a spam classifier that sits at stage 2 of the pipeline and filters junk before anything else runs. But a binary spam/not-spam classifier is too blunt. You need to know why something is spam (template? AI-generated? role account?), how confident you are (is it ambiguous, or have you never seen this pattern before?), and which provider will block it (Gmail is stricter than Yahoo on link density).

This article documents a hierarchical Bayesian spam gating system with 4 aspect-specific attention probes, information-theoretic AI detection features, uncertainty decomposition, and a full Rust distillation path. The Python model trains on DeBERTa-v3-base. The Rust classifier runs at batch speed with 24 features and zero ML dependencies.
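The distinction between "ambiguous" and "never seen before" is commonly expressed as an entropy decomposition over several sampled predictions. The excerpt does not say which method the system uses, so the sketch below is an assumption: it takes spam probabilities from repeated stochastic passes (or ensemble members) and splits total predictive entropy into an aleatoric part (average per-sample entropy) and an epistemic part (the remainder, i.e. disagreement between samples). All names are illustrative.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) prediction."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def decompose_uncertainty(spam_probs: list[float]) -> dict:
    """Split predictive uncertainty across sampled spam probabilities.

    total     = entropy of the mean prediction
    aleatoric = mean entropy of the individual predictions (inherent ambiguity)
    epistemic = total - aleatoric (disagreement between samples)
    """
    if not spam_probs:
        raise ValueError("need at least one sampled probability")
    mean_p = sum(spam_probs) / len(spam_probs)
    total = binary_entropy(mean_p)
    aleatoric = sum(binary_entropy(p) for p in spam_probs) / len(spam_probs)
    epistemic = max(0.0, total - aleatoric)  # clamp tiny negative float error
    return {"total": total, "aleatoric": aleatoric, "epistemic": epistemic}


# Every sample agrees the email is a coin flip: ambiguous, but familiar.
ambiguous = decompose_uncertainty([0.5, 0.5, 0.5, 0.5])
# Samples are individually confident but contradict each other: unfamiliar input.
novel = decompose_uncertainty([0.02, 0.98, 0.02, 0.98])
print(ambiguous)
print(novel)
```

Both cases have the same mean prediction of 0.5, yet the first is dominated by aleatoric uncertainty and the second by epistemic uncertainty, which is the signal a gate would need to route the two differently.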