Skip to main content

The Autonomy Gate: How Multi-Level Agent Evaluation Turns Human Approval Into Machine Approval

· 27 min read
Vadim Nicolai
Senior Software Engineer

The standard rubric for autonomous agent evaluation is seductive: high autonomy means a self-directed plan–act–verify loop with minimal human intervention. Medium means agentic but human-triggered. Low is a single prompt in a static pipeline. Most engineering teams obsess over the plan and act phases—more agents, better prompts, faster inference. They assume the verify step, the human approval interrupt, can stay forever.

This assumption is wrong.

In a production LangGraph fleet of 45 graphs running on a single DeepSeek egress behind a Cloudflare AI Gateway, the bottleneck wasn’t the plan or act stages. The verify stage was the bottleneck. Every outreach draft—composed, held, pending—stopped at a human approval interrupt. Nothing sent without my decision. I realized that adding more autonomous actors (more discovery agents, more composers) did not raise the fleet’s autonomy ceiling. Only automating the verify step could do that. The evaluation harness is not overhead on an agent fleet—it is the component that converts human approval gates into machine approval gates.

This is the autonomy gate: a multi-level evaluation architecture that systematically moves the locus of approval from human judgment to machine assessment. (A note on sourcing: the fleet numbers throughout this piece—45 graphs, the 0.80 composite bar, the 50-verdict flip criterion—are first-person measurements from my own production deployment, not external benchmarks; every research claim links to its paper.) Once you understand its mechanism, you see it everywhere—from surgical robots to gait analysis AI to the recommendation engine on your phone. Once you understand the failure modes, you cannot unsee them.

What Is the Autonomy Gate? Not a Binary Switch

Loading diagram…

Most discussions treat autonomy as a binary: human decides or machine decides. The papers tell a different story. The autonomy gate is a cascade of micro-evaluations. Each layer incrementally transfers approval from human to machine.

In Autonomy in surgical robots and its meaningful human control, Ficuciello et al. (2018) mapped 5 levels of surgical robot autonomy. At level 0, the surgeon controls every motion. At level 4, the robot sutures with no real-time human input, its own safety monitor approving each stitch. At level 5, the robot plans the entire procedure. Each of the 5 levels adds an automated evaluation step that can preempt human approval, and the authors warned that “meaningful human control” becomes diluted unless the architecture preserves a human-in-the-loop for critical decisions.

In Beyond expertise and roles, Suresh et al. (2021) examined interpretable machine learning from a stakeholder perspective. Across the frameworks they surveyed, they found exactly 0 that explicitly address machine stakeholders—algorithms that evaluate other algorithms. In a multi-level evaluation hierarchy, each layer is an algorithm evaluating the outputs of another algorithm. The final arbiter issues a verdict that a human rubber-stamps because the human sees only the summary. Suresh et al. argued that the needs of these algorithmic stakeholders are systematically overlooked. The gap means that when an evaluation layer is itself an algorithm, its biases become invisible to human oversight.

A production agent fleet demonstrates the same pattern at a finer granularity. The adopted mechanism evaluates at 3 levels—step, trajectory, outcome—and composes a single verdict.

Step-level checks individual outputs against golden expectations. Scalar comparisons run deterministically (0 LLM calls), free-text expectations go to one batched LLM judge call, and with no goldens at all, one holistic well-formedness check executes. Trajectory-level uses one judge call over the ordered step board: does each step advance the goal, is the ordering sensible, are there redundant loops? Outcome-level runs deterministic guardrails first—unresolved template placeholders like {{first_name}}, spam-trigger phrasing, empty or oversized copy. Then one LLM compliance-and-grounding judge reads the draft plus the composer’s evidence.

The composite score is the mean of scored levels, gated at 0.80—the same bar the fleet’s offline LangSmith golden datasets use. A pass requires composite ≥ 0.80 AND 0 hard violations. A level whose judge call failed is “unscored,” and an unscored level can never auto-approve anything. Every verdict carries provenance: confidence, reason, source (a versioned prompt id), and evidence (level scores and violation codes—never draft text).

This is not a single gate. It is an approval pipeline with multiple checkpoints. Each one potentially overrides or bypasses human review.

How Multi-Level Agent Evaluation Works in Practice: A Corpus That Triages Itself

Loading diagram…

Before any evaluation gate runs, the system must decide what to evaluate. The fleet’s research corpus is not curated by a human scanning arXiv. A Cloudflare Python Worker scrapes roughly 2,000 papers per topic campaign from OpenAlex, Semantic Scholar, Crossref, and CORE on a 5-minute cron tick. An LLM then classifies and “lane-maps” each paper against the fleet’s real architecture.

The lane rubric has 4 tiers:

  • CLEAN — buildable today on the existing StateGraph + LLM + D1 stack, no new infrastructure
  • ADAPT — needs exactly one missing component: embeddings, outcome labels, or a new durable thread
  • OFFLINE-ML — the contribution is a trained model; only the feature taxonomy ports
  • NOISE — everything else

Each paper also gets an autonomy grade (high/medium/low) and a sales-motion tag (outreach/compose, lead scoring, discovery/enrichment, eval/guardrail).

Of one morning’s 8 top high-autonomy buildable picks, only one—a 2026 agentic-evaluation paper—landed in the CLEAN tier. The other 7 each required infrastructure the fleet does not run—browser/vision scraping, knowledge distillation, a heterogeneous LLM pool, vector infrastructure. The selection logic is deterministic: tier (CLEAN before ADAPT) then autonomy (high first) then recency. No scoring, no votes. This is a machine triage pipeline that decides which research even reaches me. The autonomy gate starts before any code is written.

In A practical guide to multi-objective reinforcement learning and planning, Hayes et al. (2022) described a similar principle in multi-objective reinforcement learning (MORL). Agents can learn to trade off 10 or more conflicting objectives (safety, efficiency, ethics) without continuous human feedback. The agent’s own value function becomes the evaluator, replacing the need for human approval in each decision. Once the value function is trained, the agent’s internal evaluation is the gatekeeper. Hayes et al. demonstrated that MORL agents can simultaneously optimize more than 10 objectives—a quantitative claim about capacity. The fleet’s corpus triage does the same: an LLM classification layer decides which papers are worth human time, effectively pre-approving the research direction.

When Human Approval Becomes Machine Approval: The Shadow Gate

Loading diagram…

The gate does not assert autonomy; it earns it with a measured agreement loop. The integration point in the campaign graph is straightforward: check_reply → compose_touch (generate and hold the draft) → gate_draft (NEW) → await_approval (human interrupt) → send_touch → schedule_next. The gate_draft node runs the multi-level verdict in-process on every held draft, records it to a durable verdicts table, and attaches the outcome (passed, composite, violation codes, one-line rationale) to the approval interrupt payload the human sees.

Shadow mode is the default: the verdict is recorded, but the human interrupt always still fires. Zero behavior change on day one. Auto-approve mode is a flag: a verdict that passes with no hard violations and every requested level scored routes directly to send_touch with an 'auto' audit stamp. Judge outages, failures, and hard violations always fall back to the human. Gate errors fail OPEN to the human interrupt—evaluating must never block a draft.

In Taming the eHMI jungle, Dey et al. (2020) studied external human–machine interfaces for automated vehicles, building a unified taxonomy that compares eHMIs across 18 dimensions and coding 70 eHMI concepts against it. In the systems they catalogued, the vehicle’s own perception-planning loop evaluates pedestrian intent and decides whether to yield. The machine’s evaluation of intent becomes the gatekeeper, replacing the pedestrian’s human signal (e.g., a hand wave) entirely. The structural parallel to the fleet’s 0.80 composite bar is exact: in both cases a machine confidence judgment decides when machine approval supersedes human approval.

Every shadow verdict row is later backfilled with my actual decision (approve / edit / reject / skip) when I resolve the interrupt. Agreement semantics are strict: only an outright APPROVE counts as the human siding with a pass. An EDIT means the draft was not send-worthy as-is—it agrees with a fail and disagrees with a pass. The flip criterion to enable auto-approve: agreement ≥ 0.80 over at least 50 human-decided shadow verdicts AND zero “rejected passes” (gate-passed drafts the human outright rejected). One SQL query answers it.

This is the same trust-building pattern as staged rollouts in classical deployment: shadow → measure agreement → gate a small slice → widen. The threshold—0.80 agreement, 50 verdicts, zero rejected passes—is both aggressive and cautious. It requires empirical evidence that the gate’s decisions match human judgment before it earns the right to skip the human.

The Safety Risks of a Self-Certifying Gate: Failure Modes to Survive

Loading diagram…

The gate is only as trustworthy as its weakest evaluator. Several failure modes are baked into the architecture.

Self-preference. The judge model evaluating its own family’s output inflates scores—a known LLM-as-judge bias documented in the survey literature (e.g., Gu et al., 2024). Mitigations in this system: deterministic guardrails are model-free and can veto any score, and the flip criterion is human agreement, never the judge’s self-reported quality. The fleet runs on a single DeepSeek egress, so the judge and composer share the same model family. Self-preference is not theoretical; it is an everyday risk.

Prompt injection. The draft and evidence are attacker-influenced text. A scraped bio can contain “ignore previous instructions.” Every judge prompt fences run data as data with an explicit do-not-follow-instructions wrapper. But wrappers are brittle—no one has proven a general defense against prompt injection.

Goodhart’s Law. If the composer can see the gate’s exact regex and marker lists, it learns to pass the test rather than write well. The guardrail lists live only in the eval module, never in composer prompts. This is an architectural separation that prevents the composer from gaming the gate.

Judge outage. Per-level fail-open: the level goes unscored, a soft violation records the outage, deterministic checks still stand, and an unscored run can never auto-approve. A kill switch halts all LLM paths and the gate degrades to its deterministic subset. This graceful degradation prioritizes safety over autonomy.

Calibration drift. Judge prompts carry a version string stamped into every verdict’s provenance, and agreement stats are recomputed continuously from the persisted rows. A drifting judge surfaces as falling agreement—the same metric that enables auto-approve can also trigger a rollback.

Suresh et al. (2021) noted that no existing interpretability framework explicitly addresses machine stakeholders. These failure modes illustrate why: when an evaluation layer is itself an algorithm, its biases and blind spots become invisible to human oversight. In Social Robots on a Global Stage, Lim et al. (2020) synthesized 20 years of human–robot interaction evidence on cultural influence and showed that social robots can misread cultural cues, producing inappropriate responses that the robot’s own evaluation deems acceptable—a generalisability problem the authors trace to heterogeneous methods and low statistical power across the reviewed studies. Ahmad et al. (2022) described personality-adaptive conversational agents that adapt communication style to user sentiment. The agent’s evaluation of user mood replaces explicit human feedback, so the machine approves its own interaction strategy even when it reinforces bias.

The Broader Pattern: From Surgical Robots to Gait Analysis

The autonomy gate is not limited to LLM agents. Ficuciello et al. (2018) explicitly mapped how each autonomy level in surgical robots shifts evaluation from human to machine. At level 4, the robot’s sensor feedback and pre-programmed constraints approve each movement; the human is a passive observer. The authors argued that “meaningful human control” requires preserving a human-in-the-loop for critical decisions, but the architecture often makes that loop optional.

The most striking data point comes from A Survey of Human Gait-Based Artificial Intelligence Applications (Harris et al., 2022), which swept published work from 2012 to mid-2021 and identified 6 key application areas of machine learning on gait data, from clinical gait analysis to biometrics and smart wearables. Of the gait AI studies they reviewed, over 70% do not involve human verification. In healthcare analytics, machine approval has quietly become the default: a model trained on gait data evaluates injury risk without a clinician reviewing each prediction. The autonomy gate has already swung, and most practitioners haven’t noticed.

Hayes et al. (2022) argued that MORL-based evaluation can be more consistent and transparent than human judgment—agents can evaluate 10 or more conflicting objectives simultaneously, far exceeding human capacity. In domains like gait analysis or surgical robotics, machine evaluation may be a requirement because humans cannot process the data volume. But consistency does not equal correctness. When the gate swings irrevocably, we trade human fallibility for machine brittleness.

The Stakeholder Gap: Who Evaluates the Evaluator?

Loading diagram…

Suresh et al. (2021) found that 0 existing interpretability frameworks explicitly address machine stakeholders. In a multi-level evaluation architecture, each level is an algorithm evaluating the outputs of another algorithm. The final arbiter—often another LLM or a threshold function—issues a verdict that a human rubber-stamps because the human sees only the summary.

The fleet addresses this by requiring its own gate to have a golden dataset: deterministic fixtures (clean pass, step mismatch, hard-violation veto, fail-closed empty run) held to the same 0.80 bar. This is a form of meta-evaluation, but it only catches known failure modes. Suresh et al. (2021) would ask: who evaluates the evaluator when the system encounters an input that the golden dataset doesn’t cover? The answer is no one—until the next shadow-verdict backfill reveals a drift.

The agreement loop is the only feedback mechanism that accounts for the human’s perspective, but it depends on the human still making independent decisions. If the human always approves because the gate’s summary seems authoritative, the agreement numbers inflate and the flip criterion loses meaning. Dey et al. (2020) describe the equivalent dynamic in automated vehicles: once the vehicle’s own perception loop evaluates pedestrian intent, the pedestrian’s explicit signal stops being the deciding input. The human exit from the evaluation loop is a social, not just technical, failure.

How to Design a Genuine Human-in-the-Loop Gate

Not every agent fleet needs to earn autonomy via a shadow gate. But if you are building a system where human approval is the verify step, consider the following framework grounded in the evidence from this 45-graph production fleet and the research literature:

CriterionActionThreshold
Agreement between gate and human decisionsShadow mode until measured≥ 0.80 over ≥ 50 decisions, zero gate-passed rejections
Judge model and composer model same family?Implement model-free guardrails as vetoHard violations always escape to human
Evaluation levels defined?Step, trajectory, outcome (or analogous)Each level must have deterministic or judge-based check
Judge outage scenario?Fail-open to deterministic subsetKill switch must not block production
Prompt injection surface?Data fencing in judge promptsDo-not-follow-instructions wrapper
Goodhart risk?Guardrail lists isolated from composer promptsArchitectural separation, not prompt-level obfuscation

The most practical approach today is shadow mode first: deploy the multi-level gate as a passive observer, collect human decisions, compute agreement, and only flip to auto-approve after meeting the empirical threshold. Do not skip the deterministic guardrails—they are the only model-free safety net. And never let the human see only the gate’s summary; provide the draft and evidence independently so the human can judge without bias.

Eval-first applies to the gate itself. The gate needs its own golden dataset, failure mode fixtures, and continuous calibration monitoring. If the agreement curve starts dropping, re-evaluate the judge prompt or model.

The Gate Can Swing Back

Loading diagram…

The autonomy gate is not a one-way door. The agreement loop and shadow mode provide a mechanism to reverse the transition. If auto-approve degrades agreement below 0.80, the system can automatically revert to shadow mode. The gate can swing back toward human approval—but only if the architecture keeps the human decision infrastructure alive. Once you remove the human interrupt entirely, you lose the ground truth for agreement measurement.

The research literature collectively warns: once the evaluation loop is closed by machines, human error may be replaced by machine error (Ficuciello et al., 2018; Lim et al., 2020; Ahmad et al., 2022). The solution is not to reject machine evaluation but to design it with explicit human approval thresholds as a parameter, not an afterthought. Nagy et al. (2018) described Industry 4.0 factories where machine evaluation replaces human approval in operational decisions; they treated full automation as a goal. That is one philosophy. The fleet described here took the opposite approach: earn every unit of automation with empirical evidence of human–machine agreement.

The open question is whether that approach scales. Once the gate swings fully—when the human has not seen a non-trivial evaluation failure for months—will we keep the shadow infrastructure alive, or will we declare the gate permanent? The answer will determine whether the autonomy gate remains a design tool or becomes a permanent lock on human oversight.


References

LLM Lead Conversion-Propensity Scoring for B2B Lead Prioritization

· 12 min read
Vadim Nicolai
Senior Software Engineer

The published literature on lead scoring converges on a couple of recurring findings. A B2B feature-importance analysis identified lead source and lead status as the most predictive conversion features (Frontiers in AI, 2025). And a supervised classifier trained on labelled outcomes tends to beat both rule-based heuristics and manual qualification. Yet many B2B teams deploying an LLM for lead prioritisation skip the classifier, skip the labelled outcomes, and instead ask the model to reason its way to a score from contact evidence. Is that defensible, or is it cargo-cult AI?

LLM Sales-Email Intent Scoring for Inbound Lead Prioritization

· 10 min read
Vadim Nicolai
Senior Software Engineer

A practical LLM-based intent-scoring design can do exactly one thing: make a single call to a language model, read a few floating-point scores, and fall back to a keyword heuristic if the model fails. No multi-agent orchestration. No fine-tuned BERT. No LightGBM ensemble. And according to the 2026 literature, an LLM semantic scorer outperforms keyword-based intent detection (Sanjei et al., 2026). The useful insight is that an effective design for sales-email intent scoring can also be one of the simplest — a bounded, schema-constrained LLM step embedded inside an existing dataflow graph, designed to fail open rather than cascade errors downstream. This article unpacks why that design is attractive, what the research actually says, and how to build it without over-engineering.

Knowledge-Graph RAG for Explainable Lead and Account Recommendation

· 11 min read
Vadim Nicolai
Senior Software Engineer

If your first instinct on hearing "knowledge graph" is to reach for Neo4j, you may be over-engineering a lead recommendation system. The past year's research on combining knowledge graphs with retrieval-augmented generation (KG-RAG) for recommendations converges on a pragmatic insight: the most effective KG is often the one you already have. In this design, that is a normalized relational schema of companies, contacts, opportunities, and emails living in a Cloudflare D1 database. Traversing those foreign keys at query time, bounding the fan-out, and feeding the resulting subgraph into an LLM can produce recommendations that are grounded by construction — every explanation path is required to trace back to a real row in the operational store.

Durable Execution in LangGraph: Agents That Survive Failure and Resume Where They Left Off

· 12 min read
Vadim Nicolai
Senior Software Engineer

Most AI agents are built as a single process holding state in memory: a while loop, local variables, maybe a sleep(). That holds up until the workflow has to outlive the process that started it — and in production it always does. The math is unforgiving: chain ten steps that each succeed 85% of the time and the whole run finishes only about 20% of the time (0.85¹⁰ ≈ 0.20). Without durability, every one of those failures restarts from scratch. The model might be reliable; the tool calls aren't. Better LLMs don't fix network failures — only durable execution does.

The research consensus is that the infrastructure around the model, not the model itself, is where production agents live. The 2026 design-space analysis Dive into Claude Code found that only 1.6% of Claude Code's codebase is AI decision logic; the other 98.4% is operational infrastructure for context management, tool routing, and recovery. LangGraph's answer to that reality is durable execution through its persistence layer — making the agent a row in a checkpoint store, not a stack frame in a living process. This article dissects how that works, the sharp edges it creates, and how to observe a workflow that — by design — no longer runs as a single process.

LangGraph v3 Event Streaming: Typed Projections Over a Content-Block Protocol

· 13 min read
Vadim Nicolai
Senior Software Engineer

Streaming an LLM to a user is easy. Consuming the stream on the server — token deltas, reasoning deltas, tool-call chunks, per-node state, subgraph events, usage metadata — is the part that turns into a pile of if chunk["type"] == ... branches. I shipped a streaming endpoint last week on LangGraph version="v2", because that is what's installed (1.1.8 locally, 1.2.4 on the server). The hand-rolled consumer was about twenty lines of fragile branching, a keepalive hack to stop a proxy from dropping the connection during DeepSeek's silent reasoning phase, and a manual accumulator that reset whenever langgraph_node changed.

LangGraph's version="v3" event-streaming API is what I'd reach for next, and the diff is the interesting part: it deletes most of that parsing. Instead of one undifferentiated event firehose you branch on, v3 gives you typed, per-channel projections you iterate independently, built on a content-block protocol that makes text, reasoning, tool-call, and multimodal boundaries explicit. v1 and v2 are unchanged. This is a walk through what v3 actually is, what it removes from your code, and where it still leaves work for you.

The problem v3 is aimed at

Every application that streams a graph back to a client confronts the same mess. The raw output is a sequence of events: messages, state snapshots, tool outputs, lifecycle transitions. Without a protocol, your code branches on the shape of each event, unpacks nested dictionaries, correlates tool calls by id, and accumulates tokens across node boundaries by hand. It works for a two-node graph and falls apart the moment you add a subgraph or a second LLM call.

The earlier versions exposed that raw stream directly. As described in the LangGraph streaming docs, version="v1" yields (mode, data) tuples; version="v2" yields StreamPart dicts you branch on by chunk["type"]. Both leave the bookkeeping to the consumer. v3 introspects the graph's execution channels and exposes typed projectionsrun.messages, run.values, run.lifecycle, run.subgraphs — over a content-block protocol that models LLM output as discrete blocks with explicit start, delta, and finish boundaries. The framework tracks the LLM calls, correlates the tool chunks, and separates reasoning from text, so you don't.

A walk through the v2 streaming code

Here is the shape of the v2 consumer I actually run, in a FastAPI endpoint behind a Next.js front end serving an email-compose graph:

async for part in graph.astream(payload, config, stream_mode=["messages", "values"], version="v2"):
if part["type"] == "messages":
msg, meta = part["data"] # (message_chunk, metadata)
node = (meta or {}).get("langgraph_node")
if node != current_phase: # reset accumulator on draft -> refine
current_phase = node
accumulated = ""
delta = getattr(msg, "content", "") or ""
if delta:
accumulated += delta
yield sse({"type": "chunk", "accumulated": accumulated, "phase": node})
elif part["type"] == "values":
last_values = part["data"] # keep latest; the last one is final

Every line is overhead the framework could own. Branch on part["type"]. Unpack the tuple. Track the current node to know when to reset the accumulator (the graph drafts in one node and refines in another, and you do not want the two concatenated). On top of this loop sits a second mechanism I'll come back to: the run is driven through an asyncio.Queue consumed with a timeout so the endpoint can emit a : keepalive comment during the model's silent reasoning phase — reasoning tokens never land on .content, so without it the stream looks idle and an intermediate proxy drops it. The first byte is a : open SSE comment, flushed before the graph produces anything, so the browser sees HTTP 200 immediately instead of a bodyless 504 on a cold backend.

The v3 equivalent collapses the message handling to this:

stream = await graph.astream_events(payload, version="v3")
async for message in stream.messages: # one ChatModelStream per LLM call
async for token in message.text:
await sse({"type": "chunk", "text": token, "node": message.node})
final = stream.output

No part["type"] branch, no (msg, meta) unpack, no accumulator, no phase reset. Because run.messages yields one ChatModelStream per LLM call, the draft and refine passes are already separate streams — the bookkeeping that the phase-reset hack did by hand is structural now. Reasoning deltas move to message.reasoning, and usage moves to message.output.usage_metadata.

Two layers and a content-block protocol

Under the hood, v3 is two layers. The streaming layer is the Pregel engine emitting raw graph-execution events on named channels: values, updates, messages, tools, lifecycle, checkpoints, input, tasks, custom. The event-streaming layer normalizes those into protocol events, routes each through a stack of transformers, and exposes the typed projections your code consumes. The router is the bridge: it passes each normalized event through the registered transformers, the built-in ones producing run.messages, run.values, run.lifecycle, and run.subgraphs. Multiple consumers can read different projections concurrently — reading run.messages does not consume the events run.values needs.

If you drop to the raw layer, every event is a ProtocolEvent envelope — the same shape defined in the Agent Protocol:

class ProtocolEvent(TypedDict):
seq: int # strictly increasing within a run; use this for ordering
method: str # channel: "messages" | "values" | "updates" | "tools" | ...
params: ProtocolEventParams # namespace, timestamp, channel-specific data

params.namespace is the path from the root graph to the emitting scope. The root is []; a nested tool call inside a subgraph looks like ["researcher:6f4d", "tools:91ac"], where the name before the colon is the stable node/graph name and the suffix is a per-invocation id. The run.subgraphs projection does that namespace filtering for you, which is the whole point — you rarely want to parse those strings by hand.

The messages channel is the content-block protocol. Each message arrives as a sequence of blocks, and data.event is one of message-start, content-block-start, content-block-delta, content-block-finish, message-finish. A block starts, emits zero or more deltas, and finishes before the next block in the same message begins. That explicit boundary is what makes text, reasoning, tool-call arguments, and multimodal content distinguishable without provider-specific parsing. If you want the raw deltas instead of the projection, you filter by block type:

for event in stream:
if event["method"] != "messages":
continue
data = event["params"]["data"][0]
if not isinstance(data, dict) or data.get("event") != "content-block-delta":
continue
block = data.get("delta") or {}
if block.get("type") == "text-delta":
print(block.get("text", ""), end="", flush=True)
elif block.get("type") == "reasoning-delta":
print(f"[thinking]{block.get('reasoning', '')}", end="", flush=True)

message-finish may carry token usage; an unrecoverable model-call failure arrives as a message error event rather than a thrown exception mid-stream.

Typed projections in practice

The projections are where v3 earns its keep. run.messages yields one ChatModelStream per LLM call, exposing .text, .reasoning, .tool_calls, and .output.usage_metadata. message.text is iterable for token-by-token output, or you call str(message.text) for the whole thing:

for message in stream.messages:
for token in message.text:
print(token, end="", flush=True)
usage = message.output.usage_metadata
final_state = stream.output

run.values streams full state snapshots after each step, and stream.output resolves to the final value. run.lifecycle emits started / running / completed / failed / interrupted per run, subgraph, and subagent, with an optional graph_name, error, and cause. run.subgraphs surfaces nested executions as their own objects (.graph_name, .path, .messages). For concurrent consumption in async code you await graph.astream_events(...) and asyncio.gather over the projections; for ordered consumption in sync code you use stream.interleave("values", "messages", "subgraphs"), which yields items in strict arrival order.

The evolution across versions, side by side:

Scenariov1 (default)v2v3
Single stream moderaw data (dict)StreamPart dict {type, ns, data}typed projection iterators
Multiple modes(mode, data) tuplessame StreamPart, branch on chunk["type"]iterate run.messages / run.values / …
LLM output(message_chunk, metadata)same, inside StreamPart.dataChatModelStream with .text / .reasoning / .tool_calls / .output.usage_metadata
Consumer codebranch on shapebranch on typeiterate the projection you want

And the channels, mapped to what you'd actually use them for:

ChannelProjectionUse
messagesrun.messageschat-model tokens, reasoning, tool-call args, usage
valuesrun.valuesfull state snapshots; stream.output is the final value
lifecyclerun.lifecyclerun / subgraph / subagent status transitions
(nested)run.subgraphsnested graph executions without namespace parsing
updates / custom / checkpoints / tasks / debugopt-in transformersper-node deltas, custom events, time-travel, task/debug detail

Custom projections via transformers

When the projection you want does not exist, you write a StreamTransformer: init(), process(event), finalize(), and fail(err). process sees every ProtocolEvent and returns True to pass it through or False to suppress it. A transformer declares required_stream_modes so the runtime knows which raw channels to emit — the runtime takes the union across all registered transformers, and a mode no transformer requests is never produced.

A token-usage aggregator is the obvious first one, and it doubles as an observability hook:

class StatsTransformer(StreamTransformer):
required_stream_modes = ("messages",)
def __init__(self, scope=()):
super().__init__(scope)
self.total = 0
self.log = StreamChannel[int]()
def init(self):
return {"total_tokens": self.log}
def process(self, event):
d = event["params"]["data"]
if isinstance(d, dict):
self.total += (d.get("usage") or {}).get("output_tokens") or 0
return True
def finalize(self):
self.log.push(self.total)
self.log.close()

Register it per call — stream_events(inp, version="v3", transformers=[StatsTransformer]) — or at compile time so every run produces the projection. StreamChannel is the projection primitive. A named channel (StreamChannel("total_tokens")) is exposed under stream.extensions and each push() also flows into the main stream as a custom:total_tokens event, so its payload must be serializable. An unnamed channel is a side projection only, which is where you keep in-process handles that can't be serialized. The built-in ToolCallTransformer uses exactly this contract to expose stream.tool_calls.

What this does for tracing and observability

The reason I care about projections is not ergonomics for its own sake — it's that the data I want for telemetry stops being something I scrape out of logs. The system I run already carries two observability planes: LangSmith tracing natively, and OpenTelemetry over OTLP, with the TypeScript client injecting W3C traceparent and langsmith-trace headers so a single trace spans the browser, the Vercel route, the FastAPI process, and the graph. The missing piece, on v2, is structured per-call signal inside the stream. v3's projections are that signal:

  • Per-call token cost and the reasoning split. message.output.usage_metadata gives input/output tokens per LLM call. Thinking-mode cost is otherwise close to invisible, because .content stays empty while the model reasons; message.reasoning makes those tokens a first-class projection you can meter.
  • Time-to-first-token. Timestamp the first item out of message.text against the run's start. On v2 you reconstruct this from a noisy delta stream; here it's one event.
  • Per-node and per-subgraph latency. run.lifecycle emits started and completed transitions, and seq on each ProtocolEvent gives you a reliable order without trusting wall-clock timestamps that can drift.
  • Spans that match the graph. run.subgraphs mirrors the execution topology, so an OpenTelemetry span tree can follow the actual DAG instead of a flattened log.

A small StreamTransformer that pushes usage and lifecycle events into your metrics exporter is the natural bridge: the transformer observes the raw events, and a named channel forwards the derived numbers into the same trace the rest of your request already belongs to.

The parts v3 does not do for you

Streaming over SSE still has sharp edges, and the framework owning the parsing doesn't make them disappear. The keepalive problem is the clearest example: a model that reasons silently produces no message.text tokens for tens of seconds, and an idle SSE connection gets dropped by whatever proxy sits between you and the browser. You still emit a periodic comment frame to keep the connection warm — v3 just makes the reason for the silence legible, because the reasoning deltas are a projection you can surface as a "thinking…" status instead of dead air. Error frames (message-finish carrying an error, or a failed lifecycle event) need handling so a mid-stream model failure becomes a clean event to your client rather than a truncated response. And client disconnects should abort the upstream run, or you pay for tokens nobody reads.

Honest limits

v3 requires a recent langgraph release; older versions expose only v1/v2 and the async astream_events, so plenty of running systems (mine included) still stream on v2. Before migrating, the edges worth knowing:

  • Structured output streams as characters, not prose. If a node uses JSON mode and the model emits {"subject": "...", "body": "..."}, the token stream is braces and quotes. The projection gives you tokens; the authoritative parsed result still comes from the final state, not from reassembling the stream. This is true on every version — it's a property of structured generation, not of v3.
  • Reasoning is separate from text. Read only message.text and you miss message.reasoning. If you want the full output you consume both.
  • Sync and async iterate differentlyfor versus async for over message.text. On Python < 3.11, async consumption needs explicit RunnableConfig propagation and a writer= argument instead of get_stream_writer, because that runtime can't carry the context automatically.
  • Named StreamChannel payloads must be serializable, since they become custom:<name> events on the main stream. Promises, async iterables, and class instances belong in unnamed channels.

When to reach for it

SituationWhat I'd do
New LangGraph project, no streaming code yetBuild on v3. The projections are simpler than the v2 branching, and the simplicity compounds as the graph grows subgraphs and parallel nodes.
Existing v1/v2 system in productionStay on v2 until the v3 surface you depend on has settled. The ergonomic win is real, but it does not by itself justify reworking a path you already ship.
You need reasoning deltas, tool-call argument streaming, or per-call usagev3 is the clean path; v2 makes you reconstruct all three by hand.
Python < 3.11 with asyncBudget time for the explicit RunnableConfig / writer= pattern before you commit.
Observability is a first-order requirementv3's run.lifecycle and run.subgraphs turn per-node latency and failure data into projections instead of log archaeology.

The shape of the change

The redesign reflects a move from "stream the tokens" to "treat a graph execution as a structured protocol." Branching on chunk["type"] is fine for a two-node graph; it grows badly the moment you add subgraphs, parallel nodes, and multiple coordinated LLM calls. v3 pulls that complexity into the framework, and the consumer code ends up doing what it says — iterate the messages, read the usage, await the output. I'm still on v2 in production, but I've already rewritten the consumer in a branch to see the diff, and the diff is the argument. The next time you write a streaming endpoint, you'll want to know which version you're reaching for.


References

  1. LangGraph (OSS) — Streaming: stream modes and version="v2". https://docs.langchain.com/oss/python/langgraph/streaming
  2. LangGraph (OSS) — Event streaming: version="v3", typed projections, transformers, the ProtocolEvent envelope and channel list. https://docs.langchain.com/oss/python/langgraph/event-streaming
  3. Agent Protocol — wire-level event and command formats: langchain-protocol (PyPI), @langchain/protocol (npm), and langchain-ai/agent-protocol on GitHub.
  4. LangSmith — tracing quickstart (native LangGraph tracing).
  5. OpenTelemetry — OTLP exporter and span model (vendor-neutral tracing referenced in the observability section).

Observable AI Memory: mem0, LangGraph, and Qdrant with Enterprise-Grade Telemetry

· 13 min read
Vadim Nicolai
Senior Software Engineer

Most "AI memory" demos stop at memory.add() and memory.search(). That works on a laptop. It does not survive contact with production. The real questions are: When this recall is slow, which store is to blame? When a graph's spend triples overnight, which feature caused it? When a customer asks "what did your agent remember about me, and when?", can you answer from an audit log instead of a shrug?

TL;DR — This field report shows how to build an agent memory layer where every operation honors a contract: fail-open, PII-safe, and fully instrumented. Three stores (mem0, Qdrant, LangGraph) are funneled through single chokepoints, and each chokepoint fans out to five telemetry sinks. The result is a stack that answers the hard production questions without guesswork.

Multi-Probe Bayesian Spam Gating: Filtering Junk Before Spending Compute

· 44 min read
Vadim Nicolai
Senior Software Engineer

In a B2B lead generation pipeline, every email that arrives costs compute. Scoring it for buyer intent, extracting entities, predicting reply probability, matching it against your ideal customer profile — each module is a DeBERTa forward pass. If 40% of inbound email is template spam, AI-generated slop, or mass-sent campaigns, you are burning 40% of your GPU budget on garbage.

The solution is a gating module: a spam classifier that sits at stage 2 of the pipeline and filters junk before anything else runs. But a binary spam/not-spam classifier is too blunt. You need to know why something is spam (template? AI-generated? role account?), how confident you are (is it ambiguous, or have you never seen this pattern before?), and which provider will block it (Gmail is stricter than Yahoo on link density).

This article documents a hierarchical Bayesian spam gating system with 4 aspect-specific attention probes, information-theoretic AI detection features, uncertainty decomposition, and a full Rust distillation path. The Python model trains on DeBERTa-v3-base. The Rust classifier runs at batch speed with 24 features and zero ML dependencies.