Skip to main content

Agentic CLEAR: Automating Multi-Level Agent Evaluation — and the Autonomy Gate It Unlocks

· 17 min read
Vadim Nicolai
Senior Software Engineer

Every team running an agent fleet has the same blind spot. Observability platforms—MLflow, Langfuse, home-grown OpenTelemetry—capture execution traces beautifully. They show you what the agent did. They say almost nothing about whether it did it well. So a developer opens the trace viewer, scrolls through a few hundred spans, and tries to eyeball a systemic failure out of thousands of runs. The research alternative is worse: hand-built error taxonomies that take weeks to annotate and go stale the moment the agent changes. What both approaches lack is automated multi-level agent evaluation—judgment of the trajectory itself, not just a record of it.

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents, by Yehudai, Eden, and Shmueli-Scheuer (2026) at IBM Research, attacks exactly this gap. It is an open-source Python package—pip install clear-eval—that reads raw agent traces and produces data-driven evaluation at three levels of granularity, surfaces recurring failure patterns without a predefined taxonomy, and renders the whole thing in an interactive dashboard. It reports up to 0.890 AUC for predicting trajectory success in a fully reference-less setting. This post walks through what the paper actually does, then shows how I wired the same multi-level shape into a 45-graph production fleet as an "autonomy gate"—the component that turns a human approval interrupt into a machine one.

What Agentic CLEAR Does: Evaluation Above the Observability Layer

Loading diagram…

The first design decision is the most important one: Agentic CLEAR operates above the observability layer rather than inside the agent. It does not ask you to instrument your graph with new callbacks or adopt its runtime. It consumes the traces your stack already emits. The supported matrix at release is concrete—LangGraph with MLflow, LangGraph with Langfuse, and CrewAI with Langfuse are all supported out of the box, and any framework at all is supported through a CSV adapter. OpenTelemetry-compatible traces from those sources are converted into a single unified representation before any judging happens, so the rest of the pipeline is framework-agnostic.

The second decision is that the error taxonomy is dynamic, not hand-crafted. Classic agent error analysis starts from a fixed list of failure categories that a human wrote down in advance. That list cannot adapt to a new domain, and building it is exactly the manual annotation labor the paper is trying to eliminate. Agentic CLEAR instead lets the failure patterns emerge from the traces themselves. It is built on top of the CLEAR LLM-as-a-Judge methodology—the AAAI-2026 "error analysis made easy" line of work from the same group—and extends it from single-LLM outputs to full agent trajectories.

The third decision is the one this whole post orbits: evaluation happens at three levels of granularity—system, node, and trace. The system level surfaces recurring failure patterns across every run of the agent. The node level isolates how an individual component (a specific sub-agent or tool node) tends to fail. The trace level inspects where one particular trajectory went wrong, step by step. Single-score judging collapses these into one number and loses the distinction; a fleet can have a healthy average and a single rotten node, and only multi-level evaluation tells them apart.

Inside the Multi-Level Judgment

Between preprocessing and the dashboard sits the evaluation core, where an LLM judge reads each trace three different ways. Agentic CLEAR scores trajectory success with three complementary methods, and the paper is careful to treat them as complementary signals rather than competitors.

Step-wise evaluation assigns a score to each individual step and averages them—a fine-grained read that rewards consistently sound intermediate reasoning. Trace-level evaluation issues one holistic judgment over the whole trajectory, asking the blunt question of whether the run, taken as a whole, succeeded. Rubric-based evaluation scores the proportion of explicit criteria the trajectory satisfied, turning a fuzzy "was this good" into a checklist ratio. The same trace gets all three, because each catches failures the others miss: a trajectory can have high average step quality yet fail holistically (good moves, wrong destination), or satisfy most rubric criteria yet stumble on one fatal step.

Those per-trace judgments then feed CLEAR Aggregation, which clusters the free-text feedback into system-wide issues and node-specific issues. This is the step that converts thousands of individual critiques into a short, ranked list of recurring problems a human can actually act on. Across all of its experimental configurations, Agentic CLEAR surfaced 195 unique recurring issues—each one discovered, not pre-declared.

The dashboard is the payoff surface, and it is hierarchical by design. The System View reconstructs the multi-agent topology—an interactive graph of agents and transitions with call counts—and surfaces global issues. The Node View enables per-component error analysis with filtering and drill-down into score distributions. The Trace View provides step-level inspection with rubric evaluation and dimension scores. Around those three core views sit a trajectory explorer (browse runs filtered by length, agent, or score), path analysis (common path patterns, success-versus-failure), temporal analysis (how agent position and score progress across steps), and a score-prediction panel with ROC curves. You launch it with a single command, run-clear-agentic-dashboard, after a run that on the sample data takes about two minutes over three traces.

What the Numbers Say

The evaluation is broad: Agentic CLEAR was run over five benchmarks and seven agentic settings spanning more than 1,100 traces. The configurations are worth walking because they cover genuinely different agent shapes.

AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents (Trivedi, Khot et al., 2024) supplied the hardest setting: the CUGA agent on a GPT-4o backbone, 417 traces. AppWorld is an interactive-coding world of 9 apps operable through 457 APIs, and even GPT-4o solves only about 49% of its "normal" tasks—so a trace set drawn from it is rich in genuine failures, exactly what an evaluator wants to be stressed against. It is also the setting where Agentic CLEAR's reference-less score prediction peaks, which is not a coincidence: harder benchmarks separate good trajectories from bad ones more sharply.

GAIA: A Benchmark for General AI Assistants (Mialon et al., 2023) contributed two of the seven settings: a generalist agent on Claude 4.5 Sonnet and the same agent on GPT-4.1, 165 traces each. Running one task distribution under two model backbones is the cleanest way to check that the evaluator's findings track the agent rather than the judge—two models, one benchmark, 165 traces apiece. The fifth benchmark, Hugging Face's open deep-research scaffold, added two more model settings—Claude 4.5 Sonnet over 165 traces and OpenAI o3 over 117—with traces sourced from the Holistic Agent Leaderboard (Stroebl et al., 2025), the 21,730-rollout harness built to standardize exactly this kind of agent evaluation.

The tool-use and software-engineering axes come from τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains (Yao et al., 2024) and SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (Jimenez et al., 2024): a generalist agent on Claude 3.7 Sonnet over 50 τ-bench traces, and on Claude 4.5 Sonnet over 50 SWE-bench Verified traces. SWE-bench draws on 2,294 real GitHub issues across 12 Python repositories, and τ-bench is hard enough that gpt-4o clears under 50% of its tasks—two more failure-dense corpora where an automated analyst earns its keep.

Two results matter most. The first is alignment with human taxonomies. The authors validate Agentic CLEAR's automatically generated issues against TRAIL: Trace Reasoning and Agentic Issue Localization (Deshpande et al., 2025), a benchmark of 148 human-annotated traces carrying 841 errors under a taxonomy of 12 reasoning and planning categories—a benchmark so hard that the strongest long-context model in the original TRAIL paper scored about 11%. With GPT-5 as the judge and full-plus-partial matching, Agentic CLEAR covers all 12 categories and beats frequency baselines on macro F1—without ever being handed the category definitions. The dynamic taxonomy rediscovers the human one, including low-frequency error types a frequency baseline would drop.

The second is score prediction, the result with the most direct engineering consequence. Can the framework predict whether a trajectory will succeed, using only its own judgments and no reference answer? The three scoring modes each yield a predictor, and per Agentic CLEAR the answer is yes: trace-level evaluation is generally the strongest, peaking at 0.890 AUC on AppWorld. The best method varies by benchmark and agent—step-wise wins some, rubric-based wins others—which is the empirical case for keeping all three rather than collapsing to one. The headline reads simply: a machine evaluator can predict task success, reference-free, well enough to act on.

A concrete example from the paper makes the system-versus-node split tangible. Running the CUGA agent on AppWorld (GPT-4o backbone, GPT-5 judge), the system-level issues read like execution-flow pathologies: inefficient execution and incomplete coverage, skipped pre/post-action checks, premature failure declaration with no fallback, contaminated shopping carts, brittle entity resolution. The node-level issues for the TaskDecompositionAgent read completely differently: it assumes unsupported app capabilities, fails to return the user's intent verbatim for single-app tasks, omits required parameters, and adds assumptions the user never gave. Same traces, two altitudes, two distinct repair lists. That is the whole argument for multi-level evaluation in one screenshot.

Applying It: An Autonomy Gate in a 45-Graph Fleet

Loading diagram…

Reading the paper is one thing; running its shape in production is another. My own system is a LangGraph fleet of 45 graphs—discovery, enrichment, classification, scoring, and outreach agents—on a single DeepSeek egress behind a Cloudflare AI Gateway, with LangSmith tracing and per-graph golden datasets gated at 0.80 accuracy. (A sourcing note: the fleet numbers here—45 graphs, the 0.80 composite bar, the 50-verdict flip criterion—are first-person measurements from my own deployment, not figures from the paper; every paper claim links to its source.) The fleet had plan and act. The verify step was a human: every outreach touch is composed, held as a pending draft, and stopped at an approval interrupt. Nothing sends without my decision.

That makes the human review gate the ceiling on the fleet's autonomy. Adding more autonomous actors—more discovery agents, more composers—does not raise it. Only automating the verify step does. So I borrowed Agentic CLEAR's central move and ran a multi-level verdict on each held draft, mapped onto the paper's three altitudes: a step check (scalar expectations compared deterministically with zero LLM calls; free-text ones batched into one judge call), a trajectory check (one judge call over the ordered step board—does each step advance the goal, is the ordering sane, are there redundant loops), and an outcome check (deterministic guardrails first—unresolved {{first_name}} placeholders, spam phrasing, empty or oversized copy—then one compliance-and-grounding judge). The composite is the mean of scored levels, gated at 0.80, and a pass requires composite ≥ 0.80 and zero hard violations. A level whose judge call failed is "unscored," and an unscored level can never auto-approve anything.

The integration point in the campaign graph is one new node: check_reply → compose_touch → gate_draft (NEW) → await_approval → send_touch. Shadow mode is the default—the verdict is recorded but the human interrupt always still fires, so day one has zero behavior change. Auto-approve mode is a flag: a verdict that passes with no hard violations and every requested level scored routes straight to send_touch with an 'auto' audit stamp. Judge outages, failures, and hard violations always fall back to the human, and gate errors fail open to the interrupt—evaluating must never block a draft. The gate rules on what is sent; the upstream do-not-contact list stays authoritative on who may be contacted.

The gate does not assert autonomy; it earns it. Every shadow verdict is later backfilled with my actual decision—approve, edit, reject, or skip. Agreement semantics are strict: only an outright approve counts as siding with a pass; an edit means the draft was not send-worthy as-is. The flip criterion to enable auto-approve is agreement ≥ 0.80 over at least 50 human-decided shadow verdicts and zero "rejected passes" (gate-passed drafts I outright rejected). One SQL query answers it. This is the staged-rollout pattern applied to autonomy itself: shadow, measure agreement, gate a slice, widen—exactly the trust-building that Agentic CLEAR's 0.890-AUC predictor makes defensible in principle, since a machine that can call task success reference-free is a machine you can start to measure against your own judgment.

The Failure Modes a Self-Certifying Gate Must Survive

Loading diagram…

A reference-less machine evaluator is only as trustworthy as its weakest judge, and several failure modes are baked into the architecture rather than bolted on.

Self-preference. A judge model scoring its own family's output inflates the score—a well-documented LLM-as-judge bias surveyed by Gu et al. (2024). My fleet runs on a single DeepSeek egress, so judge and composer share a family; the risk is everyday, not theoretical. Two mitigations hold: the deterministic guardrails are model-free and can veto any score, and the flip criterion is human agreement, never the judge's self-reported quality.

Prompt injection. The draft and its evidence are attacker-influenced text—a scraped bio can carry "ignore previous instructions." Every judge prompt fences run data as data behind an explicit do-not-follow-instructions wrapper. Wrappers are brittle and no one has proven a general defense, which is why the deterministic veto sits underneath them.

Goodhart's Law. If the composer can see the gate's exact regex and marker lists, it learns to pass the test rather than write well. The guardrail lists live only in the eval module, never in composer prompts—an architectural separation, not prompt-level obfuscation.

Judge outage. Per-level fail-open: the level goes unscored, a soft violation records the outage, deterministic checks still stand, and an unscored run can never auto-approve. A kill switch halts every LLM path and degrades the gate to its deterministic subset—safety over autonomy.

Calibration drift. Judge prompts carry a version string stamped into every verdict's provenance, and agreement stats recompute continuously from the persisted rows. A drifting judge surfaces as falling agreement—the same metric that enables auto-approve can trigger the rollback. This is the production echo of Agentic CLEAR's own discipline: the dynamic taxonomy that adapts to new behavior also keeps re-checking old behavior, so calibration is monitored rather than assumed.

The Gate Can Swing Back

The autonomy gate is not a one-way door, and that is the whole point of keeping the human infrastructure alive. If auto-approve degrades agreement below 0.80, the system reverts to shadow mode automatically. But the moment you delete the human interrupt entirely, you lose the ground truth that the agreement number is computed against—you can no longer tell whether the machine and the human still agree, because there is no human decision left to compare to.

That is also where Agentic CLEAR points next. It answers "where did the agent fail" at system, node, and trace level, with no reference answers and strong alignment to a human taxonomy it was never shown—a genuinely strong automated analyst. It does not answer "should this specific action ship without me," which is a risk decision, not an analysis one. The honest reading of the 0.890-AUC result is that the verify step is automatable in principle; how much agreement evidence you demand before you let the gate skip you is a policy you set, not a number the paper hands you. Build the multi-level evaluator, wire it in shadow first, demand your threshold of agreement, and keep the human decision rail alive so the gate can always swing back. The eval harness is not overhead on an agent fleet. It is the component that decides how much of the fleet you are willing to stop watching.


References