Agentic CLEAR: Automating Multi-Level Agent Evaluation — and the Autonomy Gate It Unlocks
Every team running an agent fleet has the same blind spot. Observability platforms—MLflow, Langfuse, home-grown OpenTelemetry—capture execution traces beautifully. They show you what the agent did. They say almost nothing about whether it did it well. So a developer opens the trace viewer, scrolls through a few hundred spans, and tries to eyeball a systemic failure out of thousands of runs. The research alternative is worse: hand-built error taxonomies that take weeks to annotate and go stale the moment the agent changes. What both approaches lack is automated multi-level agent evaluation—judgment of the trajectory itself, not just a record of it.
Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents, by Yehudai, Eden, and Shmueli-Scheuer (2026) at IBM Research, attacks exactly this gap. It is an open-source Python package—pip install clear-eval—that reads raw agent traces and produces data-driven evaluation at three levels of granularity, surfaces recurring failure patterns without a predefined taxonomy, and renders the whole thing in an interactive dashboard. It reports up to 0.890 AUC for predicting trajectory success in a fully reference-less setting. This post walks through what the paper actually does, then shows how I wired the same multi-level shape into a 45-graph production fleet as an "autonomy gate"—the component that turns a human approval interrupt into a machine one.
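For readers who want a concrete handle on the AUC figure: AUC is the probability that a randomly chosen successful trajectory receives a higher score than a randomly chosen failed one. The sketch below computes it by direct pairwise comparison. The function name, variable names, and toy scores are all invented for illustration; none of this is the clear-eval API or data from the paper.

```python
from itertools import product


def pairwise_auc(scores, succeeded):
    """Fraction of (success, failure) pairs where the success scores higher.

    Ties count as half a win, which matches the usual definition of ROC AUC.
    """
    pos = [s for s, ok in zip(scores, succeeded) if ok]
    neg = [s for s, ok in zip(scores, succeeded) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one successful and one failed trajectory")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Hypothetical judge scores for five trajectories (toy data, not from the paper).
judge_scores = [0.9, 0.8, 0.35, 0.6, 0.2]
# Ground-truth outcomes, used only to measure the judge, never shown to it.
task_succeeded = [True, True, True, False, False]

auc = pairwise_auc(judge_scores, task_succeeded)
print(f"AUC = {auc:.3f}")
```

In this toy run, five of the six success/failure pairs are ranked correctly, so the AUC is 5/6. "Reference-less" means the scorer itself never sees a gold answer; outcome labels enter only at this measurement step.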
