Why is evaluation the bottleneck for autonomous knowledge graphs?

Every edge inserted, relationship inferred, and hypothesis proposed can be wrong, and the only way to know is to verify — but a single LLM judge has inconsistent calibration across domains. The 2026 literature shows verification is itself becoming agentic: the evaluator must be as sophisticated as the generator.

What is agent-as-judge evaluation?

A Judge Agent scores a candidate edge against the supporting sub-graph, typically alongside a deterministic rule engine that checks schema and cardinality first. Edges that score at or above the 0.80 commit bar are committed; the rest are routed to debate or rejected. SAGE formalizes this judge-plus-rule-engine pattern.
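The routing described above can be sketched in a few lines of Python. This is a minimal illustration, not SAGE's implementation: `CandidateEdge`, `rule_check`, `route`, and the `judge_score` callable are hypothetical names, the duplicate check stands in for a real cardinality rule, and only the 0.80 bar and the relation names come from the series.

```python
from dataclasses import dataclass

COMMIT_BAR = 0.80  # commit threshold stated in the text

# Relation types named for the curriculum concept graph.
ALLOWED_RELATIONS = {
    "prerequisite", "builds_on", "contrasts_with",
    "part_of", "related", "applies_to",
}

@dataclass(frozen=True)
class CandidateEdge:  # hypothetical record shape
    source: str
    relation: str
    target: str

def rule_check(edge, existing):
    """Deterministic checks that run before any LLM judge is called."""
    if edge.relation not in ALLOWED_RELATIONS:
        return False  # schema violation
    if edge.source == edge.target:
        return False  # self-loop
    if edge in existing:
        return False  # duplicate (a simple stand-in for a cardinality rule)
    return True

def route(edge, existing, judge_score):
    """Return 'reject', 'commit', or 'debate' for one candidate edge."""
    if not rule_check(edge, existing):
        return "reject"
    # judge_score is an assumed callable wrapping the Judge Agent.
    return "commit" if judge_score(edge) >= COMMIT_BAR else "debate"
```

Running the cheap deterministic checks first means the judge is only called on edges that are at least structurally valid.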

How does multi-agent debate help, and how can it backfire?

For contested edges, two agents argue opposing positions and a moderator decides, which surfaces evidence a single judge misses. But naive debate can amplify error when agents share a base model and bias — so debate needs a strict grounding constraint (cite exact node IDs), a bounded number of rounds, and a moderator confidence threshold.
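A sketch of those three safeguards, under stated assumptions: the round limit of 3 and the moderator bar of 0.80 are illustrative values (the text only says "bounded" and "threshold"), and `Argument`, `grounded`, and `debate` are hypothetical names. The debaters and the moderator are plain callables standing in for LLM agents.

```python
from typing import NamedTuple

MAX_ROUNDS = 3         # assumed bound; the text only requires that one exists
MODERATOR_BAR = 0.80   # assumed to mirror the commit bar

class Argument(NamedTuple):
    claim: str
    cited_node_ids: tuple

def grounded(arg, subgraph_ids):
    """An argument counts only if it cites nodes and every cited node exists."""
    return bool(arg.cited_node_ids) and set(arg.cited_node_ids) <= subgraph_ids

def debate(pro, con, moderator, subgraph_ids):
    """Run a bounded debate; return the moderator's verdict or 'abstain'."""
    transcript = []
    for round_no in range(MAX_ROUNDS):
        for speaker in (pro, con):
            arg = speaker(round_no, transcript)
            if grounded(arg, subgraph_ids):
                transcript.append(arg)  # ungrounded arguments are dropped
        verdict, confidence = moderator(transcript)
        if confidence >= MODERATOR_BAR:
            return verdict
    return "abstain"  # no confident verdict within the round limit
```

Dropping ungrounded arguments before the moderator sees them is one way to keep two agents with a shared bias from persuading each other with unsupported claims.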

What is autonomous discovery over a knowledge graph?

Discovery samples concept pairs connected at two hops with no direct edge and proposes the plausible missing relationship — for example whether one curriculum concept is a prerequisite of another — constrained to the graph structure to suppress hallucinated links. Each candidate then runs through the same evaluate-debate-abstain pipeline.
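The candidate-sampling step can be illustrated as follows. This is a sketch that treats edges as plain `(source, target)` pairs and enumerates every two-hop pair rather than sampling; `two_hop_candidates` is a hypothetical helper name.

```python
from collections import defaultdict

def two_hop_candidates(edges):
    """Return (a, c) pairs joined through some b with no direct edge either way."""
    out = defaultdict(set)
    for src, dst in edges:
        out[src].add(dst)
    direct = set(edges)
    candidates = set()
    for a, mids in out.items():
        for b in mids:
            for c in out.get(b, ()):
                if c != a and (a, c) not in direct and (c, a) not in direct:
                    candidates.add((a, c))
    return sorted(candidates)

edges = [
    ("python", "embeddings"),
    ("embeddings", "rag"),
    ("python", "rag"),
    ("rag", "agents"),
]
pairs = two_hop_candidates(edges)
```

Here `pairs` holds `("embeddings", "agents")` and `("python", "agents")`; `("python", "rag")` is excluded because a direct edge already exists. Each pair would then go to the judge rather than being written directly.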

Why is abstain-under-uncertainty the default?

It is better to miss a valid edge than to insert a hallucinated one that pollutes every downstream query. When the judge or moderator cannot reach the confidence bar, the edge is logged for human review rather than committed — the graph must be trustworthy first and complete second.

What is a bi-temporal knowledge graph for agent memory?

It is an agent's long-term memory stored as a graph where every edge carries two timestamps: valid_at (when the fact held in the world) and recorded_at (when the agent learned it). A superseded fact is stamped invalid_at rather than deleted, so the memory is a revision-conscious archive instead of a flat log.

Why do flat logs and vector stores fail as agent memory?

A flat log records when something was said but offers no structure for reasoning across entities; a vector store finds semantically similar memories but has no notion of sequence or validity, so it cannot answer 'what changed between last week and today.' Long context dilutes attention across irrelevant history. Agents need structured, temporal, traceable memory — which graphs natively provide.

Why store two timestamps instead of one?

If a lesson reframes a concept on the 1st but the agent ingests the change on the 5th, only a bi-temporal graph can answer 'what did the agent believe on the 3rd?' — the old relationship, because the new fact had not been learned yet. valid_at captures world time; recorded_at captures ingestion time; the gap between them is where belief revision lives.
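The point-in-time question can be made concrete with a small sketch. It assumes one relationship slot per concept pair, so the newest version by `valid_at` wins; `Edge`, `believed_on`, and `true_on` are hypothetical names, and the dates and facts are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Edge:
    fact: str
    valid_at: date     # world time: when the fact started to hold
    recorded_at: date  # ingestion time: when the agent learned it

def believed_on(edges, day):
    """What the agent believed on `day`: only facts it had already recorded."""
    known = [e for e in edges if e.recorded_at <= day and e.valid_at <= day]
    return max(known, key=lambda e: e.valid_at).fact if known else None

def true_on(edges, day):
    """What held in the world on `day`, regardless of when it was ingested."""
    held = [e for e in edges if e.valid_at <= day]
    return max(held, key=lambda e: e.valid_at).fact if held else None

history = [
    Edge("rag builds_on embeddings", date(2026, 5, 1), date(2026, 5, 1)),
    Edge("rag part_of retrieval", date(2026, 6, 1), date(2026, 6, 5)),
]
belief = believed_on(history, date(2026, 6, 3))
truth = true_on(history, date(2026, 6, 3))
```

On the 3rd, `belief` is still the old relationship while `truth` is already the new one. A single-timestamp store cannot express that difference.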

How does the memory avoid unbounded growth without deleting?

Consolidation runs asynchronously with two levers: salience decay lowers the priority of edges that are not retrieved over time, and subsumption stamps invalid_at on a fact when a later valid_at supersedes it. Nothing is erased — superseded edges move to cold storage and stay queryable for audit and point-in-time questions.
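Both levers can be sketched briefly. The exponential form and the 30-day half-life are assumptions (the text does not specify a decay function), edges are plain dicts with ISO-8601 date strings, and `decayed_salience` and `subsume` are hypothetical names.

```python
HALF_LIFE_DAYS = 30.0  # assumed; the text does not give a decay rate

def decayed_salience(salience, days_since_retrieval):
    """Exponential decay: salience halves every HALF_LIFE_DAYS without retrieval."""
    return salience * 0.5 ** (days_since_retrieval / HALF_LIFE_DAYS)

def subsume(versions):
    """Stamp invalid_at on every version of one fact slot except the newest.

    `versions` are dicts describing the same relationship slot; ISO date
    strings sort chronologically, so a plain string sort is enough here.
    Nothing is removed from the list.
    """
    ordered = sorted(versions, key=lambda e: e["valid_at"])
    for older, newer in zip(ordered, ordered[1:]):
        if older.get("invalid_at") is None:
            older["invalid_at"] = newer["valid_at"]
    return ordered
```

The returned list has the same length as the input, which is the invariant that matters: superseded versions are stamped, not dropped.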

When should an agent use graph memory over vector memory?

Use graph memory when the agent must reason multi-hop across concepts, answer temporal point-in-time questions, and keep an audit trail — for example over an AI-engineering curriculum that is revised across many sessions. Vector memory is better for pure semantic similarity at very high write rates; long context suits a single static read.

What is a self-healing knowledge graph?

A self-healing knowledge graph runs a background loop that detects defects in the stored graph and fixes the graph itself — invalidating clear errors and quarantining the ambiguous ones — rather than only patching a downstream answer. The 2026 canonical scaffold is detect, repair, then inconsistency-tolerant reasoning.

Why is repairing the graph different from repairing the answer?

Answer-side filters patch a single response without touching the store, so the same wrong fact keeps triggering errors on every future query. Repairing the stored graph compounds its benefit across all future queries, which is where long-term data quality lives.

Does the loop ever delete data?

No. Every detected issue is handled by stamping invalid_at and setting status to invalidated, so the edge becomes non-queryable but stays in concept_edges for audit and possible reinstatement. Invalidation, not deletion, is the rule — the repair sweep has no hard-delete path.

How does the loop avoid making things worse?

It only auto-acts on unambiguous defects: prerequisite cycles and ungrounded edges below the 0.6 provenance floor are invalidated, while genuine contradictions between two plausible edges are quarantined for a human rather than guessed. And invalidation is soft — invalid_at is stamped, never a hard delete — so any wrong call is reversible on a later sweep.
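A minimal sketch of the two automatic repairs, assuming edges are dicts with `source`, `target`, `relation`, `confidence`, and `status` keys (a hypothetical shape). It invalidates every prerequisite edge that lies on a cycle, which is a simplification: a real sweep might invalidate only the weakest edge. Contradiction quarantine is left out.

```python
PROVENANCE_FLOOR = 0.6  # floor stated in the text

def reachable(adj, start, goal):
    """Iterative DFS: can `goal` be reached from `start`?"""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(adj.get(node, ()))
    return False

def sweep(edges, now):
    """Soft-invalidate cycle edges and ungrounded edges; never delete."""
    active = [e for e in edges if e["status"] == "active"]
    adj = {}
    for e in active:
        if e["relation"] == "prerequisite":
            adj.setdefault(e["source"], []).append(e["target"])
    for e in active:
        # A prerequisite edge is on a cycle if its target leads back to its source.
        on_cycle = (e["relation"] == "prerequisite"
                    and reachable(adj, e["target"], e["source"]))
        if on_cycle or e["confidence"] < PROVENANCE_FLOOR:
            e["status"] = "invalidated"
            e["invalid_at"] = now
    return edges
```

Because the sweep only flips `status` and stamps `invalid_at`, a later sweep can reinstate an edge by reversing those two fields.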

What is agentic GraphRAG?

Agentic GraphRAG replaces one-shot subgraph retrieval with a planning agent that traverses the knowledge graph step by step — deciding which concept to expand next, focusing the query each round, and synthesizing an answer from the accumulated sub-graph — until it can answer or must abstain. The 2026 corpus treats graph traversal as a sequential decision process rather than a single retrieval.

Why is multi-hop GraphRAG a planning problem, not a retrieval problem?

A multi-hop question requires sequential choices — which edge to follow, which node to expand, when to backtrack. A one-shot retrieval pass cannot make those choices: it returns a tangled subgraph and hopes the LLM resolves the chain. A planning agent makes each hop conditioned on what the previous hop found.

How does the agent avoid retrieving noise?

Each round expands a frontier of active concept-edge neighbours capped at 12, the query is focused on what the previous hop found, and explored concepts are forbidden from being revisited — so the frontier does not degrade into noise. The agent abstains after 4 rounds if confidence never clears 0.80.

When does the agent abstain?

An answer is emitted only if accumulated confidence reaches the 0.80 bar within 4 rounds; otherwise the agent abstains rather than guess. Abstention trades coverage for safety, which is the right default for explanation-critical answers over the curriculum graph.
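The bounded traversal can be sketched as a loop. The three constants come from the text; `traverse`, the `neighbours` callable, and the `confidence` callable (which stands in for the LLM scoring the accumulated sub-graph) are hypothetical, and query focusing is omitted.

```python
FRONTIER_CAP = 12      # maximum concepts expanded per round
MAX_ROUNDS = 4         # abstain after this many rounds
CONFIDENCE_BAR = 0.80  # minimum confidence to emit an answer

def traverse(start, neighbours, confidence):
    """Expand the graph hop by hop; answer once confident, else abstain."""
    explored = {start}
    frontier = [start]
    subgraph = []  # accumulated (from, to) edges, returned as evidence
    for _ in range(MAX_ROUNDS):
        next_frontier = []
        for node in frontier:
            for nxt in neighbours(node):
                if nxt in explored or len(next_frontier) >= FRONTIER_CAP:
                    continue
                explored.add(nxt)  # explored concepts are never revisited
                next_frontier.append(nxt)
                subgraph.append((node, nxt))
        if confidence(subgraph) >= CONFIDENCE_BAR:
            return {"status": "answer", "evidence": subgraph}
        if not next_frontier:
            break  # dead end: nothing left to expand
        frontier = next_frontier
    return {"status": "abstain", "evidence": subgraph}
```

Returning the evidence sub-graph in both outcomes keeps an abstention auditable: a reviewer can see how far the agent got before it gave up.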

Where do RL-trained graph agents fit?

The 2026 RL papers — GraphDancer, AgentGL, GraphScout, TKG-Thinker, HyperGraphPro — learn traversal policies that dynamically adjust the frontier and recover from dead ends. This design approximates those benefits with fixed parameters and no training; fine-tuning on traversal trajectories is the deferred next step.

What is autonomous knowledge graph construction?

It is the pattern where one agent loop owns the full lifecycle of a knowledge graph — reading a source, searching the existing graph, verifying a candidate fact against evidence, and writing it with create/update/retract operations — instead of a one-shot batch extraction. The 2026 RAGA framework formalizes this as a Read-Search-Verify-Construct loop over a CRUD toolset.

How is an agentic KG builder different from a batch extraction pipeline?

A batch pipeline extracts triples from each document independently and merges them later, so it cannot consult the graph already built while it decides. An agentic builder is stateful: each write is a function of the current graph, so it deduplicates, reconciles a contradiction, or refuses an ungroundable fact before the write lands — not in a downstream cleanup pass.

How does evidence anchoring prevent hallucinated triples?

Every candidate triple must carry a source span and a confidence score. A triple below the 0.6 confidence gate, or with no retrievable evidence span, is not written — it is held for review. This makes every edge auditable, but it verifies provenance, not truth, which is why the design ships advisory-by-default.
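As a rough sketch, the gate reduces to two checks before any write. The 0.6 threshold is from the text; `anchor`, the dict shape, and the exact-substring test for a "retrievable" span are assumptions.

```python
CONFIDENCE_GATE = 0.6  # gate stated in the text

def anchor(triple, source_text):
    """Return 'write' only when the evidence span is found and confidence clears the gate."""
    span = triple.get("evidence", "")
    if not span or span not in source_text:
        return "hold_for_review"  # no retrievable evidence span
    if triple.get("confidence", 0.0) < CONFIDENCE_GATE:
        return "hold_for_review"  # below the confidence gate
    return "write"

lesson = "Embeddings are a prerequisite for retrieval-augmented generation."
decision = anchor(
    {"evidence": "Embeddings are a prerequisite", "confidence": 0.9},
    lesson,
)
```

Note what the check does not establish: the span exists in the lesson, but nothing here confirms that the lesson is right.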

Does the builder ever delete data?

No. The mutation protocol exposes create, update, and invalidate operations; invalidate stamps invalid_at and sets status to invalidated on the prior version rather than hard-deleting it. A contradicted edge is invalidated, never erased, so the graph keeps a full audit trail and supports rollback.
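The invalidate operation can be illustrated against SQLite, which D1 is built on. This sketch uses Python's `sqlite3` as a local stand-in for D1 and for the TypeScript write layer; the table name `concept_edges` and the `invalid_at` and `status` fields come from the series, while the remaining columns and the `invalidate_edge` helper are assumptions.

```python
import sqlite3

# Assumed minimal schema; only the table name, status, and invalid_at
# are taken from the text.
SCHEMA = """
CREATE TABLE concept_edges (
    id INTEGER PRIMARY KEY,
    source TEXT NOT NULL,
    relation TEXT NOT NULL,
    target TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'active',
    invalid_at TEXT
)
"""

def invalidate_edge(conn, edge_id, now):
    """Soft-invalidate one active edge; return True if a row changed."""
    cur = conn.execute(
        "UPDATE concept_edges SET invalid_at = ?, status = 'invalidated' "
        "WHERE id = ? AND status = 'active'",
        (now, edge_id),
    )
    conn.commit()
    return cur.rowcount == 1
```

There is no `DELETE` statement anywhere in the helper, so the row count of `concept_edges` never shrinks and a rollback is a second `UPDATE`.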

Where does autonomous construction fit in the AI-engineer roadmap?

It turns the curriculum's lesson markdown into a queryable concept graph that learners and downstream agents read. Because every edge is anchored to an evidence span in a lesson and confidence-scored, the tutoring and recommendation agents built on top can explain why one concept is a prerequisite for another, not just assert it.

What is a self-healing loop in NL-to-SQL?

It is an automated feedback loop. A failed query's database error becomes the repair signal. The model diagnoses that error and regenerates a corrected query, bounded here to two repair attempts.
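A minimal sketch of that loop, using Python's `sqlite3` as a local stand-in for D1. The `generate` callable is a hypothetical wrapper around the LLM call, and `run_with_repair` is an illustrative name, not the production graph's API.

```python
import sqlite3

MAX_REPAIRS = 2  # repair bound stated in the text

def run_with_repair(conn, question, generate):
    """Execute generated SQL; on failure, feed the error text back and retry.

    generate(question, previous_sql, error) -> sql is an assumed callable
    standing in for the model; on the first call both extras are None.
    """
    sql, error = None, None
    for _ in range(1 + MAX_REPAIRS):  # one initial try plus bounded repairs
        sql = generate(question, sql, error)
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            error = str(exc)  # the database's own message is the repair signal
    raise RuntimeError(f"gave up after {MAX_REPAIRS} repairs: {error}")
```

The bound matters as much as the loop: without it, a model that keeps producing the same broken query would retry forever.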

Does Cloudflare D1 support the SQL that CRM analytics needs?

D1 uses SQLite semantics. It supports joins, aggregations, and subqueries — enough for the funnel and attribution queries here. The graph emits SQLite-only syntax, never PostgreSQL casts or ILIKE.

How does the system prevent a destructive query?

A two-layer SELECT-only gate guards execution. The query must start with SELECT or WITH, and a statement-boundary regex blocks every write or DDL keyword. Repaired queries re-enter the same gate, so a repair can never widen permissions.
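One way to sketch such a gate is shown here. It is a stricter variant than the one described: instead of matching only at statement boundaries, it rejects any forbidden keyword at a word boundary and any embedded semicolon, because SQLite allows a `WITH` prefix on `DELETE`, `INSERT`, and `UPDATE`. The keyword list and the `is_read_only` name are assumptions.

```python
import re

# Assumed keyword list; extend it to match the target database.
FORBIDDEN = re.compile(
    r"\b(?:INSERT|UPDATE|DELETE|REPLACE|CREATE|ALTER|DROP|TRUNCATE"
    r"|ATTACH|DETACH|PRAGMA|VACUUM|REINDEX)\b",
    re.IGNORECASE,
)

def is_read_only(sql):
    """Layer 1: must start with SELECT or WITH. Layer 2: no write or DDL keyword."""
    stripped = sql.strip().rstrip(";").strip()
    if not re.match(r"(?:SELECT|WITH)\b", stripped, re.IGNORECASE):
        return False
    if ";" in stripped:
        return False  # more than one statement
    return FORBIDDEN.search(stripped) is None
```

This version fails closed: a harmless query that mentions `delete` inside a string literal, or calls the SQL `replace()` function, is also rejected. That trade is acceptable when the cost of a false accept is a mutated table.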

Can the same pattern run on Postgres or MySQL?

The gate and repair loop generalize, but the SQL dialect and the D1 transport (infra.db.d1_all) are D1-specific. The self-healing pattern itself is database-agnostic.

6 posts tagged with "Cloudflare D1"

Cloudflare's serverless SQLite database (D1) as the data plane for agentic apps, analytics, and NL-to-SQL.


Closing the Loop: Evaluation, Debate, and Discovery

July 3, 2026 · 14 min read

Vadim Nicolai

Senior Software Engineer

The most stubborn bottleneck in autonomous knowledge graphs is not retrieval accuracy or latency — it is evaluation. Every edge inserted, every relationship inferred, every hypothesis proposed can be wrong, and the only way to know is to verify. But verification is itself becoming an agentic problem, and the 2026 literature is blunt about it: the evaluator must be as sophisticated as the generator. The question is no longer whether to close the loop but how — and the answer is a layered design that combines a deterministic rule engine, an agent-as-judge, multi-agent debate for contested edges, and autonomous discovery, all gated by a hard abstain-under-uncertainty rule.

This is article #5, the final guardrail in the Autonomous Knowledge Graphs series. It closes the loop over the graph that #1 builds, #2 reasons over, #3 repairs, and #4 remembers. Every design in the series obeys the same engineering constraints: a control plane built on LlamaIndex — DeepSeek as the LLM client, its PropertyGraphIndex for retrieval — with the autonomous loop itself written in plain Python rather than run by a workflow or graph-orchestration engine, over a Cloudflare D1 concept-graph data plane (concepts, concept_edges, lesson_concepts), with a thin TypeScript layer applying every write; DeepSeek-only model egress through one Cloudflare AI Gateway; a grounding-first record on every write — {confidence, reason, source, evidence} with bi-temporal valid_at/recorded_at stamps; and invalidate-not-delete at every irreversible step. The worked example throughout is the AI-engineer curriculum concept graph — concepts linked by prerequisite, builds_on, contrasts_with, part_of, related, and applies_to. Here the loop runs with a ≥ 0.80 commit bar on every edge and grounding-first provenance throughout.

The Graph as Agent Memory

July 2, 2026 · 15 min read

Vadim Nicolai

Senior Software Engineer

The graph as agent memory rejects the notebook metaphor. A notebook remembers what you wrote, but not when you believed it, nor when the fact itself was true. Flat vector stores and long-context transformers collapse time into a single present, and an agent that cannot distinguish "I knew this yesterday" from "this is still true today" is not reasoning — it is repeating. A bi-temporal knowledge graph — one that records both valid_at (when the fact held in the world) and recorded_at (when the agent ingested it) — turns memory from a static log into a navigable, revision-conscious archive where nothing is deleted and facts are superseded by stamping invalid_at.

This is article #4 in the Autonomous Knowledge Graphs series. The AI-engineer curriculum concept graph from #1 doubles as the agent's long-term, revision-conscious memory of the curriculum as it evolves across months of sessions, under the same engineering constraints: a control plane built on LlamaIndex — DeepSeek as the LLM client, its PropertyGraphIndex for retrieval — with the autonomous loop itself written in plain Python rather than run by a workflow or graph-orchestration engine, over a Cloudflare D1 concept-graph data plane (concepts, concept_edges, lesson_concepts), with a thin TypeScript layer applying every write; DeepSeek-only model egress through one Cloudflare AI Gateway; a grounding-first record on every write — {confidence, reason, source, evidence} with bi-temporal valid_at/recorded_at stamps; and invalidate-not-delete at every irreversible step.

Self-Healing Knowledge Graphs: Graphs That Fix Themselves

July 1, 2026 · 15 min read

Vadim Nicolai

Senior Software Engineer

Provenance is not truth. A triple can be perfectly traced to a published source and still be wrong — contradicted by a later signal, inconsistent with the schema, or hallucinated by the model that extracted it. The industry has spent years building better provenance; the harder problem is what to do when provenance says the fact is sourced but the fact is still garbage. The sharpest 2026 statement of this is TGComplete, which finds that most gold-correct edges have no supporting passage even under exhaustive retrieval — so textual verification measures provenance, not correctness (Kang et al., 2026, arXiv:2606.15833).

This is article #3 in the Autonomous Knowledge Graphs series, and it is a guardrail. Where #1 builds the curriculum concept graph and #2 reasons over it, this article keeps it accurate over time. Every design in the series obeys the same engineering constraints: a control plane built on LlamaIndex — DeepSeek as the LLM client, its PropertyGraphIndex for retrieval — with the autonomous loop itself written in plain Python rather than run by a workflow or graph-orchestration engine, over a Cloudflare D1 concept-graph data plane (concepts, concept_edges, lesson_concepts), with a thin TypeScript layer applying every write; DeepSeek-only model egress through one Cloudflare AI Gateway; a grounding-first record on every write — {confidence, reason, source, evidence} with bi-temporal valid_at/recorded_at stamps; and invalidate-not-delete at every irreversible step. This guardrail runs as a background repair sweep over the stored concept graph.

Reasoning Over the Graph: From GraphRAG to Planning Agents

June 30, 2026 · 14 min read

Vadim Nicolai

Senior Software Engineer

Agentic GraphRAG treats the knowledge graph not as a static index to retrieve from once, but as a state space to reason over one node at a time. GraphRAG proved that structured knowledge could be retrieved at generation time — but a one-shot subgraph either drowns the LLM in irrelevant triples or misses the one critical edge. A question like "what must a learner master before agent orchestration, and which of those concepts does RAG build on?" is a sequence of decisions: which edge to follow, which concept to expand, when to backtrack. That is a planning problem, and the 2026 research corpus has converged on agentic traversal to solve it.

This is article #2 in the Autonomous Knowledge Graphs series. It reasons over the curriculum concept graph that article #1 builds, and obeys the same engineering constraints: a control plane built on LlamaIndex — DeepSeek as the LLM client, its PropertyGraphIndex for retrieval — with the autonomous loop itself written in plain Python rather than run by a workflow or graph-orchestration engine, over a Cloudflare D1 concept-graph data plane (concepts, concept_edges, lesson_concepts), with a thin TypeScript layer applying every write; DeepSeek-only model egress through one Cloudflare AI Gateway; a grounding-first record on every write — {confidence, reason, source, evidence} with bi-temporal valid_at/recorded_at stamps; and invalidate-not-delete at every irreversible step. The worked example is an explainable answer over the curriculum graph: the agent returns not just an answer but the supporting concept sub-graph as evidence.

Autonomous Knowledge Graph Construction: Graphs That Build Themselves

June 29, 2026 · 17 min read

Vadim Nicolai

Senior Software Engineer

Autonomous knowledge graph construction is the pattern where one agent loop owns the entire lifecycle of a graph — read a source, search what is already known, verify a candidate fact, then write it — instead of running a one-shot batch extraction and hoping a later merge step cleans up the mess. The cleanest 2026 formulation is RAGA, which gives an LLM agent a CRUD toolset over the graph and constrains it with a Read-Search-Verify-Construct loop (Han & Cheng, 2026, arXiv:2605.17072).

This is the first article in a new series, Autonomous Knowledge Graphs, a connected five-part arc — from human-curated graphs up to graphs that build, reason, repair, remember, and evaluate themselves. Every design in the series obeys the same engineering constraints: a control plane built on LlamaIndex — DeepSeek as the LLM client, its PropertyGraphIndex for retrieval — with the autonomous loop itself written in plain Python rather than run by a workflow or graph-orchestration engine, over a Cloudflare D1 concept-graph data plane (concepts, concept_edges, lesson_concepts), with a thin TypeScript layer applying every write; DeepSeek-only model egress through one Cloudflare AI Gateway; a grounding-first record on every write — {confidence, reason, source, evidence} with bi-temporal valid_at/recorded_at stamps; and invalidate-not-delete at every irreversible step. The worked example throughout is the AI-engineer curriculum concept graph — concepts linked by prerequisite, builds_on, contrasts_with, part_of, related, and applies_to.

NL-to-SQL CRM Analytics on Cloudflare D1 + Self-Healing

June 25, 2026 · 22 min read

Vadim Nicolai

Senior Software Engineer

A sales operator types "how many fintech contacts replied last week?" and gets an answer. No one writes SQL. This is NL-to-SQL CRM analytics on Cloudflare D1: the text_to_sql graph translates the question, runs it on D1, and — when the query fails — heals itself from the database's own error message. That last move is the load-bearing idea behind the self-healing loop: the database is not a passive recipient of your SQL. It is the most honest verifier you have.

That inversion drives Evaluating Open-Source LLM Agents for SQL Generation and Structured Analytics on Relational Databases, by Borovčak, Bagić Babac, and Mornar in Computers, Materials & Continua (2026). You do not demand a perfect one-shot translation. You let the query run, read the error, and regenerate against that diagnostic. The error text is the repair signal. Execution accuracy, not string overlap, is the metric that counts. The 7 numbered findings below are the evidence, and they map onto a 7-node production graph.

This is article #5 of a 10-part series, "The Autonomous Sales Fleet" — one production LangGraph + DeepSeek + Cloudflare-D1 + LangSmith system. Each part realizes one 2026 paper as one real graph. This one is the text_to_sql graph in backend/graphs/text_to_sql_graph.py, one of 39 registered in the fleet. It answers questions over the 4 CRM tables in the Cloudflare D1 database lead-gen-jobs. It generates a SELECT, validates it against a hard read-only gate, runs it, and repairs its own failures up to 2 times. No write path is ever reachable.

On the fleet's autonomy ladder this capability sits at the medium rung. It fully automates the plan→act span for a read-only analytics question. The graph translates intent to SQL, runs it, and heals its own failures with no human writing a query. The database's SELECT-only gate is what lets it act unattended. The operator reading the 1-to-2 sentence summary is the verify step. It earns that autonomy because the action space is structurally incapable of mutating data. A write-capable version would drop back down the ladder, behind human approval.

Two siblings frame this one. Article #1, Reason→Decompose→Act→Verify — an Autonomous CRM Orchestrator on LangGraph, reasons over signals and dispatches worker graphs. This graph answers the operator's question about the pipeline itself. Article #9, Evidence-Driven Release Gates for LLM Sales Agents, is the eval harness. It holds every prompt path here to the fleet's ≥0.80 bar before a change ships.