Reasoning Over the Graph: From GraphRAG to Planning Agents

· 13 min read
Vadim Nicolai
Senior Software Engineer

Agentic GraphRAG treats the knowledge graph not as a static index to retrieve from once, but as a state space to reason over one node at a time. GraphRAG proved that structured knowledge could be retrieved at generation time — but a one-shot subgraph either drowns the LLM in irrelevant triples or misses the one critical edge. A question like "which board members at our lead's parent company also sit on our other accounts' boards?" is a sequence of decisions: which edge to follow, which node to expand, when to backtrack. That is a planning problem, and the 2026 research corpus has converged on agentic traversal to solve it.

This is article #2 in the Autonomous Knowledge Graphs series. It reasons over the lead and account graph that article #1 builds, and obeys the same fleet constraints: a LangGraph control plane, a Cloudflare D1 data plane, DeepSeek-only egress, grounding-first provenance, a ≥ 0.80 eval bar on every prompt path, and draft-first human approval. The worked example is an explainable lead-and-account recommendation: the agent returns not just an answer but the supporting sub-graph as evidence.

The One-Shot Ceiling

A vanilla GraphRAG pipeline retrieves a subgraph in a single pass — often a community summary of a few dozen nodes. For "what acquisitions did our lead's parent company make last year?" that subgraph may contain the right path, but tangled with unrelated subsidiaries, historical partnerships, and product edges. The LLM must extract the answer from a noisy context. GraphRAG has no decision loop: it retrieves once and hopes the model resolves the chain. Multi-hop reasoning needs sequential decisions, and that is exactly what the 2026 traversal literature supplies.

Agentic Traversal: The 2026 Foundation

The corpus provides the blueprint. GraphSearch proposes a graph-aware query planner that recursively expands subgraphs, treating each expansion as an action conditioned on the current query state (Liu et al., 2026, arXiv:2601.08621). S-Path-RAG combines semantic shortest-path traversal with an iterative LLM loop to reduce the search space to the most promising paths (Fu et al., 2026, arXiv:2603.23512). Both argue traversal should be adaptive: the next hop depends on what the previous hop found.

The reinforcement-learning thread pushes further. GraphDancer uses two-stage curriculum post-training — single-hop navigation first, then multi-hop reasoning in a Think-Act-Observe loop (Bai et al., 2026, arXiv:2602.02518). GraphWalker introduces a synthetic-trajectory curriculum that teaches the agent to reflect on mistakes and recover from invalid paths (Xu et al., 2026, arXiv:2603.28533). AgentGL applies graph-conditioned curriculum RL that gradually widens the exploration horizon (Sun et al., 2026, arXiv:2604.05846); GraphScout gives the LLM an intrinsic exploration bonus so it avoids dead ends (Ying et al., 2026, arXiv:2603.01410); and TKG-Thinker adds a temporal dimension, learning to prefer recent edges over stale ones (Jiang et al., 2026, arXiv:2602.05818). These are method papers; none reports a benchmark in the grounding corpus used here, so the design parameters below are stated as design choices, not borrowed numbers.

Reference Architecture: A Planning Loop on LangGraph

ProGraph-R1 contributes the load-bearing idea: progress-aware reward reshaping that credits the agent for stepping toward the answer at each hop, not only at the end (Park et al., 2026, arXiv:2601.17755). This design maps that progress logic into a LangGraph state machine without RL training — the reward becomes an explicit per-round confidence check.

The state tracks the current query, the visited node IDs, the frontier set (capped at 12 nodes), the accumulated evidence sub-graph, and a scalar confidence in [0, 1]. Each round:

  • The agent node makes exactly 1 LLM call to rewrite the query from the prior round's evidence, instructed not to revisit explored nodes.
  • Two tool nodes run in parallel: graph traversal expands the immediate neighbours of frontier nodes; vector recall runs similarity search over node attributes.
  • A fuser combines traversal relevance with vector cosine similarity and retains a candidate only if the combined score clears 0.70.
  • The verify node computes confidence: at ≥ 0.80 it emits a recommendation plus the supporting sub-graph; otherwise control returns to the agent for up to 4 rounds, after which the agent abstains.
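The round structure above can be sketched in plain Python. This is a minimal, framework-free illustration, not the LangGraph implementation itself. `PlanState`, `fuse`, `run_loop`, and the four callables (`rewrite`, `expand`, `recall`, `verify`) are hypothetical names. The weighted-mean fusion rule and `alpha = 0.5` are assumptions, since the design does not say how the two scores are combined. Vector recall is simplified to a per-candidate score, and the two tools run sequentially here rather than in parallel. Only the constants 12, 0.70, 0.80, and 4 come from the design.

```python
from dataclasses import dataclass, field
from typing import Optional

FRONTIER_CAP = 12      # max frontier nodes per round (from the design)
FUSE_THRESHOLD = 0.70  # combined-score bar for keeping a candidate
CONFIDENCE_BAR = 0.80  # verify bar for emitting a recommendation
MAX_ROUNDS = 4         # rounds allowed before the agent abstains


@dataclass
class PlanState:
    query: str
    frontier: list[str]
    visited: set[str] = field(default_factory=set)
    evidence: list[tuple[str, str]] = field(default_factory=list)  # (src, dst) edges
    confidence: float = 0.0


def fuse(traversal: float, cosine: float, alpha: float = 0.5) -> float:
    """Assumed fusion rule: a weighted mean of the two scores."""
    return alpha * traversal + (1.0 - alpha) * cosine


def run_loop(state, rewrite, expand, recall, verify) -> Optional[PlanState]:
    """Return the final state on success, or None when the agent abstains."""
    for _ in range(MAX_ROUNDS):
        state.query = rewrite(state)          # the single LLM call of the round
        state.visited.update(state.frontier)  # never expand a node twice
        candidates: dict[str, tuple[str, float]] = {}
        for src in state.frontier:
            for dst, rel_score in expand(src):  # immediate neighbours only
                if dst in state.visited:
                    continue
                score = fuse(rel_score, recall(state.query, dst))
                best = candidates.get(dst, ("", 0.0))[1]
                if score >= FUSE_THRESHOLD and score > best:
                    candidates[dst] = (src, score)
        ranked = sorted(candidates.items(), key=lambda kv: kv[1][1], reverse=True)
        kept = ranked[:FRONTIER_CAP]
        state.evidence.extend((src, dst) for dst, (src, _) in kept)
        state.frontier = [dst for dst, _ in kept]
        state.confidence = verify(state)
        if state.confidence >= CONFIDENCE_BAR:
            return state  # recommendation plus supporting sub-graph
        if not state.frontier:
            break         # nothing left to expand
    return None           # abstain
```

In a LangGraph deployment each callable would become a node and the `for` loop a conditional edge back to the agent; the sketch shows only the control flow and the abstain path.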

Total LLM invocations per query run roughly 2–5 — a deliberate 2–5× multiplier over a single-shot GraphRAG call, traded for reliability on multi-hop queries. The exact token cost depends on the graph and query mix and is not asserted here.

The Loop in Detail

Round 1 seeds from a vector search (top 3–5 matching nodes), expands their neighbourhood, and rewrites the query to focus the next hop. Round 2 expands the new frontier, the fuser drops below-threshold neighbours, and if the evidence sub-graph already contains a clear path the agent can exit early at ≥ 0.80. Rounds 3–4 are reached only for ambiguous or many-hop queries, each rewrite narrowing the relation types followed. The output is a recommendation — "prioritise outreach to the board member who sits on three of your top accounts" — together with the sub-graph that justifies it.
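Seed selection in round 1 can be illustrated with a small top-k filter. This is a sketch under stated assumptions: `cosine` and `select_seeds` are hypothetical helpers, the embeddings are plain Python lists, and the `min_sim = 0.75` floor is an illustrative value that the design does not specify. Only the "top 3–5" cap comes from the text.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_seeds(query_vec, node_vecs, k=5, min_sim=0.75):
    """Keep at most k nodes whose similarity to the query clears min_sim."""
    scored = [(node_id, cosine(query_vec, vec)) for node_id, vec in node_vecs.items()]
    scored = [(node_id, sim) for node_id, sim in scored if sim >= min_sim]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (hypothetical data): only close matches survive the floor.
nodes = {"acme": [1.0, 0.0], "acme_holdings": [0.9, 0.1], "unrelated": [0.0, 1.0]}
seeds = select_seeds([1.0, 0.0], nodes, k=5)
print([node_id for node_id, _ in seeds])
```

Applying the similarity floor before the top-k cut means a generic query can yield fewer than three seeds, which keeps a weak match from flooding the frontier.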

Where RL-Trained Graph Agents Take This

This prompt-based loop approximates some RL benefits through the fixed 12-node frontier cap and the 0.70 fusion threshold. A trained agent would instead learn to adjust the frontier and threshold dynamically by graph density (AgentGL), recover from invalid paths (GraphWalker), and decay confidence by edge age (TKG-Thinker). The next logical step is to fine-tune on traversal trajectories with ProGraph-R1-style reward reshaping — which needs a training pipeline most teams do not have, so the prompt-based loop is the practical default until then.
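Until a learned policy is available, edge-age decay can be approximated with a fixed rule. The sketch below is a hand-set stand-in, not TKG-Thinker's learned behaviour: `age_decay`, `decayed_confidence`, the exponential form, the 365-day half-life, and the choice to scale by the stalest edge are all assumptions made for illustration.

```python
from datetime import date


def age_decay(observed: date, today: date, half_life_days: float = 365.0) -> float:
    """Weight in (0, 1]: halves every half_life_days; future dates count as age 0."""
    age_days = max((today - observed).days, 0)
    return 0.5 ** (age_days / half_life_days)


def decayed_confidence(base: float, edge_dates: list[date], today: date,
                       half_life_days: float = 365.0) -> float:
    """Scale path confidence by its stalest edge (an assumed aggregation rule)."""
    if not edge_dates:
        return base
    weakest = min(age_decay(d, today, half_life_days) for d in edge_dates)
    return base * weakest


# A path with one fresh edge and one edge observed a year earlier.
today = date(2026, 1, 1)
print(decayed_confidence(0.9, [date(2026, 1, 1), date(2025, 1, 1)], today))
```

With such a rule in place, a path whose raw confidence clears 0.80 can still fall below the bar when one of its edges is old, which pushes the loop toward fresher evidence or abstention.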

Failure Modes and Mitigations

  1. Over-traversal. A generic seed can flood the frontier with noise; the mitigation is a high similarity threshold for seed selection, passing only the top 3–5 seeds to traversal.
  2. Loops. The agent can revisit a node via different paths; the state tracks visited IDs and the rewrite prompt forbids returning to explored nodes.
  3. Cost growth. Running all 4 rounds means up to 5 LLM calls per query; the early exit at ≥ 0.80 after round 2 is the primary cost control. Concrete dollar figures depend on token usage and are not claimed here.
  4. Abstention bias. The 0.80 bar is conservative; on genuinely ambiguous queries confidence may never reach it, and the agent abstains. That is the right trade for high-stakes recommendations, less so for casual lookups.

Numbered Limitations

  1. Prompt dependency. Rewrite and scoring quality are determined by the prompt, so it must be tested on a held-out evaluation set before it is trusted.
  2. Graph quality. The agent optimises over the given graph and cannot infer missing edges; S-Path-RAG-style semantic shortest-path retrieval (Fu et al., 2026) is one way to bridge gaps, but it is not implemented here.
  3. Scalability ceiling. At most 48 nodes are examined per query (4 rounds × 12), a tiny fraction of a large graph; very long ownership chains exhaust the round budget.
  4. No persistent memory. Unlike the bi-temporal graph memory of Engram (Wang, 2026, arXiv:2606.09900), this design forgets between queries and cannot reuse prior exploration — the subject of article #4.
  5. Scalar evaluation. The 0.80 bar is one number; ProGraph-R1's step-level signal is richer but cannot be applied retroactively to a frozen model.

Decision Table: GraphRAG vs Agentic Traversal vs Hybrid

| Scenario | Recommended approach | Why |
| --- | --- | --- |
| High-throughput single-hop fact lookup | GraphRAG community summary | one retrieval is enough; the 2–5× multiplier is wasted |
| Multi-hop but predictable enterprise queries | Hybrid planning loop (this design) | planning without training; abstention prevents hallucination |
| Frequently-changing graph, accuracy ceiling matters | RL-trained traversal (AgentGL / GraphDancer) | a learned policy adapts the frontier dynamically |
| Regulated, high-stakes recommendations | Hybrid with the 0.80 abstain gate | abstain-under-uncertainty is safer than a confident guess |

For the fleet's lead-and-account questions — multi-hop, predictable, explanation-critical — the hybrid planning loop is the right default.

Closing

GraphRAG was the necessary foundation: it proved structured knowledge could be retrieved at generation time. The 2026 corpus — GraphSearch, S-Path-RAG, GraphDancer, GraphWalker, AgentGL, GraphScout, TKG-Thinker, ProGraph-R1 — pushes past it by treating traversal as a sequential decision process. The design here is the deployable compromise: it brings planning to GraphRAG without RL training, and it abstains rather than hallucinate when the path is unclear. The graph stops being an index and becomes something the agent reasons over, one node at a time.

Frequently Asked Questions

What is agentic GraphRAG? It replaces one-shot subgraph retrieval with a planning agent that traverses the graph step by step — deciding which node to expand next, evolving the query each round, and fusing graph traversal with vector recall — until it can answer or must abstain.

Why is multi-hop GraphRAG a planning problem, not a retrieval problem? A multi-hop question requires sequential choices: which edge to follow, which node to expand, when to backtrack. A one-shot pass cannot make those choices; a planning agent makes each hop conditioned on what the previous hop found.

How does the agent avoid retrieving noise? Each round expands a frontier capped at 12 nodes, and a fuser keeps a candidate only if its combined graph-plus-vector score clears 0.70. The query is rewritten each round and explored nodes cannot be revisited.

When does the agent abstain? An answer is emitted only if confidence reaches the 0.80 eval bar within 4 rounds; otherwise it abstains rather than guess — the right default for high-stakes recommendations.

Where do RL-trained graph agents fit? The 2026 RL papers learn traversal policies that adjust the frontier dynamically and recover from dead ends. This design approximates that with fixed parameters and no training; fine-tuning on traversal trajectories is the deferred next step.

Autonomous Knowledge Graphs — the series

  1. Autonomous Knowledge Graph Construction: Graphs That Build Themselves (autonomy: high)
  2. Reasoning Over the Graph: From GraphRAG to Planning Agents (this article — autonomy: high)
  3. Self-Healing Knowledge Graphs: Graphs That Fix Themselves (guardrail)
  4. The Graph as Agent Memory (autonomy: medium)
  5. Closing the Loop: Evaluation, Debate, and Discovery (guardrail)

A companion thread to The Autonomous Sales Fleet. Next: #3 Self-Healing Knowledge Graphs.

References

  • Jiajin Liu et al. GraphSearch: Agentic Search-Augmented Reasoning for Zero-Shot Graph Learning. 2026. arXiv:2601.08621. https://arxiv.org/abs/2601.08621
  • Jinyoung Park et al. ProGraph-R1: Progress-aware Reinforcement Learning for Graph Retrieval Augmented Generation. 2026. arXiv:2601.17755. https://arxiv.org/abs/2601.17755
  • Yuyang Bai et al. GraphDancer: Training LLMs to Explore and Reason over Graphs via Two-Stage Curriculum Post-Training. 2026. arXiv:2602.02518. https://arxiv.org/abs/2602.02518
  • Zihao Jiang et al. TKG-Thinker: Towards Dynamic Reasoning over Temporal Knowledge Graphs via Agentic Reinforcement Learning. 2026. arXiv:2602.05818. https://arxiv.org/abs/2602.05818
  • Yuchen Ying et al. GraphScout: Empowering Large Language Models with Intrinsic Exploration Ability for Agentic Graph Reasoning. 2026. arXiv:2603.01410. https://arxiv.org/abs/2603.01410
  • Rong Fu et al. S-Path-RAG: Semantic-Aware Shortest-Path Retrieval Augmented Generation for Multi-Hop Knowledge Graph Question Answering. 2026. arXiv:2603.23512. https://arxiv.org/abs/2603.23512
  • Shuwen Xu et al. GraphWalker: Agentic Knowledge Graph Question Answering via Synthetic Trajectory Curriculum. 2026. arXiv:2603.28533. https://arxiv.org/abs/2603.28533
  • Yuanfu Sun et al. AgentGL: Towards Agentic Graph Learning with LLMs via Reinforcement Learning. 2026. arXiv:2604.05846. https://arxiv.org/abs/2604.05846
  • Liuyin Wang. Less Context, More Accuracy: A Bi-Temporal Memory Engine for LLM Agents (Engram). 2026. arXiv:2606.09900. https://arxiv.org/abs/2606.09900