Skip to main content

NL-to-SQL CRM Analytics on Cloudflare D1 + Self-Healing

· 23 min read
Vadim Nicolai
Senior Software Engineer

A sales operator types "how many fintech contacts replied last week?" and gets an answer. No one writes SQL. This is NL-to-SQL CRM analytics on Cloudflare D1: the text_to_sql graph translates the question, runs it on D1, and — when the query fails — heals itself from the database's own error message. That last move is the load-bearing idea behind the self-healing loop: the database is not a passive recipient of your SQL. It is the most honest verifier you have.

That inversion drives Evaluating Open-Source LLM Agents for SQL Generation and Structured Analytics on Relational Databases, by Borovčak, Bagić Babac, and Mornar in Computers, Materials & Continua (2026). You do not demand a perfect one-shot translation. You let the query run, read the error, and regenerate against that diagnostic. The error text is the repair signal. Execution accuracy, not string overlap, is the metric that counts. The 7 numbered findings below are the evidence, and they map onto a 7-node production graph.

This is article #5 of a 10-part series, "The Autonomous Sales Fleet" — one production LangGraph + DeepSeek + Cloudflare-D1 + LangSmith system. Each part realizes one 2026 paper as one real graph. This one is the text_to_sql graph in backend/graphs/text_to_sql_graph.py, one of 39 registered in the fleet. It answers questions over the 4 CRM tables in the Cloudflare D1 database lead-gen-jobs. It generates a SELECT, validates it against a hard read-only gate, runs it, and repairs its own failures up to 2 times. No write path is ever reachable.

On the fleet's autonomy ladder this capability sits medium. It fully automates the plan→act span for a read-only analytics question. The graph translates intent to SQL, runs it, and heals its own failures with no human writing a query. The database's SELECT-only gate is what lets it act unattended. The operator reading the 1-to-2 sentence summary is the verify step. It earns that autonomy because the action space is structurally incapable of mutating data. A write-capable version would drop back down the ladder, behind human approval.

Two siblings frame this one. Article #1, Reason→Decompose→Act→Verify — an Autonomous CRM Orchestrator on LangGraph, reasons over signals and dispatches worker graphs. This graph answers the operator's question about the pipeline itself. Article #9, Evidence-Driven Release Gates for LLM Sales Agents, is the eval harness. It holds every prompt path here to the fleet's ≥0.80 bar before a change ships.

Why NL-to-SQL for CRM Analytics Matters: What the Anchor Paper Found

  1. The protagonist paper carries the weight of this design. Evaluating Open-Source LLM Agents for SQL Generation and Structured Analytics on Relational Databases (Borovčak, Bagić Babac, and Mornar, 2026) evaluates 4 open-source foundation models — Mistral, Devstral, Qwen2.5-Coder, and Qwen3. The task is turning a natural-language request into an executable query for structured analytics. It tests on a custom analytics suite and the canonical Spider benchmark. Its central reframing: NL-to-SQL is not a 1-shot translation but an agentic loop in which the database is the verifier and its error output is the correction signal (DOI 10.32604/cmc.2026.078330).

  2. The benchmark grounds that evaluation, and it is what makes "execution accuracy" a meaningful number rather than a vibe. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task (Yu et al., 2018) introduced 10,181 questions and 5,693 unique complex SQL queries across 200 databases spanning 138 domains. The split is deliberate: a model is scored on databases it never saw in training. That cross-domain design is why a strong Spider result transfers to a private CRM schema at all. The model is rewarded for generalizing query structure, not for memorizing 1 warehouse. The agentic-sales graph inherits that assumption. It is handed the real companies / contacts / email_campaigns / emails surface — 4 tables — and asked to generalize join-and-aggregate shape onto it.

The practical takeaway is the one this graph is built on. Smaller open-source coder models reach usable execution accuracy on analytics queries when wrapped in an iterative, error-grounded loop, rather than asked to 1-shot the SQL. The fleet honors that finding but collapses the model choice to 1 provider. Every LLM call goes through make_llm() — DeepSeek via 1 Cloudflare AI Gateway — so the "open-source agent" insight becomes a DeepSeek-only repair loop. The paper publishes no per-query latency or cost figures for edge databases, and this article invents none. It establishes that the loop works on real relational databases. That is the foundation the implementation stands on.

Designing the Natural Language to SQL Translation Layer

The anchor paper does not stand alone. A clear line of self-correction research arrives at the same conclusion. Two papers with hard numbers justify the DeepSeek-only loop, so they are worth pinning down before the implementation.

  1. The first reframes "you need a bigger model" as "you need a correction loop." SelECT-SQL: Self-correcting ensemble Chain-of-Thought for Text-to-SQL (Shen and Kejriwal, 2024) combines chain-of-thought prompting, self-correction, and ensembling. It reports 84.2% execution accuracy on the Spider development set using GPT-3.5-Turbo. That beats the competing GPT-3.5 solution at 81.1% and the peak GPT-4 result at 83.5% on the same leaderboard. The headline is the ordering: a smaller model with a correction loop beats a larger model without one. That 3-number ordering — 84.2 over 83.5 over 81.1 — is the economic argument the agentic-sales graph leans on. It pairs 1 DeepSeek model with a bounded repair loop instead of reaching for a frontier model on every query.

  2. The second pins down the shape of the loop. SQL-of-Thought: Multi-agentic Text-to-SQL with Guided Error Correction (Chaturvedi, Chadha, and Bindschaedler, 2025) decomposes the task into 5 stages: schema linking, subproblem identification, query plan generation, SQL generation, and a guided correction loop. It reports state-of-the-art results across the Spider family, showing again that a 5-stage guided loop beats a 1-shot prompt. Its key move is taxonomy-guided dynamic error modification rather than execution-only correction. The correction is guided by a rule layer, not left to free-form retry. The agentic-sales graph implements that division. The SELECT-only gate is the rule layer. Its rejection reason — not a raw exception — feeds the repair node, so a correction is grounded in a verdict, not a guess.

Setting Up Cloudflare D1 and the Graph, End to End

Loading diagram…

The graph's default contract is small and, crucially, unchanged from before this capability landed. With no execute flag set, the flow is a deterministic LangGraph StateGraph: understand_questionidentify_tablesgenerate_sqlvalidate_sqlEND. It produces {sql, explanation, confidence, tables_used}. Every node that touches the database — execute_sql, repair_sql, summarize — lives only on the opt-in execute=True branch. That separation is the safety contract for existing callers. The byte-identical default path means the /api/text-to-sql endpoint that previously only generated SQL is untouched. The analytics path that runs it is purely additive.

understand_question restates the user's text as a 1-sentence intent. identify_tables is where grounding starts. Rather than let the model invent table names, the node is seeded with the actual CRM surface: companies, contacts, email_campaigns, emails. It picks among those 4 real tables, or a correct subset, instead of guessing. generate_sql then emits a single SQLite SELECT. Its system prompt constrains output to read-only, explicit column lists, and SQLite-only syntax — strftime/julianday for dates, || for concatenation, no ::casts or ILIKE. D1 is SQLite, not PostgreSQL. The output is {sql, explanation, confidence}, and confidence rides through to the final answer as provenance.

The SELECT-Only Gate: Two Layers, One Hard Backstop

validate_sql is the load-bearing guardrail, and it has 2 layers. Layer 1 is a leading-head check. After stripping any leading (, the lowercased query must begin with select or with — a SELECT-bearing CTE. Anything else is rejected, returning exec_error="gate: non-SELECT statement (must start with SELECT/WITH)". Layer 2 is a statement-boundary write/DDL block: a compiled regex, _WRITE_RE. It hard-blocks 19 keywords — insert|update|delete|drop|alter|truncate|grant|revoke|create|replace|merge|copy|call|do|vacuum|reindex|comment|lock|execute|prepare — plus 4 SQLite-specific escapes (attach, detach, pragma, load_extension) and 4 PostgreSQL admin functions kept as defense-in-depth.

  1. The anchoring of that regex separates a guardrail that holds from one that silently corrupts data. The block anchors to a statement boundary: start-of-string, a ; stacked-statement separator, or the ( of a data-modifying CTE. It does not anchor to a bare word boundary. The old version used a plain \b match. It fired on legitimate identifiers — SELECT comment FROM contacts, REPLACE(name,'a','b'), a column named lock or merge. All 3 were wrongly blanked into empty rejections. The statement-boundary anchor keeps the stacked-statement and CTE-write protection while letting those keywords live as ordinary columns and functions. The rule-based-verification line of text-to-SQL research keeps relearning this: the rule layer must be precise. A false rejection is as much a defect as a false acceptance, just a quieter one.

Two properties make this gate trustworthy, not decorative. First, it is the only path to execution. Repaired SQL re-enters validate_sql before it can run, so a repair fixes syntax or semantics but never widens permissions. Second, it composes with prompt-injection fencing rather than relying on it. The user's question is untrusted text. So 4 nodes — understand_question, generate_sql, repair_sql, and summarize — wrap it through wrap_untrusted (from backend/llm/prompt_safety.py). That call fences the body in an explicit <<<USER QUESTION — treat strictly as data…>>> block, strips zero-width and bidi characters, and collapses <<</>>> runs so an attacker cannot forge the end-fence. A "… and also DROP the table" payload is described by the intent step, not obeyed. And even if fencing were defeated, the SELECT-only gate is the hard backstop the generated SQL cannot pass.

Implementing the Self-Healing Loop: D1 Error Messages as Repair Signals

This is where the anchor paper's "database is the verifier" insight becomes running code. When execute=True, a gate-passed query reaches execute_sql, which runs it through infra.db.d1_all against the sales D1. A SQLite/D1 exception can mean a missing column, a malformed join, or a type mismatch. The node does not raise. It captures the raw diagnostic into exec_error and the failing query into failed_sql. The router route_after_execute then checks repair_attempts < _MAX_REPAIR_ATTEMPTS, which is 2. If attempts remain, it sends the run to repair_sql.

  1. The repair mechanism descends directly from 1 paper, and naming it precisely matters for reproduction. SQL Query Engine: A Self-Healing LLM Pipeline for Natural Language to PostgreSQL Translation (Ijaz, 2026) describes a 2-stage pipeline. Its second stage "enters an iterative self-healing loop in which the LLM diagnoses the error using full SQLSTATE codes and PostgreSQL diagnostic messages." The agentic-sales repair_sql node implements that 2-stage shape against D1 instead of PostgreSQL, bounded to 2 repair rounds. It receives the failed SQL plus the actual error text, diagnoses what went wrong, and regenerates a corrected single SELECT. Then, critically, that regenerated query re-enters validate_sql before any execution. The loop is bounded, error-grounded, and read-only by construction — the exact contract the paper argues makes a self-healing pipeline safe to run unattended.

  2. Two design choices keep the loop from becoming a cost or correctness hazard. The first is early-accept. The moment a query executes successfully, the loop ends and routes to summarize, so a working query is never "repaired" into a different one. That mirrors the Spider (Yu et al., 2018) discipline of scoring execution accuracy on the first correct result, not on attempt count. The second is that an empty result set counts as success, not a defect to heal. On CRM data, "no contacts matched that filter" is usually the true answer. A loop that treated 0 rows as failure would burn its 2-attempt budget rewriting a correct query. Gate rejections feed the same loop through a different door. route_after_validate sends a gate-rejected query to repair_sql while attempts remain, with the rejection reason itself ("gate: non-SELECT statement") as the repair signal. After 2 attempts are exhausted, the run returns the last error rather than looping forever. The bound, not a hope, is the circuit breaker — the same termination discipline article #8, Deadlock & Infinite-Loop Prevention in Multi-Agent Sales, applies fleet-wide.

Handling Edge Cases: Grounded Summaries and Deterministic Builders

The terminal node, summarize, turns the executed query's {rows, row_count} into 1 or 2 plain business sentences for an operator who does not read SQL — the structured-analytics layer the anchor paper describes. It is grounded only in the returned rows. The sample is bounded to 20 rows, and _MAX_ROWS=50 caps what is fetched. It never invents totals, percentages, or names absent from the data. It fails closed in 3 ways. An empty result yields "The query ran successfully but matched no records". No SQL or no rows yields "No query was executed, so there is nothing to report". An unavailable LLM yields the deterministic "The query returned N record(s)" — never a fabricated figure. Every answer carries 4 provenance fields: confidence, reason (the explanation), source (the tables_used), and evidence (the executed SQL). That is the Grounding-First pattern the whole fleet shares.

  1. Some analytics questions should never be left to a model. For those, the same module ships deterministic builders that emit fixed SELECTs and compute results with plain arithmetic. The funnel builder, build_funnel_queries(vertical), emits 6 SELECT COUNT(*) queries for the conversion stages discovered → enriched → contacted → opened → replied → converted. Each is scoped to a sanitized vertical literal, and each passes through validate_sql like an ad-hoc read. compute_funnel_report then derives stage-to-stage conversion rates purely from those 6 counts — no LLM, no fabrication. The attribution builders, build_touch_history_query(contact_id) and attribute_touches, read a contact's ordered touch history through the same SELECT-only path and distribute credit across touches. When the LLM is unavailable, they fall back to a deterministic, still-provenanced last-touch model rather than inventing weights. The split is the decision framework. If the query is structurally known, use a builder. If it is genuinely ad-hoc, use the self-healing loop. This echoes the text-to-SQL literature's recurring finding: a hybrid of guided computation and free-form generation outperforms either alone.

Performance, Cost, and Production: Registry, Observability, and the Eval Gate

The graph is 1 row in the fleet's single source of truth, backend/infra/registry.py: GraphSpec("text_to_sql", "graphs.text_to_sql_graph"). It takes resumable=False because each run gets a random-UUID thread, is idempotent, and has nothing to resume. Both runtimes read identity from that 1 registry — the local langgraph dev server on port 8002 and the FastAPI/Cloudflare app. So the local and deployed graphs are the same graph, not 2 drifting copies. Adding or changing a graph is a single edit there. That is what keeps a 39-graph fleet auditable.

The cost envelope is bounded by 3 hard limits, not a billing surprise: at most 2 repair attempts per query, at most 50 rows fetched, and a 20-row sample handed to the summary model. A query therefore costs at most 4 LLM calls — intent, generation, and up to 2 repairs — plus 1 summary call, for a worst case of 5. The 0.80 eval bar gates any prompt change that would raise that count without earning its keep.

Observability is the audit trail that makes the analytics path trustworthy after the fact. Per run, the spec emits 4 metrics to LangSmith: agentic_sales.text_to_sql.tables_used, agentic_sales.text_to_sql.confidence, agentic_sales.text_to_sql.row_count, and agentic_sales.text_to_sql.repair_attempts. The last is the most operationally useful. Filtering for runs where repair_attempts > 0 surfaces the schema mismatches and ambiguous phrasings the self-healing loop quietly resolved. That is the signal a schema owner uses to add a synonym column or a view — and the same per-run telemetry article #10, Detecting Agent Defects & Drift in Production, mines for regression across the fleet. And every prompt path is held to the fleet's ≥0.80 evaluation bar on LangSmith golden datasets — the gate built in article #9, Evidence-Driven Release Gates for LLM Sales Agents. A prompt change that drops summary or repair quality below 0.80 simply does not ship.

Limitations and Honest Scope

This design is not a universal answer, and it is worth being clear about where it stops. The self-healing loop fixes execution-time errors — a bad column name, a malformed join, a type mismatch the database rejects. It cannot fix a query that runs cleanly and answers the wrong question. A SELECT that joins the wrong table returns rows, counts as success, and is summarized confidently. Only the operator reading the answer can catch that semantic miss. The 0.80 eval bar guards prompt quality, not the truth of any single answer.

Two more boundaries matter. First, the benchmarks cited here are public-dataset figures — Spider, SelECT-SQL at 84.2%, SQL-of-Thought's state-of-the-art numbers. They are not measured on this private CRM schema. They justify the loop's shape, not a specific accuracy number for lead-gen-jobs, and this article reports no such number because none has been published. Second, D1 is SQLite. Window functions and full-text search that a Postgres-backed CRM might lean on are not available. Queries that need them fall to the deterministic builders or cannot be expressed. The honest framing: this graph makes NL-to-SQL safe and self-correcting, not infallible.

Practical Takeaways

This architecture is not hypothetical. It runs today on Cloudflare D1 as part of a production autonomous sales fleet. The 6 principles that make it hold up generalize beyond this stack:

  1. Always gate your SQL output, and anchor the gate to statement boundaries. A 2-layer defense is non-negotiable: a leading-head parse plus a statement-boundary keyword block. Anchor the keyword layer to string and statement boundaries, not word boundaries. Otherwise you will silently blank legitimate reads.
  2. Let the database verify. Execution errors carry more information than any static validator. They reflect the real schema at the real moment. Feed the diagnostic back to the model for repair, as the anchor paper and SQL Query Engine: A Self-Healing LLM Pipeline for Natural Language to PostgreSQL Translation both argue.
  3. Bound the loop and accept early. 2 repair attempts is a sober default. The first successful execution ends the loop. An empty result set is a success, not a defect to heal.
  4. Ground the schema. Seed the model with real table names; for a dynamic schema, read PRAGMA table_info at graph start rather than letting the model guess.
  5. Separate ad-hoc from known analytics. Use deterministic builders for funnel reports and attribution; reserve the self-healing loop for genuinely ad-hoc questions.
  6. Provenance every output and eval-gate every prompt change. Carry confidence, reason, source, and evidence on every answer, and hold each prompt path to a ≥0.80 golden-dataset bar before it ships.

The database is the verifier. All 5 papers — Borovčak, Bagić Babac, and Mornar (2026), Yu et al. (2018), Shen and Kejriwal (2024), Chaturvedi et al. (2025), and Ijaz (2026) — converge on 1 reality: NL-to-SQL is unreliable by default and reliable by design. Build in validation, healing, and grounded summarization, and a 7-node graph answers an operator's question safely behind a 2-attempt repair loop. No one writes a line of SQL. Treat the database like the honest debugger it is.

Frequently Asked Questions

What is a self-healing loop in NL-to-SQL? It is an automated feedback loop. A failed query's database error becomes the repair signal. The model diagnoses that error and regenerates a corrected query, bounded here to two attempts.

Does Cloudflare D1 support the SQL that CRM analytics needs? D1 uses SQLite semantics. It supports joins, aggregations, and subqueries — enough for the funnel and attribution queries here. The graph emits SQLite-only syntax, never PostgreSQL casts or ILIKE.

How does the system prevent a destructive query? A two-layer SELECT-only gate. The query must start with SELECT or WITH, and a statement-boundary regex blocks every write or DDL keyword. Repaired queries re-enter the same gate, so a repair can never widen permissions.

Can the same pattern run on Postgres or MySQL? The gate and repair loop generalize, but the SQL dialect and the D1 transport (infra.db.d1_all) are D1-specific. The self-healing pattern itself is database-agnostic.

The Autonomous Sales Fleet — full series

This is Part 5 of 10 in a series on building one production autonomous-agentic-sales system on LangGraph + DeepSeek + Cloudflare D1, where each part adds one capability that moves the fleet up the autonomy ladder — from human-triggered assistants to self-directed plan→act→verify loops, gated by autonomy guardrails. The arc runs orchestration → enablement & analytics → campaign strategy → reliability & evaluation.

Orchestration

  1. Autonomous CRM Orchestrator (reason→decompose→act→verify)autonomy: high
  2. Multi-Step Lead Qualificationhigh
  3. Lead-to-Proposal Multi-Agent Pipelinehigh
  4. Hierarchical Coach→Worker Delegationhigh

Enablement & analytics 4. Sales-Enablement Copilot: Deal Coaching & Objection Handlingmedium 5. NL-to-SQL CRM Analytics over Cloudflare D1medium

Campaign strategy 6. Design-Thinking Expert Panels for Campaign Strategymedium

Reliability & evaluation — the autonomy guardrails 8. Deadlock & Infinite-Loop Preventionguardrail 9. Evidence-Driven Release Gates (PROMOTE/HOLD/ROLLBACK)guardrail 10. Detecting Agent Defects & Drift in Productionguardrail

References