Lead Qualification Sequence: Chatbot to Sales Agent

June 24, 2026 · 26 min read

Vadim Nicolai

Senior Software Engineer

From Scripted Chatbot to Multi-Step Sales Agent: How to Build a Lead Qualification Sequence That Works

A multi-step lead qualification agent earns its autonomy by sequencing work no human queued: it decomposes an inbound signal into an ordered plan, grades each step against real data, and stops at a human-approval interrupt before anything ships. That is the line between a scripted chatbot and an agent: not a newer model or a sharper prompt, but a decision about who gets to sequence work. A chatbot automates a single turn; an agent automates the workflow that turn belongs to. On the fleet's autonomy ladder this capability sits high: it takes over the human planning step for an inbound lead (deciding which qualification and analysis tasks to run, and in what order) while every action stays a draft held for human verification.

The autonomy guard here is conservative by construction. The agent never sends; it composes, and the message is held as a pending draft behind a confirm-before-mutate interrupt, with a deterministic safety veto sitting upstream of the planner so a hostile or malformed plan can never reach a suppressed contact. That is the posture this article builds: reasoning is delegated, action is gated. Article #1's orchestrator dispatches into this qualifier; this is where the fleet first replaces a rep's "is this lead worth my time, and what do I do next?" judgement with a graded, auditable, draft-first sequence.

This is article #2 in The Autonomous Sales Fleet, a connected series describing one production agentic-sales system where each piece adds exactly one capability. The fleet shares a single architecture: a control plane of LangGraph StateGraphs, a data plane on Cloudflare (D1, Workers, Queues), and an observability plane of LangSmith tracing with per-graph golden datasets. Every LLM call exits through one DeepSeek endpoint behind a Cloudflare AI Gateway; no graph ships unless its golden dataset passes an eval gate; every persisted AI decision carries a four-field provenance record; and outreach is always draft-first, held for human approval. This article builds on The Autonomous CRM Orchestrator on LangGraph (#1) and connects forward to the Lead-to-Proposal Multi-Agent Pipeline (#3), which takes the qualified lead as a conceptual starting point.

The strongest evidence for constraining an agent the way this one does comes from AgentArch (Bogavelli, Sharma & Subramani, 2025), a benchmark of 18 agentic configurations across orchestration, prompt strategy, memory, and thinking-tool usage. It finds "significant model-specific architectural preferences" that break the one-size-fits-all assumption, with top models clearing only 35.3% of the complex enterprise task and 70.8% of the simpler one. When even the best configuration fails two of three hard tasks, an open-ended agent loop is a liability — and a closed, typed, narrow planner is the defensible bet. That is precisely the change this article walks through in a real email_orchestrator graph. Industry framing pieces such as Rai (2026) draw the same chatbot-versus-agent line conceptually; the engineering case rests on the indexed and canonical work cited below.

Why Scripted Chatbots Fail to Qualify Leads

Every implementation detail below — node names, scores, thresholds, the feature flag — is read directly from the production agentic-sales codebase (backend/graphs/email_orchestrator_graph.py, backend/graphs/score_contact_graph.py) and its AA04 specification. These values are first-party parameters of this deployment, tuned with the fleet's own eval-gated, grounded-pipeline discipline and LLM-as-judge harness; they are not figures borrowed from any cited paper, and I label them as such once here so the rest of the article can stay readable.

In the orchestrator built for this series — the same email_orchestrator LangGraph StateGraph documented in #1 — the scripted router lives in decide_action. It is an if/elif ladder with exactly four branches: reply if there is an unanswered inbound message; initial if no prior send exists (sequence_number == 0); skip with reason too_soon if the last send is younger than FOLLOWUP_DAYS = 3 days; and followup otherwise.
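A minimal sketch of that ladder, assuming a simplified state shape. The `ThreadState` class and its field names `has_unanswered_inbound` and `last_sent_at` are hypothetical stand-ins; only the branch order, `sequence_number == 0`, `FOLLOWUP_DAYS = 3`, and the `too_soon` reason come from the description above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

FOLLOWUP_DAYS = 3  # cooldown named in the article


@dataclass
class ThreadState:
    # Hypothetical field names; the production state shape is not shown here.
    has_unanswered_inbound: bool
    sequence_number: int
    last_sent_at: Optional[datetime]


def decide_action(state: ThreadState, now: datetime) -> dict:
    """Four-branch scripted router: reply, initial, skip (too_soon), followup."""
    cooldown = timedelta(days=FOLLOWUP_DAYS)
    if state.has_unanswered_inbound:
        return {"action": "reply"}
    elif state.sequence_number == 0:
        return {"action": "initial"}
    elif state.last_sent_at is not None and now - state.last_sent_at < cooldown:
        return {"action": "skip", "reason": "too_soon"}
    else:
        return {"action": "followup"}
```

Nothing in the ladder looks at who the lead is, which is the limitation the next paragraphs describe.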

That ladder is fast, deterministic, and idempotent, and it is incapable of reasoning about why a lead should advance. It collapses every lead into one of four buckets based on thread state and timing alone, ignoring whether the prospect replied with budget authority or a polite "not now." The shift the literature describes, from scripted responses to contextual understanding (Chellappan, 2024), is not a feature add; it requires a different architecture that maintains state and adapts across turns. In this orchestrator, decide_action is the ceiling: three conditional checks and a default branch, none of which can adapt to signal quality.

The practical cost is concrete. A lead with a known company but missing seniority lands in the same bucket as a lead with neither company nor role — both get followup after the cooldown. The ladder has no vocabulary for partial qualification, no memory of prior qualification attempts, no scoring layer, and no ability to branch on data quality. A scripted bot mimics understanding without tracking context; the orchestrator's scripted path does the same with routing.

The Multi-Step Lead Qualification Framework

Gated behind a feature flag, SALES_SUPPORT_SEQUENCER_REASONING, the agentic path inserts a reasoning sequencer, plan_tasks, between the recall_memory node and the deterministic routing fork. The whole orchestrator is a LangGraph StateGraph, so adding the sequencer is a matter of inserting a node and an edge. When the flag is off, the graph is byte-identical to the scripted baseline — a hard acceptance criterion in the AA04 spec. When the flag is on, plan_tasks asks DeepSeek (make_llm(tier="standard")) to emit a typed, ordered task sequence over a closed four-stage vocabulary, {qualify, analyze_opportunity, compose, skip} — the interleaved reason-then-act loop of ReAct (Yao et al., 2023) narrowed to a closed vocabulary, and exactly the kind of constrained, function-calling-style plan that AgentArch (Bogavelli et al., 2025) shows outperforms an open agent loop on enterprise reliability. The "self-taught tool use" line of Toolformer (Schick et al., 2023) is the same instinct one rung lower: decide which call to make and when, rather than narrate.
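The flag-gated insertion can be sketched without LangGraph itself, as a plain edge list. Only the node names `recall_memory`, `plan_tasks`, `decide_action`, and the flag name come from the article; the `build_edges` and `sequencer_enabled` helpers, and reading the flag as a truthy environment-variable string, are assumptions for illustration.

```python
import os

FLAG = "SALES_SUPPORT_SEQUENCER_REASONING"


def sequencer_enabled(env=None) -> bool:
    """Assumption: the flag is an environment variable holding a truthy string."""
    env = os.environ if env is None else env
    return env.get(FLAG, "").strip().lower() in {"1", "true", "on", "yes"}


def build_edges(enabled: bool) -> list:
    """Return the edges around the routing fork for the given flag state."""
    if not enabled:
        # Scripted baseline: memory recall feeds the deterministic router directly.
        return [("recall_memory", "decide_action")]
    # Agentic path: the sequencer sits between recall and the router.
    return [("recall_memory", "plan_tasks"), ("plan_tasks", "decide_action")]
```

Keeping the off-path free of any reference to `plan_tasks` is what makes a "flag off equals baseline" acceptance check easy to assert in a test.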

The prompt receives a _signal_bundle containing hydrated contact, company, and company_facts rows plus prior thread history. Untrusted inbound and scraped text is fenced through wrap_untrusted so a hostile prompt cannot rewrite the plan — the prompt-injection guardrail described in the OWASP LLM Top 10 (LLM01). Any stage outside the four-item vocabulary is dropped by _repair_task_plan before it can shape routing. That is a structural guardrail, not a soft warning: an out-of-vocabulary stage never survives the repair step.
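The repair step can be pictured as a filter over the closed vocabulary. This is a sketch under an assumed plan shape (a list of dicts with a `stage` key); the function name `repair_task_plan` is a stand-in for the production `_repair_task_plan`, whose real signature is not shown in this article.

```python
ALLOWED_STAGES = ("qualify", "analyze_opportunity", "compose", "skip")


def repair_task_plan(raw_plan: object) -> list:
    """Keep only well-formed tasks whose stage is in the closed vocabulary.

    Assumption: a plan is a list of {"stage": str, ...} dicts.
    The order of the surviving tasks is preserved.
    """
    if not isinstance(raw_plan, list):
        return []
    repaired = []
    for task in raw_plan:
        if not isinstance(task, dict):
            continue  # malformed entry, drop it
        stage = task.get("stage")
        if isinstance(stage, str) and stage in ALLOWED_STAGES:
            repaired.append(task)
    return repaired
```

For example, a plan of `qualify`, `send_now`, `compose` comes back as `qualify`, `compose`: the invented stage disappears rather than raising, so routing only ever sees known stages.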

Two graded decision functions ride alongside the planner, and here precision matters: they are functions whose verdicts travel in graph_meta, not graph dispatch nodes. This deterministic-grading stance is the direct answer to the consistency gap τ-bench measures (under 50% success, sub-25% pass^8). The qualify function produces a verdict from deterministic Cloudflare D1-hydrated signals with zero new LLM calls: a base of 0.4, plus 0.3 if the company is known, plus 0.3 if a role or seniority level is known. A score of 0.6 or above yields qualified; below it yields needs_review. The verdict is persisted as a Grounding-First record with four fields (confidence, reason, source, evidence), each back-linked to the D1 rows that produced it, and traced through LangSmith for the golden-dataset eval gate. The planner proposes the order of work; the grade comes from arithmetic over real rows.
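The qualify arithmetic is small enough to sketch in full. The increments and the 0.6 cut line are from the article; the row shapes, the key names (`id`, `role`, `seniority`), the evidence string format, the `source` value, and the choice to store the score in the `confidence` field are all assumptions made for illustration.

```python
from typing import Optional

QUALIFIED_CUT = 0.6


def qualify(contact: dict, company: Optional[dict]) -> dict:
    """Deterministic grade over already-hydrated rows; no LLM call."""
    score = 0.4  # base
    reasons = ["base"]
    evidence = []
    if company is not None and company.get("id") is not None:
        score += 0.3  # company is known
        reasons.append("company_known")
        evidence.append(f"companies:{company['id']}")
    if contact.get("role") or contact.get("seniority"):
        score += 0.3  # role or seniority is known
        reasons.append("role_or_seniority_known")
        evidence.append(f"contacts:{contact.get('id')}")
    score = round(score, 2)  # avoid float noise around the cut line
    return {
        "verdict": "qualified" if score >= QUALIFIED_CUT else "needs_review",
        "confidence": score,  # assumption: the score doubles as confidence
        "reason": "+".join(reasons),
        "source": "d1",  # hypothetical label
        "evidence": evidence,
    }
```

With these rules the reachable scores are 0.4, 0.7, and 1.0, so either known signal on its own is enough to cross the cut line.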

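[Diagram: compiled topology of the email_orchestrator graph, from safety_gate through plan_tasks and decide_action to the preview interrupt and compose]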

The diagram shows the real compiled topology: safety_gate runs four nodes upstream of plan_tasks and can short-circuit straight to END, the planner sits between recall_memory and the deterministic decide_action router, and every path that reaches compose first passes the preview confirm interrupt. (For clarity it folds the plan_actions/plan_roles planner-and-team nodes into the decide_action → preview edge; those are the AA01 and AA06 nodes that the merged graph wired in place of the spec's originally-planned routing.)

Mapping Lead Qualification to Sales Readiness Stages

The second decision function, analyze_opportunity, follows the same pattern as qualify: its confidence is 0.5 + 0.1 × fact_count (capped), sourced to the companies and company_facts tables. It summarizes the opportunity from the enrichment rows already hydrated into state, again carrying the four-field provenance record rather than a free-text claim. Both functions exist so the LLM can propose the order of work while the grades come from deterministic arithmetic over real rows — verdicts that travel in state and feed the eval gate, not nodes that fork the graph.
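As a worked example of that formula, the confidence can be computed as below. The article says the value is capped but does not state the cap; the default `cap=1.0` here is an assumption, as is the function name.

```python
def opportunity_confidence(fact_count: int, cap: float = 1.0) -> float:
    """0.5 + 0.1 * fact_count, clamped to an assumed cap of 1.0."""
    if fact_count < 0:
        raise ValueError("fact_count cannot be negative")
    return round(min(cap, 0.5 + 0.1 * fact_count), 2)
```

Under that assumed cap, a company with no facts grades 0.5, three facts grade 0.8, and anything from five facts upward saturates at 1.0.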

Sales-readiness tiering lives one graph over, in the fleet's sibling score_contact_graph — a separate, independently compiled scorer, not a downstream node the orchestrator hands off to (the two graphs share no edge or import). It is the fleet's scoring discipline made concrete: a four-term weighted composite per vertical. The terms are seniority (rule-based from a title and seniority table), role_fit (one DeepSeek inference against the vertical description), reachability (rule-based from an authority score and LinkedIn presence), and a propensity sub-score; the four weights are renormalized to sum to 1.0. Tier thresholds are A at 0.80 or above, B at 0.60 or above, C at 0.40 or above, and D below 0.40. If the DeepSeek role-fit call fails, the score source is honestly relabeled rule_based_fallback — a degraded score is never dressed up as model-grounded — and in batch mode equal scores break ties by ascending contact_id for determinism.
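The composite, the tier cut, and the tie-break can be sketched in a few lines. The term names and thresholds come from the paragraph above; the function names, the dict shapes, and the assumption that batches are ordered by descending score before the contact_id tie-break are illustrative only.

```python
TERMS = ("seniority", "role_fit", "reachability", "propensity")


def composite_score(terms: dict, weights: dict) -> float:
    """Weighted sum of the four terms with weights renormalized to sum to 1.0."""
    total = sum(weights[name] for name in TERMS)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(terms[name] * weights[name] / total for name in TERMS)


def tier(score: float) -> str:
    """A at 0.80 or above, B at 0.60 or above, C at 0.40 or above, else D."""
    if score >= 0.80:
        return "A"
    if score >= 0.60:
        return "B"
    if score >= 0.40:
        return "C"
    return "D"


def rank_batch(rows: list) -> list:
    """Assumed ordering: best score first, equal scores by ascending contact_id."""
    return sorted(rows, key=lambda row: (-row["score"], row["contact_id"]))
```

Renormalizing inside `composite_score` means a vertical can express weights in any convenient unit (for example 2, 1, 1, 1) without changing the 0 to 1 range the tier thresholds expect.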

This is the "sequencing work" the title promises. Within the orchestrator, the planner orders the qualification and analysis steps and grades them; the sibling scorer turns a contact's commercial readiness into one auditable number rather than a chat transcript a human must re-read. Each step is a typed stage; each decision carries evidence. The intelligence is in the structure — the four-stage vocabulary, the qualification cut line, the tier thresholds — not in any single prompt. That is the τ-bench (Yao et al., 2024) lesson stated structurally: even strong function-calling agents clear under half of real tool-agent-user tasks, so the reliability has to come from the rails, not the model.

Handoff Triggers: When to Escalate from Bot to Human Agent

The most common objection to autonomous sales agents is safety: what if the LLM hallucinates a plan that sends an aggressive follow-up to a prospect who already unsubscribed? This is the prompt-injection and over-action risk the OWASP LLM Top 10 warns about. The answer in this architecture is a deterministic veto that runs before the planner ever sees the signal. The safety_gate node is wired immediately after load_history and four nodes upstream of plan_tasks. It checks the central suppression list first, then the contact's do_not_contact, bounced, unsubscribed, and replied flags, and short-circuits to END on any hit. The cooldown rule, FOLLOWUP_DAYS = 3 days, is enforced the same deterministic way inside decide_action.
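A sketch of the veto's check order, assuming the central suppression list is a set of lowercased email addresses and the contact row exposes the four flags as booleans. Both of those shapes, and the returned dict layout, are assumptions; the check order and the flag names are from the paragraph above.

```python
VETO_FLAGS = ("do_not_contact", "bounced", "unsubscribed", "replied")


def safety_gate(contact: dict, suppressed_emails: set) -> dict:
    """Deterministic veto: runs before any planner call and involves no LLM."""
    email = str(contact.get("email", "")).strip().lower()
    # 1. The central suppression list is checked first.
    if email and email in suppressed_emails:
        return {"action": "skip", "reason": "suppressed", "next": "END"}
    # 2. Then the per-contact flags, in a fixed order.
    for flag in VETO_FLAGS:
        if contact.get(flag):
            return {"action": "skip", "reason": flag, "next": "END"}
    # No hit: the run may continue toward the planner.
    return {"next": "continue"}
```

Because the function never receives a task plan as input, there is nothing a plan could say that would change its result.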

The LLM plan may refine routing but may never override a suppression decision or the too_soon cooldown. The acceptance criterion in the spec is explicit: a contact flagged do_not_contact, suppressed, or too_soon yields action == "skip" regardless of any task plan the LLM proposes. The deterministic veto is a governance layer baked into the graph topology, not a post-hoc audit trail — a structural constraint that makes a whole class of failure modes impossible. The agent gets to reason; the safety gate gets the first and final word, because it runs first. The same instinct — bounding what a reasoning loop can do structurally rather than by prompt — drives the fleet's deadlock and infinite-loop prevention once multiple agents share a graph.

The other real autonomy boundary is the output itself. The orchestrator's compose node always produces a draft (status: "draft"), reached only after the preview confirm-before-mutate interrupt. There is no score-tier gate that promotes a lead from planning to sending inside this graph; the conservative default is that everything is held for a human. Provenance is the third governance layer, and it is exactly the four-lifecycle-phase data-governance discipline (Pahune et al., 2025) made executable: every AI decision carries confidence, reason, source, and evidence, and a claim with no D1 row behind it does not get written. That is the only way to survive an audit when a lead complains that an agent fabricated a reason for skipping them — the reason field points at a row or a deterministic rule, so the audit trail is machine-verifiable rather than a human-readable guess.
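One way to make "no evidence, no write" executable is a guard that runs before persistence. The four field names are from the article; the `Provenance` class, the `check_writable` helper, and the specific validation rules are a hypothetical sketch rather than the production schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    confidence: float
    reason: str
    source: str
    evidence: tuple  # references to the rows or rules behind the decision


def check_writable(record: Provenance) -> Provenance:
    """Return the record unchanged if it may be persisted, else raise."""
    if not 0.0 <= record.confidence <= 1.0:
        raise ValueError("confidence must be within [0, 1]")
    if not record.reason.strip() or not record.source.strip():
        raise ValueError("reason and source are required")
    if len(record.evidence) == 0:
        raise ValueError("refusing to write a decision with no evidence")
    return record
```

Calling `check_writable` at the single point where decisions are persisted keeps the rule structural: a record without evidence fails loudly instead of being stored with an empty field.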

The multi-step agent is not a free win. The reasoning path adds one DeepSeek call per run that the four-branch decide_action ladder does not need, which costs more latency and more tokens. For a pure timing-driven nurture flow, the scripted ladder wins: it is cheaper and fully deterministic.

The qualification scoring is also coarse by design. A 0.4 base plus two 0.3 increments yields exactly three distinct scores (0.4, 0.7, and 1.0), and only the bare base falls below the 0.6 cut line. That coarseness is a feature for auditability and a ceiling on nuance versus a trained model. The scope is narrow too: this is one B2B sales-support orchestrator on one DeepSeek endpoint over Cloudflare D1, not a verdict on every chatbot-to-agent migration. A multi-step sequence reduces unqualified handoffs; it does not eliminate them.

The Research Backbone: The Papers That Frame the Shift

The production graph is one realization of a small but converging research conversation. Six sources frame the chatbot-to-agent shift from complementary angles, each mapping onto a concrete part of the orchestrator — with the deterministic scoring stance grounded directly in the code rather than in any of them.

AgentArch: A Comprehensive Benchmark to Evaluate Agent Architectures in Enterprise (Bogavelli, Sharma & Subramani, 2025) is the architectural spine. It benchmarks 18 agentic configurations across orchestration approach, prompt strategy (ReAct versus function calling), memory design, and thinking-tool usage, and finds "significant model-specific architectural preferences" — with top models clearing only 35.3% of the complex task and 70.8% of the simpler one. That is the empirical case for constraining the planner the way this orchestrator does: a closed four-stage vocabulary and a single function-calling-style typed plan, rather than an open ReAct loop, is itself an architectural bet that AgentArch's results support for enterprise reliability.
ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2023) supplies the planner paradigm. Its interleaved reason-then-act loop — alternating a reasoning trace with an action across multiple steps — is exactly what plan_tasks does: it reasons over the signal bundle, then emits a typed, ordered task plan before any single action fires. The orchestrator narrows that open-ended loop to a closed vocabulary of exactly 4 stages and grounds each step in code-derived arithmetic rather than free generation, so the agent's freedom is in ordering the 4 stages, never in inventing a 5th. That single architectural narrowing is what makes the loop auditable: a plan is a permutation of a known set, not an open transcript.
Toolformer: Language Models Can Teach Themselves to Use Tools (Schick et al., 2023) frames the function-calling instinct underneath the plan. Its core finding is that a model given only a handful of demonstrations per API can learn which of several tools to call, when, and with which arguments, across calculators, search, and Q&A. A typed, closed-vocabulary plan formalizes the same idea: the model commits to 1 of the 4 named stages rather than narrating, and _repair_task_plan drops any stage outside that set of 4 before it can shape routing. The contract is the point: a named call beats a paragraph of intent every time you need to dispatch on it.
τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains (Yao et al., 2024) supplies the reliability reality check. It shows that even state-of-the-art function-calling agents like gpt-4o succeed on under 50% of real tool-agent-user tasks, and stay below 25% on the pass^8 consistency measure in retail — meaning the same agent given the same task 8 times rarely passes all 8. That under-50% ceiling, paired with AgentArch's 35.3% on complex enterprise tasks, is the direct argument for moving the reliability into deterministic rails — the safety veto, the typed 4-stage vocabulary, the 4-field provenance record — rather than trusting the model to be consistent on its own.
From Scripted Responses to Contextual Understanding (Chellappan, 2024, SSRN working paper) maps the evolution from scripted to contextual dialogue. Its emphasis on turn-by-turn adaptation aligns with plan_tasks, which re-derives the task sequence from current signals on each run. But the adaptation here is bounded, not open-ended: it operates over a vocabulary of exactly 4 stages, gated against 1 of those 4 outcomes, so "contextual understanding" means a typed, constrained plan rather than free-form chat. The paper's own framing — that current chatbots "mimic" rather than understand — is exactly why this orchestrator refuses to let the model's prose stand as a decision: the qualify verdict is base 0.4 plus 2 separate 0.3 increments, not a sentence the model wrote, so the only thing the contextual layer is trusted to do is order the work.
The Importance of AI Data Governance in Large Language Models (Pahune et al., 2025, Big Data and Cognitive Computing) supplies the governance frame, spanning 4 lifecycle phases — development, validation, deployment, and operations. It maps directly onto the orchestrator's Grounding-First provenance record and the deterministic safety gate. Governance here is not documentation written after the fact; it is the 4-field decision record (confidence, reason, source, evidence) and the structural veto that together make the agent's reasoning auditable and its unsafe actions impossible across all 4 of those phases. No graph in the fleet ships until its golden dataset clears the eval gate, so governance and quality are enforced by 1 threshold rather than 2 separate processes — the same number that gates a release also gates every scoring change.
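For readers unfamiliar with the pass^8 figure quoted from τ-bench above, the metric can be estimated as below. The sketch uses the combinatorial estimator from the τ-bench paper, C(c, k) / C(n, k) averaged over tasks, where c is the number of successful trials out of n for a task; the function name and the input shape are assumptions.

```python
from math import comb


def pass_hat_k(successes_per_task: list, n: int, k: int) -> float:
    """Estimate pass^k: the chance that k independent trials of a task all pass.

    successes_per_task[i] is the number of successful trials (out of n) on task i.
    """
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if not successes_per_task:
        raise ValueError("need at least one task")
    per_task = []
    for c in successes_per_task:
        if not 0 <= c <= n:
            raise ValueError("success count must be between 0 and n")
        # comb(c, k) is 0 whenever c < k, so one failure in n = k trials gives 0.
        per_task.append(comb(c, k) / comb(n, k))
    return sum(per_task) / len(per_task)
```

With n = k = 8, a task that passes 7 of 8 trials contributes nothing, which is why pass^8 is so much harsher than a plain success rate and why the article leans on deterministic rails for consistency.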

(Industry trade pieces — Rai (2026) and Patel (2026), both in the non-indexed IJAIBDCMS — draw the same conceptual chatbot-versus-agent and real-time-data lines, but the engineering claims above rest on the indexed and canonical sources, not on them.)

Building Your Sequence: Step-by-Step Implementation

The evidence — from the published papers and from the production graph — supports a clear decision framework for when to deploy a multi-step agent versus a scripted chatbot.

Use a scripted chatbot (the decide_action ladder) when:

The qualification path turns on timing alone — reply speed and follow-up cadence — and the FOLLOWUP_DAYS = 3 rule is sufficient.
The cost of a misrouted lead is effectively zero, as in a mass nurture campaign that treats all leads identically.
The team has no observability tooling to debug plan failures. Without LangSmith traces, a failed plan_tasks call is a black box.

Use a multi-step agent (the plan_tasks sequencer) when:

Qualification requires combining data from more than two sources — contact, company, and enrichment — so the _signal_bundle references several Cloudflare D1 tables at once.
The lead base contains high-value contacts that must not receive the wrong message; the safety gate is only meaningful if there are non-suppressed contacts to protect.
The organization has a governance function that can review plan traces and curate the golden datasets behind the eval gate.

Never skip the deterministic veto. If your agent can override do_not_contact, you are not building an agent; you are building a liability. The veto must be structurally enforced, not a prompt instruction. The production graph wires safety_gate before the planner so the LLM plan is never produced for a suppressed contact in the first place.

Log every plan, even aborted ones. The _repair_task_plan output and the safety_gate decisions should feed directly into LangSmith for golden-dataset curation. The eval gate is only meaningful if the traces cover both sides: what the planner proposed, including plans later overridden by the too_soon cooldown, and which contacts the safety gate stopped before any plan was produced.

Measuring Qualification Accuracy: Key Metrics to Track

A multi-step qualifier is only as trustworthy as the metrics you watch. Track lead-to-opportunity conversion rate, time-to-qualification (inbound to score availability), handoff acceptance rate (sales accepts the lead), and false-positive rate (leads handed to sales that were not ready to buy). The orchestrator instruments these through LangSmith traces, and the same golden-dataset threshold that gates a graph release is the floor for accepting a new scoring change into production. The fleet formalizes that floor as a release decision in Evidence-Driven Release Gates for LLM Sales Agents.

Two design choices make these metrics meaningful rather than vanity numbers. First, because the scripted decide_action path and the agentic plan_tasks path share the same orchestrator and differ only by one feature flag, you can A/B the two on the same lead population and compare conversion and false-positive rates. Second, because every decision carries its four-field provenance, a false positive is debuggable down to the exact qualify verdict (the base plus increments) and the tier cut that let a contact through in the sibling scorer — you are measuring a graded decision, not a black box. No sequence eliminates unqualified handoffs entirely; it reduces them, and the metrics tell you by how much.
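A flag-split comparison of two of those metrics might look like the sketch below. The lead fields `handed_off`, `converted`, and `ready_to_buy` are hypothetical labels for outcomes you would have to record yourself; the rates follow the definitions given above, with false positives counted among handed-off leads only.

```python
def arm_metrics(leads: list) -> dict:
    """Conversion and false-positive rates for one arm of the A/B split.

    Assumed lead fields (hypothetical): handed_off, converted, ready_to_buy.
    """
    total = len(leads)
    handed = [lead for lead in leads if lead["handed_off"]]
    converted = sum(1 for lead in leads if lead["converted"])
    false_pos = sum(1 for lead in handed if not lead["ready_to_buy"])
    return {
        "conversion_rate": converted / total if total else 0.0,
        "false_positive_rate": false_pos / len(handed) if handed else 0.0,
    }


def compare_arms(scripted: list, agentic: list) -> dict:
    """Agentic minus scripted, per metric; positive means the agentic arm is higher."""
    base, test = arm_metrics(scripted), arm_metrics(agentic)
    return {name: test[name] - base[name] for name in base}
```

A useful reading of the output is directional: the agentic path is worth its extra DeepSeek call only if the conversion delta is positive or the false-positive delta is negative on the same lead population.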

FAQ

Q: What is a multi-step lead qualification sequence? A: It is a structured process where a prospect moves through several automated stages — qualification, opportunity analysis, and a routed next action — before any message is composed or a human is involved. In this architecture the sequence is a typed, ordered task plan over a four-stage vocabulary: qualify, analyze_opportunity, compose, or skip.

Q: How is a multi-step sales agent different from a scripted chatbot? A: A scripted chatbot follows a fixed decision tree; a multi-step agent uses reasoning-driven task decomposition to order the work, then grades each step deterministically. The planner first proposes the sequence, the qualify function grades the lead against a 0.6 cut line between needs_review and qualified, and the fleet's sibling composite scorer assigns an A/B/C/D tier (thresholds 0.80/0.60/0.40).

Q: When should the system hand a lead to a human? A: Always — every outreach is draft-first. The orchestrator's compose node produces a held draft (status: "draft") reached only after a preview confirm-before-mutate interrupt, and nothing sends without human approval. There is no score-tier gate that lets the agent send on its own inside this graph.

Q: How do you keep the agent from sending to someone who opted out? A: The deterministic veto. The safety_gate node runs before the planner and short-circuits to END on any central-suppression, do_not_contact, bounced, unsubscribed, or replied hit. No task plan the LLM proposes can override that — the spec requires action == "skip" in those cases regardless of the plan.

The Road Ahead

The literature on multi-step sales agents is still nascent, and there are no published benchmarks comparing scripted versus agentic qualification on conversion rate, time-to-human, or false-positive handoffs. That gap is the opportunity. The orchestrator described here is built for exactly that experiment: the scripted path and the agentic path share the same graph, and the only difference is whether SALES_SUPPORT_SEQUENCER_REASONING is on. Flip the flag off to validate the baseline, flip it on for a subset of low-risk leads, and the two trajectories become directly comparable.

The practical insight is this: agents that sequence work are not more intelligent than chatbots — they are better structured. The capability in this email_orchestrator graph comes from the four-stage vocabulary, the graded qualification verdict, the four-field provenance, and the safety gate that fires before the LLM ever sees the signal. That structure is what turns a chatbot into a lead-qualification system that sequences work. The next article in the fleet, Lead-to-Proposal Multi-Agent Pipeline (#3), takes the qualified lead as its conceptual starting point and sequences the proposal.

The Autonomous Sales Fleet — full series

This is Part 2 of 10 in a series on building one production autonomous-agentic-sales system on LangGraph + DeepSeek + Cloudflare D1, where each part adds one capability that moves the fleet up the autonomy ladder — from human-triggered assistants to self-directed plan→act→verify loops, gated by autonomy guardrails. The arc runs orchestration → enablement & analytics → campaign strategy → reliability & evaluation.

Orchestration

Autonomous CRM Orchestrator (reason→decompose→act→verify) — autonomy: high
Multi-Step Lead Qualification — high
Lead-to-Proposal Multi-Agent Pipeline — high
Hierarchical Coach→Worker Delegation — high

Enablement & analytics

4. Sales-Enablement Copilot: Deal Coaching & Objection Handling — medium
5. NL-to-SQL CRM Analytics over Cloudflare D1 — medium

Campaign strategy

6. Design-Thinking Expert Panels for Campaign Strategy — medium

Reliability & evaluation — the autonomy guardrails

8. Deadlock & Infinite-Loop Prevention — guardrail
9. Evidence-Driven Release Gates (PROMOTE/HOLD/ROLLBACK) — guardrail
10. Detecting Agent Defects & Drift in Production — guardrail

References

The papers below resolve to a public landing page or DOI. Implementation details (node names, scores, thresholds, the feature flag) are first-party values read from the production agentic-sales codebase, not figures from any paper.

Bogavelli, T., Sharma, R., & Subramani, H. (2025). AgentArch: A Comprehensive Benchmark to Evaluate Agent Architectures in Enterprise.
Yao, S., et al. (2023). ReAct: Synergizing Reasoning and Acting in Language Models.
Schick, T., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools.
Yao, S., Shinn, N., Razavi, P., & Narasimhan, K. (2024). τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains.
Chellappan, R. B. (2024). From Scripted Responses to Contextual Understanding: A Comprehensive Examination of Chatbot Technology. SSRN working paper.
Pahune, S., et al. (2025). The Importance of AI Data Governance in Large Language Models. Big Data and Cognitive Computing.
LangGraph documentation — https://langchain-ai.github.io/langgraph/
DeepSeek API documentation — https://api-docs.deepseek.com/
LangSmith documentation — https://docs.smith.langchain.com/
Cloudflare D1 documentation — https://developers.cloudflare.com/d1/
OWASP Top 10 for LLM Applications (LLM01: Prompt Injection) — https://owasp.org/www-project-top-10-for-large-language-model-applications/
