LLM Lead Conversion-Propensity Scoring for B2B Lead Prioritization

June 5, 2026 · 12 min read

Senior Software Engineer

The published literature on lead scoring converges on a couple of recurring findings. A B2B feature-importance analysis identified lead source and lead status as the most predictive conversion features (Frontiers in AI, 2025). And a supervised classifier trained on labelled outcomes tends to beat both rule-based heuristics and manual qualification. Yet many B2B teams deploying an LLM for lead prioritisation skip the classifier, skip the labelled outcomes, and instead ask the model to reason its way to a score from contact evidence. Is that defensible, or is it cargo-cult AI?

This article describes a research-grounded LangGraph design that does exactly this — a schema-constrained LLM reasoner scoring conversion propensity from features the academic literature found predictive — and argues it is defensible only if you are honest about what you are not building. It dissects the research, the implementation decisions, and the trade-offs of choosing an LLM over a gradient-boosting model for lead scoring. It is a design grounded in the literature, not a deployed system with measured business outcomes.

What Conversion-Propensity Scoring Means for B2B

Conversion-propensity scoring assigns each lead a score (0–1) reflecting its likelihood of becoming a paying customer, given the observable evidence. In B2B this is not trivial: deal cycles span months, multiple stakeholders are involved, and the convert event is rare relative to the lead pool. The Frontiers in AI (2025) case study evaluated fifteen classification algorithms on real B2B CRM data (January 2020 – April 2024); a Gradient Boosting classifier won on accuracy and ROC AUC, and the paper reports it improved the company's ability to identify high-quality leads over traditional methods. That model required four years of historical CRM outcomes, carefully labelled. Most B2B teams simply lack that labelled data.

The gap between ideal and available is where LLM-based propensity scoring enters. Rather than training a classifier, you prompt an LLM to evaluate a lead's evidence against known predictive features and emit a score. The key question: is that score better than a naive baseline (uniform random, or a simple rule like "inbound demo requests first")?

Why Use an LLM Instead of a Traditional Model?

Sanjei et al. (2026) compared a hybrid LLM + LightGBM pipeline against keyword-based intent detection using raw inbound sales emails (DOI 10.1051/itmconf/20268503002). They report that LLM semantic understanding "dramatically outstrips" keyword-based intent detection, attributing the lift to the LLM's ability to understand context, urgency, and sentiment from unstructured text — dimensions that structured firmographic models cannot capture.

The implication is not that LLMs replace supervised models, but that they unlock a different signal source. Traditional lead scoring consumes structured fields (company size, industry, engagement count) because that is what those models ingest. An LLM can consume the same structured fields plus raw text of email threads, call transcripts, and support tickets. Lead Sense AI (Sanjei et al., 2026) is the precedent for using an LLM as the scoring reasoner rather than keyword heuristics, and for emitting intermediate semantic features as evidence.

But the contrarian reality: the LLM is not a classifier. It cannot be trained to maximise AUC on your specific conversion outcome. It is a reasoning engine that simulates a classifier by applying a prompt describing empirically-predictive features from other contexts. The quality of that simulation depends entirely on how well the prompt matches the actual decision boundary — and that is where the research grounding becomes essential.

The Two Features That Lead

The most important paper for anyone building a prompt-based propensity scorer is the B2B lead-scoring study in Frontiers in Artificial Intelligence (2025) (DOI 10.3389/frai.2025.1554325). It ran a case study on real CRM data from a B2B software company (January 2020 – April 2024), evaluating fifteen classification algorithms. Gradient Boosting won on accuracy and ROC AUC. More importantly, its feature-importance analysis found that "source" and "lead status" were the two features that "increased the accuracy of the conversion prediction" above all others.

This finding becomes the core of the LLM prompt. The model is instructed to evaluate the lead primarily by its source (inbound demo request, outbound cold email, referral, event attendee, etc.) and its pipeline stage (MQL, SQL, opportunity). Firmographics and prior engagement act as secondary modifiers. This is a grounding-first decision: the LLM is told to reason from the empirically predictive features, not to invent its own.
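One way to encode that ordering is to build the prompt so the two primary features come first and the modifiers follow. The sketch below is a minimal illustration, not the prompt used in the design: the `Lead` container, its field names, and the instruction wording are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Lead:
    # Hypothetical container; field names are assumptions, not the real schema.
    source: str
    lead_status: str
    firmographics: dict[str, str] = field(default_factory=dict)
    engagement_notes: list[str] = field(default_factory=list)


def build_propensity_prompt(lead: Lead) -> str:
    """List the two primary features first, then the secondary modifiers."""
    firmo = "\n".join(
        f"- {key}: {value}" for key, value in sorted(lead.firmographics.items())
    ) or "- (none)"
    notes = "\n".join(f"- {note}" for note in lead.engagement_notes) or "- (none)"
    return (
        "Estimate the conversion propensity (0 to 1) of this B2B lead.\n"
        "Weigh PRIMARY features first; treat SECONDARY ones as modifiers.\n"
        "Cite each feature you rely on in the evidence list.\n\n"
        "PRIMARY\n"
        f"- lead source: {lead.source}\n"
        f"- lead status: {lead.lead_status}\n\n"
        f"SECONDARY (firmographics)\n{firmo}\n\n"
        f"SECONDARY (prior engagement)\n{notes}\n\n"
        'Respond with JSON: {"propensity": ..., "confidence": ..., '
        '"reason": ..., "evidence": [...]}'
    )


prompt = build_propensity_prompt(
    Lead(
        source="inbound demo request",
        lead_status="MQL",
        firmographics={"company_size": "200"},
        engagement_notes=["opened pricing email twice"],
    )
)
```

Keeping the primary block fixed and separate from the free-text notes makes it easier to check later whether the evidence list actually cites source and lead status.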

The InsureAI paper (2026) reinforces this taxonomy (DOI 10.5281/zenodo.19603829). It scores conversion probability from "demographic, behavioural, and campaign-related attributes" and bins leads into priority tiers. Its abstract reports no headline metric such as F1 or AUC, only the qualitative finding that ML models "effectively support data-driven decision-making, reduce manual effort, and improve customer conversion rates." Its tier binning resembles a _score_to_tier function that maps composite scores into A/B/C/D tiers, so the paper's organisational framework carries over to this implementation.
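As a rough illustration of tier binning, a score-to-tier mapping might look like the sketch below. The function is modelled on the _score_to_tier helper named above, but the cut-offs are hypothetical: neither the paper nor this design states the actual thresholds.

```python
def score_to_tier(score: float) -> str:
    """Map a composite score in [0, 1] to a priority tier from A to D."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    # Illustrative cut-offs only; tune them against your own pipeline.
    if score >= 0.75:
        return "A"
    if score >= 0.50:
        return "B"
    if score >= 0.25:
        return "C"
    return "D"


tiers = [score_to_tier(s) for s in (0.82, 0.50, 0.31, 0.10)]
```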

The Architecture

This design extends an existing score_contact_graph (a single-node LangGraph that already scored role fit via DeepSeek) with a new sub-score: conversion_propensity. The node calls ainvoke_json_validated with a Pydantic schema ConversionPropensity, which enforces:

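from pydantic import BaseModel, Field

class ConversionPropensity(BaseModel):
    propensity: float = Field(ge=0.0, le=1.0)  # estimated likelihood of conversion
    confidence: float = Field(ge=0.0, le=1.0)  # confidence attached to the estimate
    reason: str  # short rationale shown alongside the score
    evidence: list[str] = Field(default_factory=list)  # features the score relied on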

The prompt includes the contact's source, lead_status/stage, firmographics, and prior engagement notes. The LLM (DeepSeek, temperature=0.0, standard tier) outputs JSON like:

{ "propensity": 0.72, "confidence": 0.6, "reason": "inbound demo request from a director",
  "evidence": ["lead source = inbound demo request", "stage = MQL", "company size = 200"] }

This is then blended into the composite score alongside existing sub-scores (seniority, role_fit, reachability). The weights are renormalised to sum to 1.0; the design allocates a modest weight to propensity, carved proportionally from the other three, so that a neutral propensity (0.5) reproduces today's ranking exactly. This avoids regression in the existing decision pipeline.
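The proportional carve can be written as new_composite = (1 - w) * old_composite + w * propensity, where w is the propensity weight. When every lead has propensity 0.5, the new composite is an increasing linear function of the old one, so the ordering is unchanged. The sketch below checks this on invented data; the weight values and lead scores are hypothetical, since the article does not state the real ones.

```python
# Hypothetical weights; the real values are not given in the article.
BASE_WEIGHTS = {"seniority": 0.4, "role_fit": 0.4, "reachability": 0.2}
PROPENSITY_WEIGHT = 0.15


def blended_weights(base: dict[str, float], w_new: float) -> dict[str, float]:
    """Scale the existing weights by (1 - w_new) so the total stays at 1.0."""
    scaled = {name: weight * (1.0 - w_new) for name, weight in base.items()}
    scaled["conversion_propensity"] = w_new
    return scaled


def composite(sub_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum over the sub-scores named in `weights`."""
    return sum(weights[name] * sub_scores[name] for name in weights)


# Invented leads with the three existing sub-scores.
leads = {
    "lead_a": {"seniority": 0.9, "role_fit": 0.6, "reachability": 0.5},
    "lead_b": {"seniority": 0.4, "role_fit": 0.8, "reachability": 0.9},
    "lead_c": {"seniority": 0.2, "role_fit": 0.3, "reachability": 0.7},
}
weights = blended_weights(BASE_WEIGHTS, PROPENSITY_WEIGHT)

old_rank = sorted(
    leads, key=lambda k: composite(leads[k], BASE_WEIGHTS), reverse=True
)
new_rank = sorted(
    leads,
    key=lambda k: composite({**leads[k], "conversion_propensity": 0.5}, weights),
    reverse=True,
)
```

The same argument shows what the new term does once propensities differ: two leads swap places only when their propensity gap, scaled by w, exceeds their old composite gap scaled by (1 - w).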

Crucially, the node fails open. If the LLM call raises LlmDisabledError or a parse error, the helper returns a neutral prior of 0.5 with confidence=0.0 and source="propensity_prior_fallback". The composite then degrades to the old three-term blend. This mirrors the role_fit node's behaviour and ensures the feature does not break the core scoring flow when the LLM is down.
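A fail-open wrapper along these lines might look like the sketch below. It is an assumption-laden illustration, not the project's helper: `LlmDisabledError` is redefined locally as a stand-in, `call_llm` is a hypothetical zero-argument coroutine, and the "llm" source label for successful calls is invented here.

```python
import asyncio
from typing import Awaitable, Callable


class LlmDisabledError(RuntimeError):
    """Local stand-in for the error named in the article."""


def _fallback() -> dict:
    # Neutral prior: carries no ranking information, flagged via `source`.
    return {
        "propensity": 0.5,
        "confidence": 0.0,
        "reason": "LLM unavailable; neutral prior used",
        "evidence": [],
        "source": "propensity_prior_fallback",
    }


async def score_propensity_fail_open(
    call_llm: Callable[[], Awaitable[dict]],
) -> dict:
    """Return the LLM verdict, or the neutral prior if the call fails."""
    try:
        verdict = await call_llm()
        propensity = float(verdict["propensity"])
        if not 0.0 <= propensity <= 1.0:
            raise ValueError(f"propensity out of range: {propensity}")
        return {**verdict, "source": "llm"}
    except (LlmDisabledError, ValueError, KeyError, TypeError):
        # Malformed or out-of-range output is handled like an outage.
        return _fallback()


async def _disabled() -> dict:
    raise LlmDisabledError("LLM disabled")


async def _healthy() -> dict:
    return {"propensity": 0.72, "confidence": 0.6, "reason": "inbound demo", "evidence": []}


down = asyncio.run(score_propensity_fail_open(_disabled))
up = asyncio.run(score_propensity_fail_open(_healthy))
```

Persisting the `source` value alongside the score lets downstream reports separate genuine 0.5 verdicts from fallbacks.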

Why Not Train a Classifier (And Why That's Honest)

Every primary paper cited trains a supervised classifier — Gradient Boosting (Frontiers), LightGBM (Lead Sense), or a dashboard-based model (InsureAI). The supervised path is the right one for production lead scoring, assuming you have labelled conversion outcomes. Many teams do not.

Conversion-rate prediction suffers from selection bias (you only see outcomes for leads you contacted) and data sparsity (most leads never convert). Formal solutions like doubly-robust propensity plus imputation, as explored in a 2025 SIGIR paper on adaptive structure learning for post-click CVR prediction (DOI 10.1145/3726302.3729887), require a training set of known outcomes. A pipeline that has joined almost no lead-to-conversion outcomes to its scoring verdicts cannot honestly fit such a classifier; training on so little would produce a biased or zero-information model.

The LLM reasoner avoids this by not claiming to be a fitted model. It is a zero-shot evaluator using features that other research found predictive, applied to a specific lead. It will not generalise as well as a custom classifier, but it generalises immediately, without labelled data. For B2B teams in the "collecting outcomes" phase, this is a pragmatic entry point.

Propensity vs. Uplift: A Caveat Worth Naming

The design above predicts the probability that a lead converts given current behaviour. A more actionable question for a sales team is: which leads will we cause to convert by contacting them? This is the difference between propensity modeling and uplift modeling. Propensity models can give high scores to leads that would convert anyway, while uplift modeling isolates the incremental effect of the sales intervention. Uplift typically requires A/B-test data or counterfactual estimation, which most teams lack. This is a known limitation to be aware of: a propensity scorer optimises for a different target than uplift, and any team relying on it for "who do I call now?" decisions should keep that distinction in mind.
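A small worked example makes the distinction concrete. Assuming a randomised holdout in which some leads in each segment are deliberately not contacted, uplift is the difference in conversion rate between the contacted and held-out arms. The segment names and counts below are invented for illustration.

```python
def conversion_rate(converted: int, total: int) -> float:
    """Share of leads in one arm that converted."""
    if total <= 0:
        raise ValueError("total must be positive")
    return converted / total


# Hypothetical holdout results as (converted, total) per arm.
SEGMENTS = {
    "inbound_demo": {"contacted": (60, 100), "held_out": (55, 100)},
    "event_attendee": {"contacted": (30, 100), "held_out": (10, 100)},
}


def uplift_by_segment(segments: dict) -> dict:
    """Per segment: conversion rate when contacted, and the incremental lift."""
    result = {}
    for name, arms in segments.items():
        treated = conversion_rate(*arms["contacted"])
        control = conversion_rate(*arms["held_out"])
        result[name] = {"propensity": treated, "uplift": treated - control}
    return result


report = uplift_by_segment(SEGMENTS)
for name, row in report.items():
    print(f"{name}: propensity={row['propensity']:.2f} uplift={row['uplift']:.2f}")
```

In this invented data the inbound segment converts at twice the rate of event attendees, yet most of those leads convert without a call. Ranking by propensity would send reps to inbound first; ranking by uplift would send them to event attendees.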

Trust, Transparency, and the B2B Buyer

Opaque AI decisions erode trust. This implementation addresses that by persisting {confidence, reason, evidence} provenance for every propensity score. The sales rep sees not just a number but a rationale: "propensity=0.72 because the lead is an inbound demo request from a director at a 200-person company (evidence: source=inbound, stage=MQL, company_size=200)." That is a grounded, auditable judgement rather than a black box.

The evidence field also lets the sales rep override the score when they see a missing signal (for example, a referral whose referrer reputation the LLM did not consider). Transparency here is an operational necessity, not only an ethical one.

A Decision Framework for LLM vs. Traditional ML

Based on the research and this design, the following heuristic applies:

Situation	Recommended Approach	Why
You have a substantial set of labelled conversion outcomes	Supervised Gradient Boosting or LightGBM	Wins on accuracy/ROC AUC in the B2B case study (Frontiers 2025); lower per-inference cost
You have unstructured text (emails, chat) but few labels	LLM reasoner (prompt grounded in literature)	Semantic lift from text is the documented advantage (Lead Sense, 2026)
You have no labelled outcomes at all	LLM reasoner (fail-open, hybrid with rule-based composite)	Avoids training a biased classifier; provides immediate prioritisation
You want to maximise ROI per sales action	Uplift model (requires A/B test data)	Optimises incremental lift, not raw propensity
You need explainability for the sales team	LLM reasoner with `{evidence}` provenance	Auditable rationale rather than a black-box number
You process very high volumes with structured fields only	Gradient Boosting	LLM cost scales linearly per contact

This design sits in the second and third rows. It is a bridge between "no data" and "enough data to train a real model." Once conversion outcomes are captured and joined (the deferred roadmap item), a team could calibrate the LLM's propensity against actual outcomes and eventually move to a supervised classifier, demoting the LLM reasoner to a feature generator — the hybrid shape Lead Sense AI describes.
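Once some outcomes are joined, a first calibration check could compare the Brier score of the LLM's propensities against a base-rate baseline, and tabulate predicted versus observed conversion per score bin. The sketch below uses only the standard library and invented data; the bin count and the sample scores are assumptions.

```python
def brier_score(predicted: list[float], outcomes: list[int]) -> float:
    """Mean squared error between predicted propensity and the 0/1 outcome."""
    if len(predicted) != len(outcomes) or not predicted:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - y) ** 2 for p, y in zip(predicted, outcomes)) / len(predicted)


def reliability_table(
    predicted: list[float], outcomes: list[int], n_bins: int = 5
) -> list[tuple[float, float, int]]:
    """Per non-empty bin: (mean predicted, observed conversion rate, count)."""
    bins: list[list[tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(predicted, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[index].append((p, y))
    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            rate = sum(y for _, y in members) / len(members)
            rows.append((round(mean_p, 3), round(rate, 3), len(members)))
    return rows


# Hypothetical LLM propensities and joined conversion outcomes.
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
converted = [1, 1, 0, 0, 0, 0]

base_rate = sum(converted) / len(converted)
baseline_brier = brier_score([base_rate] * len(converted), converted)
model_brier = brier_score(scores, converted)
table = reliability_table(scores, converted)
```

A lower Brier score than the base-rate baseline suggests the propensities carry information. Large gaps between the first two columns of the table indicate the scores should not be read as probabilities without recalibration.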

Practical Takeaways

Ground your prompt in feature-importance research. Don't let the LLM guess which features matter. Anchor it to the known top predictors (source, lead status) from the Frontiers 2025 study and structure the evidence field around them.
Fail open. An LLM call can fail (rate limit, parse error). A neutral prior of 0.5 preserves ranking stability when the model is down.
Renormalise weights. Adding a fourth sub-score to a composite whose weights sum to 1.0 must not change the ranking when propensity is neutral (0.5). A proportional carve gives the new term modest influence without breaking existing behaviour.
Collect the outcomes you're missing. The LLM reasoner is a stopgap. Every lead score should eventually be linked to a conversion event so you can train a real classifier. Formal methods for handling selection bias (e.g., doubly-robust estimation from the SIGIR 2025 literature) become applicable once you have even a small labelled set.
Don't conflate propensity with uplift. If your sales team asks "which lead should I call now?", propensity scoring can give the wrong answer if those leads would convert anyway. Consider a small randomised holdout before committing to a permanent pipeline.

FAQ

Q: What is lead conversion propensity scoring? A: It assigns each lead a score reflecting its likelihood of converting to a customer, based on observable signals. It helps sales teams prioritise the contacts most likely to close.

Q: How does an LLM improve lead scoring? A: An LLM can analyse unstructured data (email conversations, call transcripts, support tickets) alongside structured fields, capturing context and intent that rule-based or regression models miss (Sanjei et al., 2026).

Q: What data do I need for LLM propensity scoring? A: The LLM reasoner can work with the lead's current source, stage, firmographics, and any textual interaction history. Labelled historical outcomes are required for supervised models, not for the reasoner.

Q: What are common pitfalls in LLM-based scoring? A: Treating the score as a fitted probability, ignoring selection bias and data sparsity, conflating propensity with uplift, and assuming the LLM generalises like a trained classifier.

The Broader Implication

LLM-based propensity scoring for B2B prioritisation is not a magic bullet; it is a pragmatic compromise. The literature shows that supervised models, especially Gradient Boosting, remain the stronger choice when labelled data is available. But for many B2B teams, labelled conversion outcomes are some way off. An LLM reasoner, built with a prompt grounded in the features that research identified as predictive, may improve on manual or rule-based methods while remaining explainable.

The next step for this area is not better prompts — it is better data pipelines that capture outcomes, so that the classifiers the papers recommend can eventually be trained. Until then, an honest design deploys an LLM node with transparent provenance, a neutral prior, and a clear roadmap to replace it.

Works Cited

The relevance of lead prioritization: a B2B lead scoring model based on ML (2025). Frontiers in Artificial Intelligence. DOI 10.3389/frai.2025.1554325
InsureAI: Predictive Customer Conversion And Lead Prioritization System (2026). DOI 10.5281/zenodo.19603829
Sanjei, S., Vishal, V. M., & Stanlin Prija, N. (2026). Lead Sense AI: An LLM-Powered ML and NLP System for Automated Sales Email Intent Scoring. ITM Web of Conferences. DOI 10.1051/itmconf/20268503002
Adaptive Structure Learning for Post-Click Conversion Rate Prediction (2025). SIGIR. DOI 10.1145/3726302.3729887
