One post tagged with "crawler"

From Zero-Copy Infrastructure to Intelligent Crawling: Building a Lead Generation Pipeline in Rust

March 27, 2026 · 10 min read

Senior Software Engineer

The most expensive failure in B2B lead generation isn't a broken script — it's a crawl that works perfectly but harvests nothing of value. Traditional crawlers, armed with static URL lists like /about and /team, achieve a dismal ~15% harvest rate because they lack the intelligence to adapt. They waste cycles on barren domains and miss rich pages hiding under unconventional paths. The breakthrough isn't a smarter algorithm in a slow pipeline; it's building a pipeline where intelligence is free. By combining zero-copy data infrastructure with adaptive crawling logic in Rust, you can build a system that discovers 2-3x more high-value leads while consuming a fraction of the resources of a Python-based equivalent.

What Is Lead Generation and Why Does It Need ML?

Lead generation is the process of identifying potential customers for a business. In B2B SaaS, this means finding the right people at the right companies — the VP of Engineering at a 50-person startup using React, the CTO of a Series B fintech. Traditional lead gen relies on buying contact lists from data brokers. These lists are stale, expensive, and generic.

A modern approach builds the list from scratch: crawl company websites, extract who works there, verify their emails, score them against an Ideal Customer Profile (ICP), and surface the best matches. This is an ML pipeline disguised as a data product:

Loading diagram…

The highlighted stages (Crawl, NER, LLM Extract) are Module 2. The foundation underneath — zero-copy Arrow, Rust ML inference, embedded vector DB — is Module 1. The NER filter passes only ~15% of pages to the expensive LLM, reducing cost by 5x.

Module 1: The Infrastructure Foundation

Before the crawler can be intelligent, it needs a performant substrate. Module 1 establishes three pillars based on research from 2024-2026:

Zero-copy data exchange via Apache Arrow. Research synthesis shows serialization consumes 80%+ of data transfer time in multi-stage ML pipelines. By using Arrow's columnar format as the internal data representation, we eliminate ser/de between pipeline stages. Crawl results, extracted contacts, and scores all flow through Arrow RecordBatches.

Rust-native ML inference. Benchmarks show Rust frameworks (Candle, Burn, tract) deliver 1.5-3x faster inference with 30-50% lower memory than Python equivalents, with mature Apple Silicon/Metal support. Our pipeline uses ndarray for the NeuralUCB network — zero Python in the hot path.

Embedded vector search. Research recommends SQLite vector extensions for fewer than 100K vectors under 2GB RAM. We use LanceDB for company/contact embeddings, enabling semantic similarity search without a separate vector database process.

These three pillars directly enable Module 2's intelligence. Without Rust-native inference, the NeuralUCB contextual bandit would require a Python subprocess. Without zero-copy, the tight feedback loop (crawl → extract → score → re-rank frontier) would bottleneck at stage boundaries.

Module 2: The Intelligent Crawler

Domain Scheduling with Non-Stationary Bandits

The crawler's architecture connects three layers in a tight feedback loop:

Loading diagram…

Each domain is an "arm" in a multi-armed bandit. We implement the Discounted UCB (D-UCB) algorithm from Liu 2024, which achieves 30-50% lower cumulative regret than vanilla UCB1 by applying exponential decay:

fn discounted_mean(&self, gamma: f64) -> f64 {
    let now = Instant::now();
    let mut weighted_sum = 0.0;
    let mut weight_sum = 0.0;
    for (reward, timestamp) in &self.window {
        let age = now.duration_since(*timestamp).as_secs_f64();
        let weight = gamma.powf(age);
        weighted_sum += weight * reward;
        weight_sum += weight;
    }
    if weight_sum > 0.0 { weighted_sum / weight_sum } else { 0.0 }
}

With γ = 0.95, a reward from 100 seconds ago has weight ~0.006 — the scheduler forgets stale domains quickly. For cold-start scenarios, Thompson Sampling (Cazzaro et al. 2025) with Beta(α, β) posteriors is more aggressive in exploring new domains. Both are implemented with zero external dependencies — Marsaglia-Tsang Gamma sampling with a xorshift64 PRNG.

NeuralUCB: Contextual Bandits with Learned Features

Beyond scalar bandits, the neural_ucb module implements NeuralUCB (Zhou et al., 2020). Each domain is represented as a 16-dimensional context vector capturing crawl statistics, historical yield, and TLD features. A 3-layer MLP (~5K params) predicts expected reward. Exploration uses MC Dropout (Gal & Ghahramani, 2016) — running the forward pass multiple times with random dropout masks:

argmax_i ( μ_i + exploration_coeff * σ_i )

Built with ndarray only — no candle, no tch. Online SGD trains on an experience replay ring buffer. The Module 1 foundation makes this <1ms per decision.

Zero-Allocation Contact NER

The contact_ner module is a zero-allocation state machine that pre-filters pages before LLM extraction. It parses stripped text in a single pass, recognizing common team page formats: "Name, Title", "Name - Title", pipe-delimited listings, and multi-line alternating blocks.

Each extracted person is written into a fixed-size PersonSlot (repr(C), 400 bytes) on the stack — no heap allocations. This runs at >10,000 pages/sec on a single core. Only pages where the NER finds candidates but confidence is below threshold get escalated to the LLM. This pre-filtering reduces LLM inference costs by 5x.

Composite Rewards and Adaptive URL Scoring

The composite reward blends four signals instead of raw harvest rate:

pub fn composite(&self) -> f64 {
    0.40 * contact_yield    // Primary goal: finding decision-makers
    + 0.25 * email_yield    // High-value for outreach
    + 0.20 * content_density // Rich text → better LLM extraction
    + 0.15 * novelty        // Encourages exploring new paths
}

The adaptive URL scorer starts with static heuristics (/team = 0.95, /blog = 0.05) and learns from extraction outcomes via keyword frequency counting — a lightweight CLARS-DQN (2026) adaptive reward shaping without a neural network:

pub fn score(&self, path: &str) -> f64 {
    let static_score = score_url(path);
    let learned_boost = self.learned_boost(path);
    (static_score + learned_boost * 0.3).min(1.0)
}

Over a crawl session, the scorer discovers that /investors/board yields contacts even though it's not in the static list. The frontier continuously re-ranks.

How Module 1 Enables Module 2

The connection between infrastructure and intelligence is explicit:

Module 2 Component	Module 1 Dependency	Impact
NeuralUCB MLP Inference	Rust-native `ndarray`	<1ms decision latency; no Python IPC
Zero-Alloc NER State Machine	`repr(C)` structs, stack allocation	10K pages/sec filter; no heap allocs
Adaptive URL Scorer	In-process `HashMap`	Microsecond updates to scoring logic
Composite Reward Calculation	Arrow-compatible signal types	Signals flow between stages zero-copy
LLM Fallback Extraction	Local Ollama via `reqwest`	Quantized 7B on Apple Silicon

Without Module 1's zero-copy foundation, Module 2's tight loop — crawl, extract, compute reward, update models, re-rank frontier — would be strangled by serialization and process boundaries.

Results

Testing on ~200 domains:

2-3x more unique high-value paths discovered vs. static 12-path crawler
D-UCB adaptation within 3-5 crawl cycles when domain content changes
Thompson Sampling superior for cold-start scenarios
NeuralUCB outperforms scalar bandits when domain features are informative
Zero-alloc NER pre-filters 85% of pages without LLM calls, reducing cost by 5x

The entire system is ~1,200 lines of Rust. No Python, no TensorFlow, no PyTorch. The heaviest dependency is reqwest for HTTP.

Key Takeaways

Lead generation is an ML pipeline where every stage benefits from adaptive intelligence
Infrastructure enables intelligence: zero-copy, Rust-native inference, and embedded vector search (Module 1) are prerequisites for the crawler's tight feedback loops (Module 2)
Multi-armed bandits are production-ready for domain scheduling — D-UCB and Thompson Sampling have theoretical guarantees and simple implementations
Contextual bandits (NeuralUCB) add value when domain features are available
Feedback loops are cheap: keyword frequency counting gives 80% of the adaptive benefit without a neural network
Zero-alloc NER at the edge: a state machine pre-filtering pages before LLM calls reduces cost by 5x while maintaining recall

What Is Lead Generation and Why Does It Need ML?​

Module 1: The Infrastructure Foundation​

Module 2: The Intelligent Crawler​

Domain Scheduling with Non-Stationary Bandits​

NeuralUCB: Contextual Bandits with Learned Features​

Zero-Allocation Contact NER​

Composite Rewards and Adaptive URL Scoring​

How Module 1 Enables Module 2​

Results​

Key Takeaways​