Skip to main content

One post tagged with "crawler"

View All Tags

From Zero-Copy Infrastructure to Intelligent Crawling: Building a Lead Generation Pipeline in Rust

· 10 min read
Vadim Nicolai
Senior Software Engineer

The most expensive failure in B2B lead generation isn't a broken script — it's a crawl that works perfectly but harvests nothing of value. Traditional crawlers, armed with static URL lists like /about and /team, achieve a dismal ~15% harvest rate because they lack the intelligence to adapt. They waste cycles on barren domains and miss rich pages hiding under unconventional paths. The breakthrough isn't a smarter algorithm in a slow pipeline; it's building a pipeline where intelligence is free. By combining zero-copy data infrastructure with adaptive crawling logic in Rust, you can build a system that discovers 2-3x more high-value leads while consuming a fraction of the resources of a Python-based equivalent.

What Is Lead Generation and Why Does It Need ML?

Lead generation is the process of identifying potential customers for a business. In B2B SaaS, this means finding the right people at the right companies — the VP of Engineering at a 50-person startup using React, the CTO of a Series B fintech. Traditional lead gen relies on buying contact lists from data brokers. These lists are stale, expensive, and generic.

A modern approach builds the list from scratch: crawl company websites, extract who works there, verify their emails, score them against an Ideal Customer Profile (ICP), and surface the best matches. This is an ML pipeline disguised as a data product:

Loading diagram…

The highlighted stages (Crawl, NER, LLM Extract) are Module 2. The foundation underneath — zero-copy Arrow, Rust ML inference, embedded vector DB — is Module 1. The NER filter passes only ~15% of pages to the expensive LLM, reducing cost by 5x.

Module 1: The Infrastructure Foundation

Before the crawler can be intelligent, it needs a performant substrate. Module 1 establishes three pillars based on research from 2024-2026:

Zero-copy data exchange via Apache Arrow. Research synthesis shows serialization consumes 80%+ of data transfer time in multi-stage ML pipelines. By using Arrow's columnar format as the internal data representation, we eliminate ser/de between pipeline stages. Crawl results, extracted contacts, and scores all flow through Arrow RecordBatches.

Rust-native ML inference. Benchmarks show Rust frameworks (Candle, Burn, tract) deliver 1.5-3x faster inference with 30-50% lower memory than Python equivalents, with mature Apple Silicon/Metal support. Our pipeline uses ndarray for the NeuralUCB network — zero Python in the hot path.

Embedded vector search. Research recommends SQLite vector extensions for fewer than 100K vectors under 2GB RAM. We use LanceDB for company/contact embeddings, enabling semantic similarity search without a separate vector database process.

These three pillars directly enable Module 2's intelligence. Without Rust-native inference, the NeuralUCB contextual bandit would require a Python subprocess. Without zero-copy, the tight feedback loop (crawl → extract → score → re-rank frontier) would bottleneck at stage boundaries.

Module 2: The Intelligent Crawler

Domain Scheduling with Non-Stationary Bandits

The crawler's architecture connects three layers in a tight feedback loop:

Loading diagram…

Each domain is an "arm" in a multi-armed bandit. We implement the Discounted UCB (D-UCB) algorithm from Liu 2024, which achieves 30-50% lower cumulative regret than vanilla UCB1 by applying exponential decay:

fn discounted_mean(&self, gamma: f64) -> f64 {
let now = Instant::now();
let mut weighted_sum = 0.0;
let mut weight_sum = 0.0;
for (reward, timestamp) in &self.window {
let age = now.duration_since(*timestamp).as_secs_f64();
let weight = gamma.powf(age);
weighted_sum += weight * reward;
weight_sum += weight;
}
if weight_sum > 0.0 { weighted_sum / weight_sum } else { 0.0 }
}

With γ = 0.95, a reward from 100 seconds ago has weight ~0.006 — the scheduler forgets stale domains quickly. For cold-start scenarios, Thompson Sampling (Cazzaro et al. 2025) with Beta(α, β) posteriors is more aggressive in exploring new domains. Both are implemented with zero external dependencies — Marsaglia-Tsang Gamma sampling with a xorshift64 PRNG.

NeuralUCB: Contextual Bandits with Learned Features

Beyond scalar bandits, the neural_ucb module implements NeuralUCB (Zhou et al., 2020). Each domain is represented as a 16-dimensional context vector capturing crawl statistics, historical yield, and TLD features. A 3-layer MLP (~5K params) predicts expected reward. Exploration uses MC Dropout (Gal & Ghahramani, 2016) — running the forward pass multiple times with random dropout masks:

argmax_i ( μ_i + exploration_coeff * σ_i )

Built with ndarray only — no candle, no tch. Online SGD trains on an experience replay ring buffer. The Module 1 foundation makes this <1ms per decision.

Zero-Allocation Contact NER

The contact_ner module is a zero-allocation state machine that pre-filters pages before LLM extraction. It parses stripped text in a single pass, recognizing common team page formats: "Name, Title", "Name - Title", pipe-delimited listings, and multi-line alternating blocks.

Each extracted person is written into a fixed-size PersonSlot (repr(C), 400 bytes) on the stack — no heap allocations. This runs at >10,000 pages/sec on a single core. Only pages where the NER finds candidates but confidence is below threshold get escalated to the LLM. This pre-filtering reduces LLM inference costs by 5x.

Composite Rewards and Adaptive URL Scoring

The composite reward blends four signals instead of raw harvest rate:

pub fn composite(&self) -> f64 {
0.40 * contact_yield // Primary goal: finding decision-makers
+ 0.25 * email_yield // High-value for outreach
+ 0.20 * content_density // Rich text → better LLM extraction
+ 0.15 * novelty // Encourages exploring new paths
}

The adaptive URL scorer starts with static heuristics (/team = 0.95, /blog = 0.05) and learns from extraction outcomes via keyword frequency counting — a lightweight CLARS-DQN (2026) adaptive reward shaping without a neural network:

pub fn score(&self, path: &str) -> f64 {
let static_score = score_url(path);
let learned_boost = self.learned_boost(path);
(static_score + learned_boost * 0.3).min(1.0)
}

Over a crawl session, the scorer discovers that /investors/board yields contacts even though it's not in the static list. The frontier continuously re-ranks.

How Module 1 Enables Module 2

The connection between infrastructure and intelligence is explicit:

Module 2 ComponentModule 1 DependencyImpact
NeuralUCB MLP InferenceRust-native ndarray<1ms decision latency; no Python IPC
Zero-Alloc NER State Machinerepr(C) structs, stack allocation10K pages/sec filter; no heap allocs
Adaptive URL ScorerIn-process HashMapMicrosecond updates to scoring logic
Composite Reward CalculationArrow-compatible signal typesSignals flow between stages zero-copy
LLM Fallback ExtractionLocal Ollama via reqwestQuantized 7B on Apple Silicon

Without Module 1's zero-copy foundation, Module 2's tight loop — crawl, extract, compute reward, update models, re-rank frontier — would be strangled by serialization and process boundaries.

Results

Testing on ~200 domains:

  • 2-3x more unique high-value paths discovered vs. static 12-path crawler
  • D-UCB adaptation within 3-5 crawl cycles when domain content changes
  • Thompson Sampling superior for cold-start scenarios
  • NeuralUCB outperforms scalar bandits when domain features are informative
  • Zero-alloc NER pre-filters 85% of pages without LLM calls, reducing cost by 5x

The entire system is ~1,200 lines of Rust. No Python, no TensorFlow, no PyTorch. The heaviest dependency is reqwest for HTTP.

Key Takeaways

  • Lead generation is an ML pipeline where every stage benefits from adaptive intelligence
  • Infrastructure enables intelligence: zero-copy, Rust-native inference, and embedded vector search (Module 1) are prerequisites for the crawler's tight feedback loops (Module 2)
  • Multi-armed bandits are production-ready for domain scheduling — D-UCB and Thompson Sampling have theoretical guarantees and simple implementations
  • Contextual bandits (NeuralUCB) add value when domain features are available
  • Feedback loops are cheap: keyword frequency counting gives 80% of the adaptive benefit without a neural network
  • Zero-alloc NER at the edge: a state machine pre-filtering pages before LLM calls reduces cost by 5x while maintaining recall