ScrapeGraphAI Qwen3-1.7B: Fine-Tuned Web Extraction Model and 100k Dataset
Leading cloud extraction APIs run models orders of magnitude larger than the one that just beat them at structured web extraction. This isn't a marginal win — it's a 3.4 percentage point lead on the gold-standard SWDE benchmark. The secret isn't a novel architecture; it's domain-specific fine-tuning on a 100,000-example dataset of real scraping trajectories. The ScrapeGraphAI team's release of a fine-tuned Qwen3-1.7B model flips the conventional scaling law on its head and delivers a complete, Apache 2.0-licensed stack for production. This is a blueprint for how narrow, expert models will outperform generalist giants — if you have the right data.
The Four Artifacts: A Complete Open-Source Stack
Most open-source model releases are just weights. ScrapeGraphAI's contribution is a full-stack data pipeline: raw trajectories, a curated dataset, and two inference-ready model formats. The four core artifacts are interdependent — each is necessary for the next to exist.
- **Full Dataset** (`scrapegraphai/scrapegraphai-100k`): 93,700 rows in Parquet format, each a complete scraping "trajectory" with URL, raw HTML, user prompt, output JSON schema, and the LLM-produced structured output. The 17 fields include diagnostics like `tokens_in`, `latency_ms`, a `success` flag, and a `schema_complexity_score`. This is observability data for a scraping process — not just input-output pairs.
- **Fine-Tuning Split** (`scrapegraphai/scrapegraph-100k-finetuning`): A filtered set of 28,000 high-quality examples (25.2k train, 2.8k test) where `success=True` and complexity clears a threshold. Rows are reformatted as instruction-tuning pairs: system prompt + user message (HTML + extraction prompt + schema) + assistant response (structured JSON).
- **BF16 Model** (`scrapegraphai/sgai-qwen3-1.7b`): The 1.7B-parameter model, fine-tuned with a standard causal LM objective on the 25.2k training examples. Stored in Safetensors format, ready for GPU inference on a single RTX 3090 or an M2 MacBook Pro.
- **GGUF Model** (`scrapegraphai/sgai-qwen3-1.7b-gguf`): A quantized variant compatible with `llama.cpp`, `mlx_lm`, and `ollama`. At roughly 1.1 GB for Q4_K_M quantization, it enables full local inference on consumer Apple Silicon.
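The fine-tuning split's reformatting step — trajectory in, chat-style training example out — can be sketched as a pure function. This is illustrative only: the exact column names and prompt template are not specified in the release, so the field names below (`user_prompt`, `output_schema`, `html`, `result`) and the system prompt are assumptions modeled on the dataset's documented structure.

```python
import json

def trajectory_to_chat(row: dict) -> list[dict]:
    """Convert one scraping trajectory into an instruction-tuning example:
    system prompt + user message (HTML + extraction prompt + schema) +
    assistant response (structured JSON), mirroring the split's format."""
    user = (
        f"Extract data from the page below.\n\n"
        f"Task: {row['user_prompt']}\n"
        f"Target schema: {json.dumps(row['output_schema'])}\n\n"
        f"HTML:\n{row['html']}"
    )
    return [
        {"role": "system",
         "content": "You are a structured web-extraction assistant. Reply with JSON only."},
        {"role": "user", "content": user},
        {"role": "assistant", "content": json.dumps(row["result"])},
    ]

# Toy row standing in for one dataset trajectory
row = {
    "user_prompt": "Get the product name and price",
    "output_schema": {"name": "string", "price": "number"},
    "html": "<div class='p'><h1>Widget</h1><span>$9.99</span></div>",
    "result": {"name": "Widget", "price": 9.99},
}
messages = trajectory_to_chat(row)
```

Frameworks like TRL or Axolotl accept exactly this list-of-messages format for supervised fine-tuning.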
The stack's value is in its lineage. You can trace a prediction back through the model, to the fine-tuning data, to the original HTML and the frontier LLM (from various providers) that generated the training label. This audit trail is uncommon and critical for debugging production extraction systems.
The Training Pipeline: From Raw Data to Deployed Artifact
The dataset's power isn't its size — 100k examples is modest by modern standards. It's the provenance. Each example was generated by running the open-source ScrapeGraphAI library against a real website. The library uses a graph-based approach to reason over HTML and extract structured data, invoking various frontier LLMs as the "extraction engine." This captures real-world failure modes: cluttered HTML, dynamic content, anti-bot markup, and ambiguous schemas.
The SFT step is straightforward: standard next-token prediction loss on instruction-formatted pairs. No RLHF, no DPO. The base model already has language and structural capabilities; fine-tuning teaches it one specific input-output mapping. The model_used field allows filtering for examples solved by specific base models, enabling targeted knowledge distillation. The schema_complexity_score prevents the model from wasting capacity on trivial extractions.
At 1.7B parameters the training runs in 4–8 hours on an A100. The BF16 model is the primary artifact — full precision, runs on a single RTX 3090 or M2 MacBook Pro. The GGUF export is the deployment artifact. Q4_K_M quantization drops it to ~1.1 GB while preserving most accuracy.
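The size figures follow from bytes-per-parameter arithmetic. A back-of-envelope sketch, assuming Q4_K_M averages roughly 4.5 bits per weight (the exact ratio varies by tensor, and the published file lands near 1.1 GB because some tensors stay at higher precision):

```python
params = 1.7e9

bf16_gb = params * 2 / 1e9        # BF16: 2 bytes per parameter -> 3.4 GB
q4_gb = params * 4.5 / 8 / 1e9    # Q4_K_M: ~4.5 bits per parameter -> ~1 GB

print(f"BF16: {bf16_gb:.1f} GB, Q4_K_M: {q4_gb:.2f} GB")
```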
The Benchmark: Specialization Beats Scale
On SWDE (80 websites, 8 domains, 124k field annotations), the fine-tuned 1.7B model achieves 94.6% field-level F1. Leading cloud extraction APIs at time of publication scored 91.2%. On SyntheticWebExtract, schema accuracy is 89.3% vs. 87.1% for cloud APIs.
A 3–4 point gap sounds modest until you understand SWDE context: top performers have historically been separated by fractions of a point. The explanation is structural. Web extraction for a fixed schema is a constrained language game — HTML in, JSON out. The 25.2k training examples densely cover this narrow problem space. A large generalist has capacity spread across every task it was trained on; the fine-tuned model has all capacity focused on one. The cost implication is equally significant: for a pipeline processing 100,000 pages a day, shifting from a cloud API to a local GGUF means the difference between $2,000 in daily API costs and effectively zero.
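The cost claim is easy to reproduce. A sketch assuming a per-page API price of $0.02 — the figure implied by $2,000/day over 100,000 pages; actual API pricing varies by provider and page size:

```python
pages_per_day = 100_000
price_per_page = 0.02            # assumed; implied by the $2,000/day figure

api_daily = pages_per_day * price_per_page
api_monthly = api_daily * 30

print(f"API: ${api_daily:,.0f}/day, ${api_monthly:,.0f}/month "
      f"vs. ~$0 marginal cost for a local GGUF")
```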
The Competitor Landscape: Three Research Bets
Three recent papers exemplify alternative approaches to the same structured extraction problem:
AXE focuses on pre-processing. A lightweight classifier prunes irrelevant DOM subtrees before passing cleaned content to any frozen LLM. Its key insight: most of the HTML is noise, and context window limits are the real bottleneck. Achieves 88.10% F1 on SWDE using a 0.6B model. Weakness: pruning can discard relevant data; no fine-tuning means no domain adaptation and a lower accuracy ceiling.
Dripper reframes extraction as sequence labeling (NER-style) directly on the HTML token stream. Instead of generating JSON, the model tags which spans correspond to which schema fields. This is deterministic and avoids hallucinations. Claims competitive performance with leading cloud APIs using a 0.6B model. Weakness: sequence labeling is brittle on deeply nested, multi-level schemas that are naturally expressed as JSON.
SLOT attacks the output: a 7B model generates an initial extraction, then a separate post-processing model corrects schema violations — missing required fields, type mismatches, extra keys. It achieves 99.5% schema accuracy — best in class. Weakness: a two-stage pipeline with 7B+ total parameters, and correspondingly higher latency and memory footprint.
The ScrapeGraphAI model sits in the middle: end-to-end fine-tuning of a compact model for direct JSON generation. It beats AXE on accuracy (94.6% vs. 88.1%), matches Dripper's cloud-API parity, and achieves high schema accuracy without SLOT's separate repair stage.
Local Inference: The GGUF Path to Sovereign Extraction
```bash
# Serve locally on an M-series Mac
mlx_lm.server --model scrapegraphai/sgai-qwen3-1.7b-gguf --port 8080
```
You now have a dedicated extraction API behind your firewall. No data leaves your infrastructure — a non-negotiable requirement for pipelines touching PII, competitive intelligence, or compliance-sensitive content. The cost model shifts from per-API-call to fixed hardware amortization.
On an M1 Pro: ~40 tokens/second. A 200-token output thus takes about 5 seconds to generate, so 1,000 pages take roughly 85 minutes of sequential inference — less with batching or parallel workers. At 100k extractions/month the API savings justify a dedicated Mac Mini in weeks. The architecture also changes: instead of a fragile dependency on an external API's availability, cost, and rate limits, you deploy a container with the GGUF file and an inference server — a private utility as reliable as your own infrastructure.
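Under the stated numbers (40 tokens/second, 200-token outputs), the sequential throughput works out as follows; real deployments improve on this with batching and multiple workers:

```python
tok_per_sec = 40        # observed M1 Pro generation speed (per the text)
tokens_per_page = 200   # typical structured-output length
pages = 1_000

sec_per_page = tokens_per_page / tok_per_sec   # 5 s of generation per page
total_min = pages * sec_per_page / 60          # ~83 min, single sequential stream

print(f"{sec_per_page:.0f} s/page, {total_min:.0f} min for {pages} pages")
```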
Production Pipeline: Where Specialized LLMs Meet Traditional Scrapers
The real test of any model is how it fits into a production system. The agentic lead generation platform at agenticleadgen.xyz — source on GitHub — uses a hybrid architecture combining traditional scrapers and LLM-based extraction.
The pipeline's fast path uses Rust's scraper crate for zero-allocation DOM parsing, a custom finite-state machine for email extraction at ~100 pages/second, and JSON-LD / Open Graph parsers for structured metadata. The sgai-qwen3-1.7b GGUF handles the flexible schema extraction that rule-based systems cannot: structured company profiles from diverse "About Us" pages where field names and HTML structures vary. The custom NER model (92.3% F1) and the LLM extraction step share the same M1 infrastructure — no additional API spend.
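The fast path's email extraction can be approximated in a few lines. The production system uses a Rust finite-state machine; this regex-based Python analogue is a simplified sketch of the same idea — deterministic, cheap, and LLM-free:

```python
import re

# Simplified email pattern; a production FSM handles many more edge cases
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(html: str) -> list[str]:
    """Rule-based fast path: scan raw HTML for email addresses, dedupe, sort."""
    return sorted(set(EMAIL_RE.findall(html)))

html = "<footer>Contact: sales@example.com or support@example.com</footer>"
emails = extract_emails(html)
```

Everything this path can handle never reaches the LLM, which is reserved for the variable-schema extractions rules can't express.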
This is the production reality of specialized models: not a monolithic LLM call, but a directed graph of specialized tools. Rule-based scrapers handle high-throughput predictable tasks; fine-tuned LLMs handle variable, schema-driven tasks. The 1.7B GGUF model fits this pattern — small enough to run alongside the Rust pipeline on the same hardware.
Decision Framework: When to Use Which Tool
Use sgai-qwen3-1.7b (local GGUF) when:
- Volume is high (>10k extractions/month) — API savings amortize setup quickly.
- Data sensitivity is high (PII, competitive intelligence) — on-prem inference is required.
- Domain is narrow — fine-tune further on a domain-filtered subset for additional accuracy lift.
- Infrastructure is Apple Silicon or consumer GPU — GGUF drops in with zero infrastructure changes.
Use a frontier API when:
- Volume is low and sporadic — setup cost of a local model isn't justified.
- Schemas change constantly — need zero-shot generalization from a description.
- You lack ML/Ops bandwidth to manage model deployment and quantization.
- Pipeline already works — don't rewrite what clears your accuracy thresholds.
Use traditional scrapers (Scrapy, Rust scraper) when:
- Website structure is simple and stable.
- Throughput and latency are paramount (thousands of pages per second).
- Extraction logic is trivial (e.g., get all `href` attributes).
The sweet spot for sgai-qwen3-1.7b is the middle ground: tasks too variable for rule-based scrapers, too expensive and sensitive for constant API calls.
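The framework above condenses into a dispatch function. The thresholds and priority order here are illustrative assumptions drawn from the bullets, not prescriptions:

```python
def choose_extractor(monthly_volume: int, sensitive: bool,
                     stable_structure: bool, schema_churn: bool) -> str:
    """Route an extraction workload to a tool, per the decision framework."""
    if stable_structure:
        return "traditional scraper"            # simple, stable pages: rules win
    if sensitive or monthly_volume >= 10_000:
        return "sgai-qwen3-1.7b (local GGUF)"   # PII or high volume: local wins
    if schema_churn:
        return "frontier API"                   # needs zero-shot generalization
    return "frontier API"                       # low, sporadic volume

choice = choose_extractor(monthly_volume=150_000, sensitive=True,
                          stable_structure=False, schema_churn=False)
```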
Practical Implementation: Domain-Specific Fine-Tuning
```python
from datasets import load_dataset

ds = load_dataset("scrapegraphai/scrapegraphai-100k")

# Filter to your vertical
ecomm = ds['train'].filter(
    lambda x: x['success']
    and 'product' in x['user_prompt'].lower()
    and any(d in x['url'] for d in ['amazon.com', 'bestbuy.com', 'newegg.com'])
)
# Feed to Axolotl or TRL against Qwen3-1.7B for a vertical-specialist model
```
The 1.7B parameter count makes domain fine-tuning feasible on a single consumer GPU in a few hours. Filter by schema_complexity_score to keep only the cleanest examples, or by model_used to train exclusively on outputs from specific base models.
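The same filtering pattern applies to the quality fields. A toy sketch on in-memory rows — the threshold and the teacher-model label are illustrative assumptions, not values from the dataset:

```python
rows = [
    {"schema_complexity_score": 0.9, "model_used": "teacher-a", "success": True},
    {"schema_complexity_score": 0.2, "model_used": "teacher-b", "success": True},
    {"schema_complexity_score": 0.8, "model_used": "teacher-a", "success": False},
]

# Keep only successful, non-trivial extractions labeled by a chosen teacher model
clean = [r for r in rows
         if r["success"]
         and r["schema_complexity_score"] >= 0.5   # illustrative threshold
         and r["model_used"] == "teacher-a"]
```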
The Broader Implication
The ScrapeGraphAI paper is a case study in a repeating pattern: an open-source tool generates labeled trajectories at scale, those trajectories fine-tune a compact base model, the result outperforms general-purpose APIs on the narrow task and runs locally. The barrier to high-accuracy domain AI is no longer pre-training compute — it's the discipline to collect and curate task-specific data. For any repetitive, well-defined extraction task, the question is no longer "which API?" but "do I have 10–25k clean examples for a 1–3B fine-tune?" If yes, the specialized model wins on accuracy, cost, and data sovereignty.
