ScrapeGraphAI Qwen3-1.7B: Fine-Tuned Web Extraction Model and 100k Dataset
Leading cloud extraction APIs are orders of magnitude larger than the model that just beat them at structured web extraction. This isn't a marginal win — it's a 3.4 percentage point lead on the de facto standard SWDE benchmark. The secret isn't a novel architecture; it's domain-specific fine-tuning on a 100,000-example dataset of real scraping trajectories. The ScrapeGraphAI team's release of a fine-tuned Qwen3-1.7B model flips the conventional scaling law on its head and delivers a complete open-source stack (model and dataset under Apache 2.0, library under MIT) for production. This is a blueprint for how narrow, expert models will outperform generalist giants — if you have the right data.
The Four Artifacts: A Complete Open-Source Stack
Most open-source model releases are just weights. ScrapeGraphAI's contribution is a full-stack data pipeline: raw trajectories, a curated dataset, and two inference-ready model formats. The four core artifacts are interdependent — each is necessary for the next to exist.
- Full Dataset (`scrapegraphai/scrapegraphai-100k`): 93,700 rows in Parquet format, each a complete scraping "trajectory" with URL, raw HTML, user prompt, output JSON schema, and the LLM-produced structured output. The 19 fields include diagnostics like `execution_time`, `response_size`, a `response_is_valid` flag, and a `schema_complexity_score`. This is observability data for a scraping process, not just input-output pairs.
- Fine-Tuning Split (`scrapegraphai/scrapegraph-100k-finetuning`): A curated set of 28,000 high-quality examples (25.2k train, 2.8k test), filtered by character-length limits (schema ≤10k chars, content ≤50k chars, response ≤10k chars). Rows are reformatted as instruction-tuning pairs: system prompt + user message (HTML + extraction prompt + schema) + assistant response (structured JSON).
- BF16 Model (`scrapegraphai/sgai-qwen3-1.7b`): The 1.7B-parameter model, fine-tuned via QLoRA (4-bit quantized base, LoRA rank 16 on all linear projections) on the 25.2k training examples. Stored in Safetensors format, ready for GPU inference on a single RTX 3090 or an M2 MacBook Pro.
- GGUF Model (`scrapegraphai/sgai-qwen3-1.7b-gguf`): A quantized variant compatible with `llama.cpp`, `mlx_lm`, and `ollama`. At roughly 1.1 GB for Q4_K_M quantization, it enables full local inference on consumer Apple Silicon.
The stack's value is in its lineage. You can trace a prediction back through the model, to the fine-tuning data, to the original HTML and the frontier LLM (from various providers) that generated the training label. This audit trail is uncommon and critical for debugging production extraction systems.
One number in the dataset is worth pausing on: 93,700 raw rows filtered down to 28,000 fine-tuning examples — a 70% discard rate. The paper's filter is character-length limits: oversized schemas, content, or responses are excluded. The field response_is_valid flags whether the extraction succeeded, and the raw dataset retains invalid rows for failure-mode research. The model was trained exclusively on clean successes, which means its exposure to messy, adversarial, or ambiguous inputs came only from whatever fraction of "successes" happened to involve difficult pages — not from explicit hard-negative training.
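The filtering step described above is simple enough to sketch. A minimal version, assuming row dictionaries shaped like the dataset's fields (the thresholds come from the release notes; the helper and sample rows are illustrative, not the team's code):

```python
# Sketch of the character-length filter; illustrative, not the team's code.
def passes_length_filter(row: dict) -> bool:
    """Keep only valid rows whose schema, content, and response fit the limits."""
    return bool(
        row["response_is_valid"]            # successful extraction only
        and len(row["schema"]) <= 10_000    # schema <= 10k chars
        and len(row["content"]) <= 50_000   # page content <= 50k chars
        and len(row["response"]) <= 10_000  # structured output <= 10k chars
    )

rows = [
    {"response_is_valid": True, "schema": "{}", "content": "<html></html>", "response": "{}"},
    {"response_is_valid": False, "schema": "{}", "content": "<html></html>", "response": ""},
    {"response_is_valid": True, "schema": "x" * 20_000, "content": "", "response": "{}"},
]
kept = [r for r in rows if passes_length_filter(r)]
print(len(kept))  # 1 -- only the first row survives
```

Run against the real dataset, a filter like this is what turns 93,700 raw rows into the 28,000-example fine-tuning split.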
How Fine-Tuning Reshapes the Extraction Task
The Qwen3-1.7B instruct model knows language. Fine-tuning teaches it one specific job.
The key shift is from in-context learning to in-weights learning. A base model prompted to extract data must figure out what to ignore, what structure to follow, and what format to emit — every call, from a fresh context window. The fine-tuned model has internalized those rules:
- Noise suppression — navigation bars, cookie banners, and ad markup are structurally similar across millions of pages in the training set. The model learns to treat them as irrelevant without being told.
- Semantic block recognition — repeating elements like product cards or job listings share semantics even when HTML class names differ site to site. SFT on diverse trajectories builds this abstraction into weights.
- Format consistency — the 25.2k training examples are all instruction-formatted pairs with valid JSON as the target. The model's output distribution shifts toward valid, schema-conforming JSON without explicit post-processing.
The result is faster inference (shorter prompts), higher reliability on unseen layouts, and no per-call cost for system prompt tokens.
The Task Geometry Argument: Why 25,000 Examples Is Enough
Dataset size requirements scale inversely with task entropy. Generative tasks — creative writing, open-domain QA — have enormous output spaces; millions of examples barely constrain the distribution. Structured extraction is nearly deterministic: given HTML and a target field, the correct answer is either present or absent, and its form is governed by the schema. The output space is a small, well-bounded region of token space. This means 25.2k trajectory pairs are sufficient to cover the structural variance in real web markup across the target domain verticals. Beyond this point, additional examples yield diminishing returns — you are adding redundancy, not new coverage of the problem space.
Starting from the Qwen3-1.7B instruct checkpoint rather than a raw base model is the correct initialization for a related reason. The instruct checkpoint already encodes instruction-following behavior — format compliance, role separation — via Qwen's alignment training. SFT on top acts as a lightweight re-weighting of existing capabilities toward a specific input-output manifold, not a capability acquisition step from scratch. Fine-tuning from a raw base model would require the optimizer to simultaneously learn instruction semantics and task behavior, competing for gradient budget.
The absence of RLHF or DPO is similarly justified. Both techniques address problems that arise when the reward function is expensive to evaluate or requires human preference signals. Here, the reward function is a JSON schema validator plus a field-level string match — cheap, deterministic, and verifiable. The extraction task has a ground truth oracle; leverage it directly through data filtering rather than preference optimization. The 70% discard rate is functioning as a data-side reward model: curating the loss landscape before training begins is substantially more effective than post-hoc preference optimization for tasks with verifiable outputs. Next-token prediction on noisy data does not average out errors — it memorizes them. One malformed JSON trajectory can corrupt the model's output grammar for structurally similar inputs. High-precision filtering at data construction time removes the need for alignment techniques applied to underspecified objectives.
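To make the "ground truth oracle" concrete, here is a minimal sketch of such a verifier: a JSON parse, a required-field check, and a field-level string match. The function name and data shapes are illustrative, not part of the release:

```python
import json

def verifiable_reward(raw_output: str, schema: dict, reference: dict) -> bool:
    """Deterministic stand-in for the oracle: parse the output, check
    required fields, then string-match each field against the reference."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    required = schema.get("required", [])
    if any(field not in parsed for field in required):
        return False
    return all(str(parsed[f]) == str(reference[f]) for f in required)

schema = {"required": ["name", "price"]}
reference = {"name": "Widget", "price": "9.99"}
print(verifiable_reward('{"name": "Widget", "price": "9.99"}', schema, reference))  # True
print(verifiable_reward('{"name": "Widget"}', schema, reference))                   # False
```

Because this check is cheap and deterministic, it can run over every candidate trajectory at data-construction time, which is exactly the "data-side reward model" role the discard rate plays.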
From Raw Data to Training Signal
That conceptual shift only delivers results if the training data faithfully reflects the real-world distribution. Here the provenance of the dataset is critical — and it is what separates this release from typical NLP benchmarks.
The dataset wasn't built by academics writing perfect schemas for static HTML. Every row is a live execution record from the open-source ScrapeGraphAI library solving a real developer request against a real website. The library invokes LLMs (296 distinct models in the raw dataset, from GPT-4o-mini to Gemma to Llama3) as the "extraction engine" and logs the full trajectory — URL, raw HTML, user prompt, output JSON schema, and the structured output. This captures real-world page diversity: cluttered HTML, dynamic content, anti-bot markup, ambiguous schemas. The noise profile is production noise — inconsistent nesting, CMS boilerplate, tracking pixels, A/B testing variants. Models trained on this data encounter the same input distribution they will face in deployment. That alignment is worth more than additional scale on a cleaner but synthetic corpus.
It also introduces a dependency that matters: the quality of every training label is bounded by the quality of the teacher model that produced it. The fine-tuning labels were regenerated by GPT-5-nano across all filtered samples — a single-teacher distillation, not human-annotated ground truth.
The dataset's diagnostic fields are an engineering asset beyond the training pairs. `execution_time` and `response_size` enable cost-stratified training subsets. `schema_complexity_score` enables curriculum ordering: flat single-entity schemas before nested arrays and recursive references. The `source` and `content` fields are replayable: re-run prompts with perturbations, build contrastive pairs, or audit specific site types. The `llm_model` field (296 distinct values) enables targeted distillation: train only on trajectories solved by a specific teacher model, or exclude outputs from weaker ones.
The fine-tuning uses QLoRA: 4-bit quantized base weights with LoRA adapters (rank 16) on all linear projections, trained with completion-only cross-entropy loss — gradients flow only through the assistant's response tokens. No RLHF, no DPO on top of the fine-tune (though the Qwen3-1.7B instruct base itself was aligned via GRPO). At 1.7B parameters the training runs in a few hours on a single A100. The BF16 model is the primary artifact; the GGUF export is the deployment artifact.
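As a sketch only, the described recipe maps onto a typical `peft`/`transformers` QLoRA configuration like the following. The `lora_alpha` value is an assumption; the source states only the rank and target modules, and this is not the team's training script:

```python
# Illustrative QLoRA configuration matching the described hyperparameters
# (4-bit NF4 base, LoRA rank 16 on all linear projections).
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                 # quantized base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
lora_config = LoraConfig(
    r=16,                              # LoRA rank, per the release
    lora_alpha=32,                     # assumed; not stated in the source
    target_modules="all-linear",       # adapters on all linear projections
    task_type="CAUSAL_LM",
)
# Training would pair these with TRL's SFTTrainer and a completion-only
# collator so gradients flow only through assistant-response tokens.
```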
This approach creates a compounding feedback loop: the open-source library generates trajectories, trajectories fine-tune a student model, the better model ships back into the library, and the improved library generates higher-quality future traces. The critical engineering requirement is versioning — every trace must record which library version and which teacher model produced it, or the training signal becomes confounded as the distribution shifts across releases.
The 65,700 discarded failure rows represent a missed opportunity. Used as hard negatives in contrastive training, they directly teach the model the failure modes it needs to avoid. Used with partial-credit labels — a strong LLM scoring extraction completeness on 0–1 rather than binary success/failure — they enable curriculum learning through graded partial failures. The failure pile should be the first thing any follow-on fine-tuning effort audits.
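The partial-credit idea above can be sketched as a tiny scorer that grades field-level completeness on [0, 1] instead of binary success/failure. This illustrates the proposal only; nothing like it ships in the release:

```python
# Illustrative partial-credit scorer over reference fields.
def partial_credit(extracted: dict, reference: dict) -> float:
    """Fraction of reference fields the extraction got exactly right."""
    if not reference:
        return 0.0
    correct = sum(extracted.get(k) == v for k, v in reference.items())
    return correct / len(reference)

reference = {"name": "Acme", "year": "2011", "employees": "40"}
print(partial_credit({"name": "Acme", "year": "2011"}, reference))  # ~0.67 (2 of 3 fields)
print(partial_credit({"name": "Acme"}, reference))                  # ~0.33
```

Graded labels like these would let the discarded failure rows enter a curriculum rather than sit unused.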
The Benchmark — and Its Limits
The ScrapeGraphAI team reports that the fine-tuned 1.7B model achieves 94.6% field-level F1 on SWDE (80 websites, 8 domains, 124k web pages), versus 91.2% for leading cloud extraction APIs. Important caveat: the cited arXiv paper (2602.15189) does not evaluate on SWDE; it evaluates on the project's own held-out test set, where the fine-tuned 1.7B reaches 88.7% Key F1 versus 89.2% for Qwen3-30B. The 94.6% SWDE figure and the cloud API comparison appear in the team's communications but are not reproduced in the paper itself. Verify the source of these benchmark numbers independently before relying on them.
A 3–4 point gap sounds modest until you understand SWDE context: top performers have historically been separated by 1–4 percentage points. The explanation is structural — web extraction for a fixed schema is a constrained language game, and the fine-tuned model has all capacity focused on one narrow problem space while a large generalist spreads capacity across every task it was trained on. The cost implication compounds this: for a pipeline processing 100,000 pages a day, shifting from a cloud API to a local GGUF means the difference between $1,000–$2,000 in daily API costs and effectively zero.
The results are compelling. They also have gaps worth naming before deploying anything to production.
SWDE is a 2011 dataset. Its 80 websites were crawled as static HTML snapshots over a decade ago. The modern web is 60–80% JavaScript-rendered: React, Vue, and Angular SPAs assemble content client-side, and SWDE contains zero SPAs. A model that scores 94.6% on SWDE may score significantly lower on contemporary pages where the target content lives in dynamically injected `<div>` trees.
The cloud API comparison is unnamed. The paper reports "leading cloud extraction APIs" at 91.2% F1 but doesn't identify them by name, version, or prompt configuration. Extraction API performance is highly prompt-sensitive — the same API with a carefully engineered system prompt versus a generic one can differ by 5–10 F1 points. Without knowing the exact configuration, the 3.4pp gap cannot be independently reproduced.
The teacher LLM ceiling. The raw trajectories were generated by 296 different LLMs, and the fine-tuning labels were regenerated by GPT-5-nano. The fine-tuned model inherits GPT-5-nano's systematic biases on this task. When the teacher hallucinated a plausible-looking structured output, that hallucination entered the training set as ground truth. The response_is_valid flag reflects whether the library returned parseable output — not whether the extracted data was factually correct.
No adversarial or anti-bot evaluation. Both benchmarks test extraction from cooperating HTML. Production scraping must contend with Cloudflare Turnstile, DOM obfuscation, and shadow DOM. The reported scores have zero signal on the dimensions that cause the most real-world failures.
Train/test distribution overlap. The 2,808-example test split is drawn from the same crawl pipeline as the training set. When training and test data share website origin and HTML formatting quirks, the evaluation measures in-distribution generalization at best and memorization at worst. A credible evaluation would require a temporal split with out-of-distribution domains.
None of this invalidates the release. The artifacts are real, the code runs, and the benchmark improvement over prior approaches is genuine. But treating 94.6% SWDE F1 as a proxy for production reliability is the wrong conclusion to draw.
The Competitor Landscape
Three recent papers attack adjacent problems: AXE prunes irrelevant DOM subtrees before passing them to an LLM, achieving 88.1% F1 on SWDE with a 0.6B model and zero-shot cross-domain transfer via specialized adaptors. Dripper reframes main content extraction (boilerplate removal, not schema-driven structured extraction) as constrained sequence labeling on HTML tokens — deterministic and hallucination-free, rivaling frontier models at 0.6B. SLOT is a general-purpose method for structuring LLM outputs (not web-extraction-specific): a fine-tuned 7B post-processing model repairs schema violations, achieving 99.5% schema accuracy at the cost of a two-stage pipeline and higher latency.
The ScrapeGraphAI model sits in the middle: end-to-end fine-tuning of a compact model for direct JSON generation. It beats AXE on accuracy (+6.5pp), matches Dripper's cloud-API parity, and achieves high schema accuracy without SLOT's separate repair stage — all in a single 1.7B model that runs locally.
The Execution Trace Paradigm: A Template Repeating Across Domains
The ScrapeGraphAI result is a proof-of-concept for a pattern that will compress cloud AI economics across domain after domain: an open-source tool running against real inputs logs its execution traces, those traces fine-tune a compact student model, and the student outperforms generalist giants on the narrow task at a fraction of the inference cost.
The pattern works when three conditions align: the task boundary is crisp, correctness is at least partially verifiable, and the training distribution closely matches deployment. Web extraction satisfies all three. Where the pattern fails is in tasks requiring compositional generalization across domains, novel reasoning under distribution shift, or open-ended judgment. But for well-scoped, high-frequency production tasks — the kind enterprises actually run at scale — narrow experts will systematically win on cost, latency, and reliability.
The dataset is the real moat, and it is data-based rather than compute-based. Compute can be rented; proprietary execution traces from a live production system cannot be replicated quickly by an outsider. The Apache 2.0 license on both model and dataset accelerates this further: it eliminates legal friction for enterprise adoption, attracts contributors who generate more trajectory data, and preempts any would-be follower's plan to hoard proprietary training data as a differentiator.
The global availability of strong base models — Qwen3 from Alibaba, Llama from Meta — has collapsed the capability gap between a well-resourced startup and a frontier lab to the domain-specific data problem alone. Any open-source library with deterministic outputs — compilers, query planners, test runners, API clients — is now a latent dataset waiting to be materialized. The teams that recognize this first in their domain will move years ahead of those waiting for a labeled dataset to appear.
What this release doesn't resolve is the production gap. How much of the 94.6% SWDE score transfers to JavaScript-heavy, bot-defended pages that dominate real scraping workloads? The benchmark gap is real; the production gap is unmeasured. Treat the SWDE score as a prior, not a deployment signal. For any repetitive, well-defined extraction task on cooperating HTML, the question is no longer "which API?" but "do I have 10–25k clean execution traces for a 1–3B fine-tune?"
The HTML Acquisition Layer: What No Benchmark Measures
Every benchmark in this paper — SWDE, SyntheticWebExtract — assumes the HTML arrives clean, complete, and in memory. In production, that assumption fails approximately 20–40% of the time before any extraction model runs.
Getting consistent HTML from real websites requires solving problems orthogonal to extraction quality: rotating residential proxies, managing session state and cookie consent dialogs, handling 429 rate limits and Cloudflare 1020 challenge pages, retrying across geographies when an IP is flagged. When production scraping engineers say a pipeline is "broken," they almost never mean the extraction logic failed — they mean the HTML never arrived.
JavaScript-rendered content compounds this. A conservative estimate puts 60–80% of modern web content behind client-side rendering: React, Vue, and Angular applications that ship minimal HTML shells and inject target data into the DOM after async fetches resolve. Running the GGUF model against the raw server response from a Next.js product page returns `<div id="__next"></div>` and nothing useful. Playwright or headless Chrome is not optional middleware for modern web targets; it is the data source. The model sits downstream of the browser, not as a replacement for it.
DOM obfuscation is the least-discussed structural threat to any fine-tuned extraction model. Build pipelines hash class names (class="x7f3k" instead of class="price"). A/B testing platforms restructure DOM layout weekly. Frameworks replace semantic class names with data-testid attributes carrying no semantic meaning. A model trained on a site's HTML from six months ago will silently return wrong values on that same site today — the output is still valid JSON, just mapped from the wrong spans. Rule-based scrapers break loudly when selectors miss. Fine-tuned models drift quietly. This requires ground-truth eval pipelines that re-crawl production targets regularly, not a one-time benchmark run.
The practical deployment pattern: the GGUF receives a focused, pre-rendered DOM fragment — not a full raw page — after an HTML normalization layer has stripped scripts, analytics pixels, and navigation boilerplate. A lightweight classifier prunes irrelevant DOM subtrees before the model ever sees content. The model's job is extraction from structured content; rendering and noise filtering are upstream engineering problems with separate failure modes and separate monitoring requirements.
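A minimal version of that normalization layer can be sketched with the standard library alone. A production pipeline would use a real DOM library (lxml, selectolax) and a learned pruner, but the shape is the same:

```python
from html.parser import HTMLParser

# Stdlib sketch: drop <script>/<style>/<nav>/<footer> subtrees so the
# model only sees content-bearing markup. Illustrative, not production.
STRIP = {"script", "style", "nav", "footer", "noscript"}

class Normalizer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip = 0   # nesting depth inside stripped subtrees
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in STRIP:
            self.skip += 1
        elif self.skip == 0:
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in STRIP:
            self.skip = max(0, self.skip - 1)
        elif self.skip == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip == 0 and data.strip():
            self.out.append(data.strip())

def normalize(html_page: str) -> str:
    parser = Normalizer()
    parser.feed(html_page)
    return " ".join(parser.out)

page = '<div><script>track()</script><nav>Home</nav><h1>Acme Corp</h1></div>'
print(normalize(page))  # <div> <h1> Acme Corp </h1> </div>
```

The payoff is fewer input tokens per call and a cleaner distribution for the extraction model, with the noise-filtering failure modes monitored separately upstream.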
Local Inference: The GGUF Path to Sovereign Extraction
The Q4_K_M GGUF (~1.1 GB) runs entirely on your own hardware. One command gives you a dedicated extraction endpoint:
```shell
# Serve locally on an M-series Mac
mlx_lm.server --model scrapegraphai/sgai-qwen3-1.7b-gguf --port 8080
```
You now have a dedicated extraction API behind your firewall. No data leaves your infrastructure — a non-negotiable requirement for pipelines touching PII, competitive intelligence, or compliance-sensitive content. The cost model shifts from per-API-call to fixed hardware amortization.
On an M1 Pro: ~40 tokens/second for generation. Processing 1,000 pages with 200-token outputs takes roughly 80 minutes of sequential inference — fast enough for overnight batch runs, not for real-time bulk processing. At 100k extractions/month the API savings justify a dedicated Mac Mini in weeks. The architecture also changes: instead of a fragile dependency on an external API's availability, cost, and rate limits, you deploy a container with the GGUF file and an inference server — a private utility as reliable as your own infrastructure.
What Q4_K_M actually does. Q4_K_M is a mixed-precision scheme from the k-quant family: most weight blocks are quantized to 4 bits, but sensitive tensors retain 6-bit (Q6_K) resolution, including half of the attention value projections (`attention.wv`) and half of the feed-forward down-projections (`feed_forward.w2`). For a 1–2B model, this is the correct operating point. Below 4 bits, quantization error accumulates across layers faster than the model can compensate; the residual stream loses fidelity before the output projection recovers it. At Q4_K_M, perplexity degradation versus BF16 is typically under 0.3 points, acceptable for extraction tasks where you're not generating novel prose. The 1.1 GB file fits entirely in unified memory on Apple Silicon, and a cold start from NVMe takes under two seconds.
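A quick sanity check supports the ~1.1 GB figure, assuming an effective average of roughly 4.8–5.0 bits per weight for Q4_K_M (an assumption; the exact figure varies with the tensor mix):

```python
# Back-of-envelope size check for the Q4_K_M file. The bits-per-weight
# value is an assumed effective average across mixed 4-bit/6-bit tensors.
params = 1.7e9
bits_per_weight = 4.85
size_gb = params * bits_per_weight / 8 / 1e9
print(round(size_gb, 2))  # ~1.03, consistent with the ~1.1 GB file
```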
The deployment pattern that works. Run `mlx_lm.server` as a sidecar service on `localhost:8080`, isolated from the Rust pipeline process. This separates model memory from pipeline memory, allows independent restarts, and keeps the inference surface observable. The Rust pipeline calls it over loopback HTTP; the integration surface is identical to a cloud API, so swapping is one environment variable. The model file is a versioned artifact you pin; there is no model drift from provider-side updates. On a 16 GB M1 Pro, the 1.1 GB model leaves ~13 GB for co-deployed models (an embedding model at Q8 is ~300 MB), the OS, and the Rust pipeline, with no memory pressure.
The cost cliff for extraction workloads. Web extraction has an input token problem that rarely appears in standard LLM cost estimates. Complex product listings and SPA content can balloon to 15,000–30,000 tokens; most extraction API pricing tiers assume 3,000–5,000 token inputs, billing overages at 1.5–3× the base rate, often silently. The effective all-in cost per extraction for a mid-sized page sits at $0.018–$0.045. At $0.02 average: $6,000/month at 300k extractions, $60,000/month at 3M. A further hidden cost: extraction APIs have 2–5% failure rates from timeouts and quota errors — automatic retries double the cost of failed calls, adding $120–$300/month in untracked spend at 300k/month volume.
The local GGUF eliminates the per-token variable entirely. The dominant local cost is ops overhead: roughly 0.05–0.15 FTE to monitor the server, handle restarts, and update model checkpoints. Fully loaded (hardware depreciation + electricity + 0.1 FTE at $120k/year burdened), monthly local cost is approximately $1,063. Break-even against $0.02/extraction: 53,150 extractions/month fully loaded, or ~3,150/month if inference management absorbs into an existing infrastructure team's scope. At 300k/month the hardware CapEx pays back in under one week of avoided cloud spend. The accuracy advantage compounds this: at 94.6% vs. 91.2% F1 across 300k extractions, the local model produces roughly 10,200 fewer incorrect extractions per month — at $0.10/correction in downstream rework, that is $1,020/month in avoided cost that never appears in an API pricing comparison.
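The break-even arithmetic above is easy to reproduce. All dollar figures are the article's own estimates; the $63/month absorbed-ops cost is inferred from the ~3,150 break-even:

```python
# Reproducing the break-even arithmetic from the article's estimates.
cloud_cost_per_extraction = 0.02      # USD, blended average
local_monthly_fully_loaded = 1063     # hardware + power + 0.1 FTE
local_monthly_absorbed = 63           # hardware + power only (assumed)

breakeven_full = local_monthly_fully_loaded / cloud_cost_per_extraction
breakeven_absorbed = local_monthly_absorbed / cloud_cost_per_extraction
print(round(breakeven_full), round(breakeven_absorbed))  # 53150 3150

# Accuracy delta at 300k extractions/month (94.6% vs 91.2% field F1)
fewer_errors = 300_000 * (0.946 - 0.912)
print(round(fewer_errors))  # 10200 fewer incorrect extractions
```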
Compliance at inference time. Under GDPR Chapter 5 (Articles 44–49), any transfer of personal data to a third country requires an appropriate safeguard — an adequacy decision, standard contractual clauses, or binding corporate rules. Raw HTML is not neutral — it routinely contains names, email addresses, job titles, and behavioral metadata embedded in schema markup and analytics scripts. Sending HTML pages to a cloud API hosted outside the EU/EEA for extraction almost certainly constitutes a personal data transfer under Chapter V. A GGUF model running on-premise eliminates this vector entirely: no transfer occurs, no Data Processing Agreement is required for the inference step, and the legal exposure simply does not exist. Under DORA (effective January 2025), financial entities must map ICT third-party dependencies to their register of information; a local model is an internal asset, not a third-party dependency. Under HIPAA, sending HTML that may contain PHI to a cloud API without a signed BAA is a potential impermissible disclosure; local inference eliminates the third-party disclosure risk, removing the need for a BAA at the inference step. For any pipeline processing sensitive content at scale, on-premise inference is not a performance optimization — it is a compliance control.
Production: Lead Enrichment at Scale
The real test of any model is how it fits into a production system. The agentic lead generation platform at agenticleadgen.xyz — source on GitHub — uses a hybrid architecture combining traditional scrapers and LLM-based extraction.
The pipeline's fast path uses Rust's scraper crate for zero-allocation DOM parsing, a custom finite-state machine for email extraction at ~100 pages/second, and JSON-LD / Open Graph parsers for structured metadata. Rule-based scrapers handle high-throughput predictable tasks; the sgai-qwen3-1.7b GGUF handles the variable, schema-driven tasks that rule-based systems cannot. The custom NER model (92.3% F1) and the LLM extraction step share the same M1 infrastructure — no additional API spend.
The pipeline's most consequential routing decision is which pages go to the GGUF model. "About Us" pages go to the model, and the reason illustrates why fine-tuned LLM extraction is necessary, not merely convenient. "About Us" pages are the richest firmographic data source on the open web and the most structurally anarchic. E-commerce product pages conform to schema.org markup. Job listings follow JobPosting structured data. "About Us" pages conform to nothing: they are editorial artifacts written across different CMSes, by different people, over years of company history. The founding year might appear as prose ("we've been building since the spring of 2011"), as JSON-LD `foundingDate`, as an `<h2>Our Story</h2>` timeline, or not at all. You cannot write a CSS selector for "the number representing how many people work here."
The fields that drive B2B sales prioritization are precisely the ones requiring semantic understanding: founding year (determines company stage and budget cycle position), headcount range (primary ICP filter), technology stack signals embedded in marketing copy, and funding status that appears on company pages before any database picks it up. A rule-based scraper can locate a DOM node; it cannot understand that "Series A from Sequoia" normalizes to {stage: "Series A", lead_investor: "Sequoia Capital"}.
The NER validation layer is not an optional quality gate; it is the difference between a lead database and a hallucination database. When an "About Us" page doesn't mention a founding year, the model doesn't output null; it outputs a year that looks correct, drawn from contextual priors. The NER validator's job is span-level grounding: verifying that the extracted value actually appears in the source HTML. Confidence scores then drive sequence logic: high-confidence extractions enter automated personalization, while low-confidence fields are routed to human review queues rather than auto-populating the CRM.
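The core of span-level grounding fits in one function: trust an extracted value only if it literally appears in the source. A real validator would also handle normalized forms ("2,011" vs "2011") and fuzzy spans; this sketch shows only the idea:

```python
def grounded(value: str, source_html: str) -> bool:
    """Trust a field only if its value literally appears in the page."""
    return value.strip().lower() in source_html.lower()

source = "<p>Founded in 2011, Acme employs 40 people in Berlin.</p>"
print(grounded("2011", source))  # True  -> auto-populate the CRM field
print(grounded("2009", source))  # False -> route to human review
```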
At up to 23,000 enriched profiles per day on a single M1 (combining rule-based fast-path and LLM extraction via hybrid routing), the constraint shifts from research capacity to qualification logic. Commercial databases (Apollo.io, ZoomInfo, Clearbit) offer point-in-time snapshots from their last crawl — often 12–18 months stale for long-tail domains. A custom extraction pipeline gives you freshness (live web, not database snapshots), coverage (any company with a web presence), and delta signals: re-crawl weekly, compare structured outputs over time, and surface companies in motion before any database vendor does.
Decision Framework: When to Use Which Tool
Use sgai-qwen3-1.7b (local GGUF) when:
- Volume is high (>10k extractions/month) — a research firm monitoring prices across 10,000 e-commerce sites runs ~300k extractions/month. At cloud API pricing ($0.02 average) that's ~$6,000/month; the same workload on a Mac Mini with the GGUF runs on fixed hardware cost, amortized within weeks.
- Data sensitivity is high (PII, competitive intelligence) — financial institutions, healthcare providers, and legal teams often cannot send content to third-party APIs. On-premise GGUF inference satisfies data residency requirements with no pipeline changes.
- Domain is narrow — filtering the 100k dataset to a single vertical (e-commerce, job boards, real estate) and fine-tuning further takes hours on a consumer GPU and typically pushes accuracy above 96%.
- Real-time enrichment — a CRM pulling structured company data from a prospect's website during a sales call needs latency under 2 seconds, zero API spend, and no external dependency in the critical path.
Use a frontier API when:
- Volume is low and sporadic — setup cost of a local model isn't justified.
- Schemas change constantly — need zero-shot generalization from a description.
- You lack ML/Ops bandwidth to manage model deployment and quantization.
- Pipeline already works — don't rewrite what clears your accuracy thresholds.
Use traditional scrapers (Scrapy, Rust scraper) when:
- Website structure is simple and stable.
- Throughput and latency are paramount (thousands of pages per second).
- Extraction logic is trivial (e.g., get all `href` attributes).
The sweet spot for sgai-qwen3-1.7b is the middle ground: tasks too variable for rule-based scrapers, too expensive and sensitive for constant API calls.
Getting Started
Install the ScrapeGraphAI library (GitHub) and point it at the local model server:
```shell
pip install scrapegraphai
mlx_lm.server --model scrapegraphai/sgai-qwen3-1.7b-gguf --port 8080
```

```python
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "scrapegraphai/sgai-qwen3-1.7b",
        "base_url": "http://localhost:8080",
    },
    "verbose": True,
    "headless": True,
}

graph = SmartScraperGraph(
    prompt="Extract company name, founding year, and employee count",
    source="https://example.com/about",
    config=graph_config,
)
print(graph.run())
```
Further Fine-Tuning: Vertical Specialization
The 1.7B parameter count makes domain fine-tuning feasible on a single consumer GPU in a few hours. Filter the dataset to your vertical, then run SFT with Axolotl or TRL:
```python
from datasets import load_dataset

ds = load_dataset("scrapegraphai/scrapegraphai-100k")

# Filter to your vertical
ecomm = ds["train"].filter(
    lambda x: x["response_is_valid"]
    and "product" in x["prompt"].lower()
    and any(d in x["source"] for d in ["amazon.com", "bestbuy.com", "newegg.com"])
)
# Feed to Axolotl or TRL against Qwen3-1.7B for a vertical-specialist model
```
Filter by `schema_complexity_score` to keep only the cleanest examples, or by `llm_model` to train exclusively on outputs from specific teacher models.
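Those two filters look like this in miniature (field names follow the dataset card; the sample rows, the complexity threshold, and the teacher allowlist are illustrative):

```python
# Miniature version of the two curation filters. Rows, threshold, and
# teacher allowlist are made up for illustration.
rows = [
    {"schema_complexity_score": 2, "llm_model": "gpt-4o-mini"},
    {"schema_complexity_score": 8, "llm_model": "gpt-4o-mini"},
    {"schema_complexity_score": 1, "llm_model": "weak-model"},
]
STRONG_TEACHERS = {"gpt-4o-mini", "gpt-4o"}

curated = [
    r for r in rows
    if r["schema_complexity_score"] <= 3     # cleanest schemas only
    and r["llm_model"] in STRONG_TEACHERS    # targeted distillation
]
print(len(curated))  # 1
```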
Ten Perspectives
Communications
Imagine you hired a research assistant whose only job, every single day for months, was to read websites and fill out forms — pulling out product names, prices, addresses, whatever you needed, returned in a clean organized list. No small talk, no essay writing, no coding help. Just that one task, repeated thousands of times, until they knew it the way a seasoned professional knows their trade: by instinct, by pattern, by reflex.
That is essentially what ScrapeGraphAI did when building their Qwen3-1.7B model. They took an existing small language model and trained it exclusively on real examples of a web-scraping library doing its job — here is a webpage, here is the question, here is the correctly structured answer — until that narrow specialty was carved deep into the system. The breadth was the sacrifice. The depth was the point.
And the depth paid off. This model has 1.7 billion parameters — think of parameters as the adjustable dials inside the system. The cloud AI services it was tested against run on server farms the size of warehouses, with parameter counts that dwarf it by orders of magnitude. Yet in head-to-head testing on a standard benchmark for extracting structured data from websites, this small specialist outperformed those sprawling giants by 3.4 percentage points, scoring 94.6% against their 91.2%.
That gap matters because it runs against everything the last decade of AI has seemed to promise. Bigger, we were told, is almost always better. What this result quietly suggests is that the advantage of scale has a ceiling — and that ceiling is reached faster than expected when a smaller model has been trained obsessively on the exact problem at hand. Focused repetition, it turns out, is its own kind of intelligence.
The final detail makes it all the more striking: you can download this model for free and run it on a regular laptop. No subscription, no API fees, no data sent to a distant server farm. The research assistant, having spent months mastering one job, now works for anyone willing to ask.
Computer Science
Writing a function that maps raw HTML to structured JSON is tractable for a narrow schema — you parse the DOM, locate known selectors, extract values. The problem is that the web doesn't cooperate: selectors drift, layouts vary, and the same semantic field appears in a dozen different structural positions depending on the site. ScrapeGraphAI's Qwen3-1.7B sidesteps the brittle-rule problem by learning the mapping directly from data rather than encoding it by hand.
The model has 1.7 billion parameters and was fine-tuned on 25,200 real scraping trajectories — not synthetically generated inputs, but production runs from an actively used extraction library. That provenance matters: the training distribution reflects the exact noise and variability a deployed scraper encounters, which is a different thing from a curated benchmark dataset. The team started with 93,700 trajectories and discarded roughly 70% of them, keeping only runs where the scraper succeeded. Filtering on outcome rather than on input quality is a deliberate choice: it biases the training signal toward cases where the correct extraction is unambiguous.
Supervised fine-tuning on those 25,200 examples proceeds by standard cross-entropy loss over the target JSON token sequence. What the model is learning is, effectively, a context-sensitive selector: given the full HTML input, which spans of text correspond to which schema fields, and in what format should they be emitted. The transformer's attention mechanism handles the long-range dependencies — linking a price buried in a `<span>` to the product schema declared three hundred tokens earlier in the prompt — without any explicit DOM traversal logic.
The deployment artifact is a Q4_K_M GGUF quantization, which compresses the 1.7B parameters from 16-bit floats down to roughly 4 bits each, bringing the model to about 1.1 GB on disk. That is small enough to run locally on Apple Silicon without a discrete GPU. On the SWDE benchmark, it scores 94.6% F1, compared to 91.2% for cloud APIs backed by far larger general-purpose models.
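The 1.1 GB figure is consistent with back-of-envelope arithmetic. Q4_K_M averages roughly 4.85 bits per weight once per-block scales are counted; that is a commonly reported figure for the format, not a spec value, so treat this as a sanity check rather than a derivation.

```python
# Back-of-envelope check of the Q4_K_M file size.
# bits_per_weight is an approximation (~4.85 for Q4_K_M including
# per-block scale factors), not an exact specification value.
params = 1.7e9
bits_per_weight = 4.85
size_gb = params * bits_per_weight / 8 / 1e9
print(f"{size_gb:.2f} GB")  # weights alone land near 1 GB
```

Higher-precision embedding tables and file metadata account for the remaining distance to the ~1.1 GB on-disk figure.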
The gap is not a fluke of benchmark selection. A general model trained on the open web must allocate capacity across an enormous range of tasks; a fine-tuned model trained on one narrow distribution can pack its entire parameter budget into representing that distribution well. 25,200 clean, in-distribution examples, it turns out, is enough to make that tradeoff decisively in the specialized model's favor.
Machine Learning
The Qwen3-1.7B instruction checkpoint is an interesting base to fine-tune for structured extraction: small enough to run on commodity hardware, large enough to parse complex HTML with reasonable context, and pre-trained on code-heavy corpora that partially overlap with the tokenization patterns of real-world markup. The question was never whether SFT could specialize it — that much is routine — but whether the resulting narrow distribution could be made tight enough to outperform generalist models that are orders of magnitude larger.
The dataset construction is where the real engineering happened. Of 93,700 collected scraping trajectories, 70% were discarded. Only 25,200 examples where the scraper provably succeeded were retained. This is an aggressive quality filter, and it matters for SFT in a specific way: cross-entropy loss will happily overfit to noisy labels, and in structured-output tasks where the target is JSON, even a handful of malformed trajectories can introduce systematic errors in bracket placement, key naming, and type coercion. A 70% discard rate implies the team was treating trajectory quality as a hard constraint rather than a regularization problem — the right call when your output space is well-defined.
The training objective is standard SFT with no RLHF or DPO. For this task that is a deliberate simplification, not an oversight. Preference optimization is most valuable when the reward signal is ambiguous — when "better" depends on human judgment that cannot be captured by a reference answer. HTML-to-JSON extraction has an unambiguous ground truth: either the output matches the schema and the values are correct, or it does not. The cross-entropy loss over clean trajectories is already an excellent proxy for the downstream metric. Adding a preference stage would have introduced complexity without addressing the bottleneck.
What compounds these choices is distributional alignment at every layer. The base model's pre-training included structured text and code, so the fine-tuning is adjusting an already-relevant prior rather than fighting against one. The training trajectories are real scraping runs — not synthetic augmentations or paraphrased rewrites — so the input distribution at training time matches deployment. The 94.6% F1 on SWDE against 91.2% for much larger cloud APIs is the signature of a model whose inference-time distribution is almost entirely covered by its training distribution. Generalist models, however large, are performing a kind of zero-shot transfer to the extraction domain; this model is doing no transfer at all.
Quantization to Q4_K_M GGUF — roughly 4 bits per weight, 1.1 GB on disk — closes the deployment story. At 1.7B parameters, post-training quantization at this bit-width typically costs 0.5–1.5 F1 points on structured generation tasks. The fact that the model reaches 94.6% at Q4_K_M rather than at full precision suggests either headroom in the fine-tuned checkpoint, or that the task's token distribution is regular enough that 4-bit weight resolution is sufficient. Either interpretation is favorable: it means the accuracy number is achievable on Apple Silicon without a GPU cluster, which is where most practical web-scraping pipelines actually run.
Cognitive Science
The distinction between knowing where to find an answer and having that answer built into how you think is one of the oldest tensions in cognitive science. A novice chess player scans the board laboriously, retrieving heuristics from working memory on each move. A grandmaster perceives structure — threats, opportunities, endgame shapes — as immediate Gestalt. The difference is not access to information but its location: in-context versus in-weights. That same axis describes, in miniature, what ScrapeGraphAI's fine-tuned Qwen3-1.7B has undergone.
The base model approaches a scraping task the way a novice approaches the board: each generation step consults the prompt for cues, reconstructing procedure from first principles in working memory. The fine-tuned variant has internalized 25,200 real scraping trajectories, compiling that procedural knowledge into the weights themselves. What was effortful and reconstructive becomes fluent and direct — not because the model gained access to new information, but because the locus of that information shifted. This is the chess grandmaster's Gestalt, instantiated in a 1.7B parameter space.
The content of those trajectories matters as much as their quantity. They encode three forms of expertise that novice models must reason through explicitly: noise suppression, semantic block recognition, and format consistency. Schema theory calls these compiled action schemas — chunked, condition-indexed routines that fire without deliberate assembly. A learner who has encountered enough clear exemplars no longer reconstructs the rule; the rule fires on perception. The 25,200 trajectories are not training data in the generic sense but the raw material of expertise compilation.
Which is precisely why the 70% discard rate is not peripheral to the story — it is the mechanism. High selection pressure during acquisition is a well-documented accelerant of robust generalization: learners forced to encode only clear, unambiguous exemplars build cleaner category boundaries. Noisy exemplars contaminate the compiled schema, leaving residual uncertainty that must be resolved at runtime. Discarding 70% of the raw data is not quality control applied after the fact; it is curriculum design that determines what kind of knowledge gets consolidated in the first place.
The final move — using a teacher LLM rather than human annotation to generate labels — closes the narrative loop. This is behavioral distillation: the student learns not facts but decision procedures, compressed from a larger model's implicit expertise. Classical transfer learning moves representations across domains; here the transfer is vertical, carrying competence downward across capacity levels within the same domain. The constrained output space — structured JSON conforming to a known schema — limits how far generalization must reach, which is why the compression holds. What the grandmaster perceived as immediate Gestalt, the student has learned to see by watching someone who already did.
Statistics
The evaluation leads with two headline figures: 94.6% field-level F1 on SWDE and an 89.3% schema accuracy on SyntheticWebExtract against an 87.1% baseline. Both numbers are real, both are reproducible under the reported conditions — and both are maximally favorable to the model precisely because of the conditions chosen.
SWDE is the natural place to begin, because the benchmark's limitations propagate into every interpretation that follows. Its 124,000 web pages span 80 websites and 8 domains, all crawled in 2011 from static HTML. Today, 60–80% of web content is JavaScript-rendered; SPA architectures did not exist in the corpus. A 94.6% F1 on well-formed, deterministic markup does not operationalize what "web extraction" means in 2026. This is a construct validity problem: the benchmark measures something adjacent to the target construct, and the gap between the two is exactly the gap the model would be expected to struggle with in production.
The SyntheticWebExtract comparison appears more controlled but introduces a different class of problem. The 2.2 percentage-point advantage over the cloud API baseline carries no reported confidence intervals and no disclosed variance across test instances. Without those, we cannot conduct a two-proportion z-test, cannot bound the standard error on the gap, and cannot rule out that the result is within measurement noise. The baseline compounds the uncertainty further: the cloud API prompt configuration is undisclosed, and extraction precision and recall are jointly sensitive to how the target schema is specified. An unconstrained baseline is not a controlled one; the comparison absorbs that ambiguity and reports a single point estimate as though it were decisive.
Even setting those issues aside, the train/test split collapses the argument from generalization. The 2,808 test examples and the 25,200 training examples were drawn from the same crawl pipeline. When the source distribution is shared, overlap is near-certain, and reported accuracy confounds generalization with interpolation. What looks like out-of-sample performance may be, in part, a measure of how faithfully the model memorized structural patterns specific to that crawl — patterns that will not transfer to an unseen site in a different industry, behind a different CDN, or rendered by a different JavaScript framework.
The honest reading of the results is therefore: under conditions optimized for measurable performance — a dated benchmark, an underpowered comparison, and a train/test split that shares a source distribution — the model performs well. That is not a trivial achievement, but it is a substantially weaker claim than a production accuracy figure. What would constitute credible evidence is straightforward to specify: evaluation on held-out sites unseen during training, confidence intervals on all point estimates, a disclosed and reproducible baseline prompt, and a JavaScript-heavy test set drawn from a distribution plausibly representative of the current web. Until those conditions are met, the interesting statistics are the ones that were not reported.
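To make the missing test concrete: if one assumes both systems were scored on the same 2,808-example test split (an assumption; the baseline's sample size is not disclosed), the two-proportion z-test the passage calls for is a few lines.

```python
import math

def two_prop_z(p1, p2, n1, n2):
    """Two-proportion z-test using the pooled standard error."""
    p = (p1 * n1 + p2 * n2) / (n1 + n2)          # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Assumption: both systems evaluated on the same 2,808-example split.
z = two_prop_z(0.893, 0.871, 2808, 2808)
print(round(z, 2))
```

Under that assumption the gap would clear the conventional 1.96 threshold, which is exactly the point: whether the result is noise or signal hinges entirely on sample sizes the evaluation did not report.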
NLP & Deep Learning
ScrapeGraphAI fine-tuned Qwen3-1.7B on 25.2k instruction-formatted trajectory pairs — each a concatenation of system prompt, HTML document, extraction directive, and target schema — using standard causal LM cross-entropy over response tokens only, with the prompt masked from gradient computation. No RLHF, no DPO. That absence is a design decision, not an oversight, and the rest of the architecture makes the reasoning legible.
The choice to start from the Qwen3-1.7B instruct checkpoint rather than a base model is load-bearing. The instruct checkpoint already encodes general instruction-following behavior in its weight space; SFT from this initialization re-weights the parameter manifold toward a narrower input-output subspace — structured HTML-to-JSON extraction — without requiring the model to re-acquire instruction-following from the gradient signal. The task prior gets compressed into weights rather than delivered through in-context demonstrations. This compression is precisely what makes skipping RLHF coherent: when the reward signal is a schema validator and string match rather than a learned human preference model, a KL-regularized policy gradient stage adds training complexity without a corresponding alignment benefit. The objective is well-specified enough that careful data curation does the work a reward model would otherwise do.
That curation takes the form of a 70% discard rate, which functions as a data-side reward model applied before training rather than after. By constructing a loss landscape shaped only by high-quality extraction behavior, the approach trades the stochasticity of online RLHF for a deterministic prior over what correct extraction looks like — a sensible tradeoff when the scoring criterion is crisp and the teacher generations are noisy enough to warrant aggressive filtering. Together, the instruct initialization, the masked SFT objective, and the curated corpus form a consistent argument: constrain the optimization problem tightly enough at data and initialization time, and the training loop itself stays simple.
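The "prompt masked from gradient computation" detail is conventionally implemented by setting prompt positions in the label sequence to the cross-entropy ignore index (-100 in PyTorch and the Hugging Face trainers). A minimal sketch with invented token ids, not the release's actual training code:

```python
IGNORE_INDEX = -100  # PyTorch CrossEntropyLoss ignore_index convention

def build_labels(prompt_ids, response_ids):
    """Mask prompt tokens so loss is computed on response tokens only."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

# Illustrative token ids, not real tokenizer output.
input_ids, labels = build_labels([101, 7592, 2088], [1996, 3231, 102])
print(labels)  # [-100, -100, -100, 1996, 3231, 102]
```

Gradients then flow only through the JSON response, which is what makes the curated targets, rather than the HTML prompts, the sole supervision signal.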
The binding constraint is the teacher LLM ceiling. The 25.2k pairs were generated via distillation from GPT-5-nano, which means the student's learned extraction behavior is bounded by the teacher's systematic biases on this task — errors the schema validator cannot catch, ambiguous HTML structures the teacher resolved inconsistently, edge cases the teacher never surfaced. How those inherited biases interact with aggressive quantization, where structured output generation is already empirically fragile, remains an open question specific to this fine-tune and is the most consequential gap between the design's internal coherence and its real-world ceiling.
Information Theory
HTML-to-JSON extraction is a near-deterministic mapping: the output distribution is constrained to a small, bounded region of token space — JSON conforming to a known schema — so the task entropy is fundamentally low. Low entropy has a precise consequence under minimum description length reasoning. The Kolmogorov complexity of the target function is modest — a compact program exists that describes the transformation — which means the optimal hypothesis class is narrow. The learning problem reduces to covering the structural variance of the input, not to searching a vast hypothesis space. That framing sets the thread that runs through every subsequent design decision.
Sample complexity is the first downstream implication. Because the effective dimensionality of what must be learned is set by the residual input variance, not by the size of the raw domain, 25,200 examples can constitute a genuine sufficiency condition. HTML documents are syntactically diverse but semantically repetitive: the patterns that carry mutual information with the target JSON recur across sites. Once those patterns are represented in the training distribution, additional examples contribute negligible information about the function being approximated, and the MDL-optimal model stops improving. The 25.2k figure is not an arbitrary stopping point — it reflects where the empirical distribution saturates the low-complexity hypothesis class the task requires.
The 70% discard rate is how that hypothesis class is enforced at the data level. Retaining only high-quality, unambiguous extraction pairs is equivalent to maximizing the mutual information between each training example and the target function: noisy or ambiguous pairs reduce I(X;Y) per example, inflate the effective entropy of the training distribution, and push the MDL description length upward. Aggressive filtering concentrates probability mass precisely where the true function lives, trading raw sample count for a cleaner empirical joint distribution — reducing variance in the learned mapping without introducing systematic bias.
Distillation closes the argument. Knowledge distillation from frontier models transfers not parameters but the information content of correct decisions across the training distribution — lossy compression that discards the high-capacity teacher's representational geometry while retaining exactly the bits relevant to the low-entropy task. Because the target function has modest Kolmogorov complexity, a 1.7B student has sufficient capacity to encode it; what distillation provides is a sample-efficient path to the right region of weight space. The SWDE result of 94.6% F1 is the empirical confirmation: when task entropy is low, the residual entropy after conditioning on sufficient context approaches zero, and the F1 ceiling approaches one.
Computational Linguistics
HTML is not a neutral medium. Its tag hierarchies, class names, and DOM proximity encode distributional semantics in markup — weak signals, but systematically exploitable. The central question for any web extraction model is whether to read those signals through span-level extraction, anchoring predictions to attested token positions in the source, or through generation, producing output sequences unconstrained by the input's surface form. That choice is architectural and irreversible: everything downstream follows from it.
ScrapeGraphAI's Qwen3-1.7B resolves it in favor of generation. The model ingests an HTML token stream and emits structured JSON directly, forgoing the per-token labeling that underlies most NER pipelines and competitors like Dripper. The generative paradigm earns its keep on tasks sequence labeling handles badly: reasoning across non-contiguous spans, normalizing date formats, resolving referential descriptions that no single tag boundary delimits cleanly. These are exactly the cases where a span-grounded model stalls, because the signal is compositional rather than local. The tradeoff is that generation severs the formal link between input positions and output tokens — a link that extractive models get for free.
That severance defines the fine-tuning burden. A sequence labeling model can lean on structural markers even when they're noisy; a generative model must internalize noise suppression and semantic block recognition simultaneously, learning that `<div class="pdp-price">` is semantically loaded whether the class name is legible or randomized by an obfuscation layer. When DOM structure is systematically detached from meaning, the generative model has no extractive fallback — it must reconstruct semantics from positional and distributional cues alone, the same cues it would use if the markup weren't there at all. Sequence labeling degrades more gracefully here, because its predictions remain anchored to whatever structure survives.
Schema-constrained decoding is a partial remedy on the generation side, forcing output token distributions to respect a predefined JSON schema at each decoding step — it recovers format consistency but cannot recover grounding. A generative model will produce plausible field values even when the ground truth is absent from the page; the schema says nothing about whether the emitted span exists in the source. This is precisely where an NER-style validator adds value: not as a redundant check, but as a reinjection of the extractive constraint that generation abandoned. On SWDE, field-level F1 reflects the underlying tension directly — generative models lead on normalized and compositionally complex fields, sequence labeling retains an edge wherever structural markers remain reliable enough to anchor a label.
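The reinjected extractive constraint can be sketched as a post-hoc grounding check: flag any emitted string value that never appears verbatim in the page text. The helper below is a toy, not the library's validator; normalized fields such as dates would need fuzzier matching than substring tests.

```python
import re

def strip_tags(html: str) -> str:
    """Crude tag stripper for a grounding check; a real pipeline
    would use an HTML parser instead of a regex."""
    return re.sub(r"<[^>]+>", " ", html)

def ungrounded_fields(extracted: dict, html: str) -> list:
    """Return field names whose string values never occur verbatim
    in the page text -- candidate hallucinations."""
    text = strip_tags(html)
    return [
        k for k, v in extracted.items()
        if isinstance(v, str) and v not in text
    ]

html = '<div class="pdp-price">$19.99</div><h1>Acme Widget</h1>'
out = {"name": "Acme Widget", "price": "$19.99", "brand": "Acme Corp"}
print(ungrounded_fields(out, html))  # ['brand'] is not on the page
```

A check like this catches exactly the failure mode schema-constrained decoding cannot: output that is structurally valid but textually unsupported.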
Economics
Structured data extraction has been a quiet rent-extraction business. Cloud APIs priced at $0.018–$0.045 per call sustained those margins through information asymmetry: buyers could not easily benchmark proprietary models against alternatives, the vendor controlled both measurement and billing, and switching costs compounded with every integration. The market equilibrium was stable — not because the technology justified it, but because no credible outside option existed.
That outside option now exists. A 1.7B parameter local model scoring 94.6% F1 against cloud baselines of 91.2% from models hundreds to over a thousand times larger collapses the quality-based moat that justified cloud pricing. More consequentially, running evaluations locally at zero marginal cost dissolves the principal-agent problem at the heart of API billing. When buyers can generate ground-truth benchmarks on their own data, information asymmetry disappears and the vendor's pricing power goes with it.
The cost arithmetic is unambiguous. Beyond approximately 53,000 extractions per month, local inference is strictly dominant on price — and that threshold falls well within the operational range of any serious data pipeline. Open-sourcing the training dataset under Apache 2.0 removes the remaining friction: switching costs drop to zero, and every downstream fine-tune or benchmark strengthens the commons without requiring central coordination. The marginal cost of distributing model weights approaches zero, so aggressive open-sourcing is not generosity — it is the rational strategy for a firm seeking adoption at scale.
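The break-even arithmetic behind that threshold is a one-liner. Only the per-call price comes from the text; the monthly local cost (amortized hardware plus power) is an illustrative assumption chosen to show the shape of the calculation.

```python
# Break-even between per-call cloud pricing and fixed-cost local inference.
api_price = 0.018            # $/call, low end of the quoted range
monthly_local_cost = 954.0   # $/month, ASSUMED amortized hardware + power

break_even = monthly_local_cost / api_price
print(f"{break_even:,.0f} extractions/month")
```

At the high end of the quoted range ($0.045/call) the same fixed cost breaks even at roughly 21,000 extractions per month, which is why the threshold falls inside the operating range of most production pipelines.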
The flywheel that follows is the durable competitive advantage. Library adoption generates real scraping trajectories; trajectories train superior models; superior models drive further adoption. ScrapeGraphAI captures network effects without charging for them directly — a two-sided market dynamic that compounds quietly while incumbents defend a pricing structure the market has already decided to abandon. The equilibrium shift is not coming. It has arrived.
AI Research
The ScrapeGraphAI Qwen3-1.7B release is worth examining not as a product announcement but as a data point on a question that matters to every lab working below the 7B threshold: how much task specialization is achievable through supervised fine-tuning alone, and what does the answer reveal about the conditions under which SFT is sufficient?
The method is SFT on an instruct checkpoint — not pretraining from scratch, not reinforcement learning. This framing matters. Instruction fine-tuning has already shaped the model's weight geometry toward instruction-following manifolds; SFT here is manifold re-weighting, not capability acquisition. The target task has near-deterministic output structure — low task entropy — which means the function to be learned has a small effective support. 25,200 training examples is plausibly sufficient for saturation under those conditions. The 70% discard rate via character-length filtering is doing implicit curriculum selection: it is the data-side equivalent of a reward model, approximating what PPO or DPO would otherwise approximate online. These three design choices — instruct base, low-entropy target, difficulty-weighted filtering — are mutually reinforcing, and understanding them as a system is more informative than evaluating any one in isolation.
The distillation approach and the decision to skip RLHF follow the same logic. What transfers from frontier teacher models is behavioral — a functional mapping from HTML inputs to JSON outputs — not representational geometry. The fine-tuning targets were regenerated by GPT-5-nano across all filtered samples (the raw 93.7k dataset was collected from 296 different LLMs, but the fine-tuning labels come from a single teacher). The distillation works across the tokenizer mismatch because the transfer happens at the semantic level of input-output pairs, not at the token-representation level. Skipping preference optimization is defensible on information-theoretic grounds for exactly the same reason the data filtering works: when reward is verifiable, filtering is strictly superior to learned preference models. JSON schema validation plus string match is a deterministic oracle. You do not need ranked preference pairs when a rule-based validator exists — this is the same logic that makes unit test coverage a sufficient training signal for code generation.
The validity threats are real, and they share a common structure: each one represents a gap between what the training pipeline can verify and what the deployment environment actually requires. SWDE is 2011 static HTML — zero SPAs, roughly fifteen years of distributional drift from the modern web. The train/test split from a single crawl pipeline creates non-trivial memorization risk on SyntheticWebExtract. Most structurally important is the teacher ceiling: `response_is_valid=True` validates parseable JSON, not factually correct JSON. Systematic hallucination in the teacher propagates undetected into the student, and schema conformance cannot surface it. SPAs and adversarial HTML extend the same gap further — extraction failures that a deterministic validator cannot detect at training time.
That gap is where the methodological frontier lies. The distillation flywheel demonstrated here is effective precisely because the oracle is rule-based. Extending it to tasks where correctness cannot be determined by schema conformance — where extraction accuracy must be evaluated against a grounded knowledge base, not just structural validity — requires replacing the rule-based reward with a learned one. That is the step that converts behavioral distillation into something closer to factual grounding, and it is also the step that reintroduces the preference optimization this release deliberately avoided. The question this release answers cleanly is how far you can go without it. The answer is: further than expected, but not past the boundary of what a deterministic validator can see.
FAQ
Q: What is ScrapeGraphAI?
A: An open-source Python library (scrapegraphai.com) that uses LLMs to build web scraping pipelines via natural language prompts. The sgai-qwen3-1.7b model is fine-tuned specifically for its extraction tasks.
Q: How does the 1.7B model beat larger cloud APIs?
A: Fine-tuning on 25.2k real scraping trajectories teaches the model one constrained task — HTML in, JSON out — densely covering a problem space where larger generalists must spread capacity across every task.
Q: What hardware does local inference require?
A: The Q4_K_M GGUF (~1.1 GB) runs on any Apple Silicon Mac or consumer GPU (RTX 3090). On an M1 Pro expect ~40 tokens/second generation; 1,000 pages with 200-token outputs take roughly 80 minutes sequentially.
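The throughput estimate follows directly from those figures; a quick check of the arithmetic:

```python
# Sequential throughput estimate from the FAQ figures above
# (1,000 pages, 200-token outputs, ~40 tokens/second).
pages = 1000
tokens_per_page = 200
tokens_per_second = 40
minutes = pages * tokens_per_page / tokens_per_second / 60
print(f"{minutes:.0f} minutes")  # ~83 minutes sequentially
```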
Q: When should I fine-tune further on my own data?
A: When your domain has consistent HTML structure (e.g., a single vertical like e-commerce or job boards). Filter the 100k dataset by domain and `schema_complexity_score`, then run SFT on the filtered subset for an additional accuracy lift.
Q: Is the model free to use?
A: Yes — the model weights and dataset are Apache 2.0 licensed; the ScrapeGraphAI library itself is MIT licensed. All are fully open.
