Fine-Tune Qwen3 with LoRA for AI Cold Email Outreach
An AI cold email engine does one thing: it reads what you know about a company and writes a personalized outreach email — automatically, at scale. If you've ever spent an afternoon manually tweaking 50 nearly-identical emails, you understand the problem. If you've paid for Instantly, Smartlead, or Apollo, you've already solved it — just not on your own terms.
Those SaaS tools charge $30-200/month, send your prospect list to their servers, and give you a black-box model you can't touch. You can't train it on your best-performing emails. You can't add custom quality gates. You can't run it offline. For engineers and technical founders, that's a bad deal.
This system is the alternative: a locally-run pipeline where you own every layer — model weights, scoring logic, and approval gates. The core is Qwen3-1.7B, fine-tuned with LoRA adapters on MLX (Apple's framework for M1/M2 Metal acceleration). A Rust orchestration layer drives the full batch loop: pulling company records, invoking the model, running quality filters, and surfacing emails for human review before anything sends.
The result is not a toy. On a single M1 MacBook Pro, the pipeline generates 200+ personalized emails per batch in under 10 seconds — no GPU cloud, no API latency, no per-email cost. Fine-tuning converges in under 30 minutes on the same machine.
Why Local Inference for Cold Email Outreach
Imagine you're writing a personalized sales email for each of 500 leads. Every email needs the contact's name, their company, what they do, and why your product fits. Now imagine paying a small fee every time you write one of those emails, waiting half a second for each response, and handing all of that data — your entire sales pipeline — to a company whose servers you don't control. That's what happens when you call GPT-5.4 or Claude for every email draft.
Cloud APIs charge per token (roughly per word). At $0.01 to $0.03 per email, a 500-contact batch costs $5 to $15. Do that daily and you're paying hundreds of dollars a month for text generation alone. Latency stacks up too: 400-800ms per request means a 500-email batch takes 3-7 minutes just waiting for the network.
Local inference cuts all three costs simultaneously. The model runs on your hardware — an M1 Mac, a rented GPU, a VPS. Requests never leave the machine, so latency drops to ~50ms (memory bandwidth, not network round-trips). Cost per inference is zero after the hardware is amortized. And your leads list stays private.
The fourth advantage is customization. Cloud APIs are fixed. A locally-hosted model can be fine-tuned with LoRA on your actual sent emails, learning your tone, your ICP framing, your offer language — something prompt engineering alone cannot replicate.
| Metric | Cloud API (GPT-5.4) | Local (Qwen3 + LoRA) |
|---|---|---|
| Latency per email | 400-800ms | ~50ms |
| Cost per email | $0.01-0.03 | $0 (hardware amortized) |
| Data privacy | Sent to third party | Never leaves machine |
| Customization | Prompt engineering only | Full weight adaptation |
| Offline capable | No | Yes |
The trade-off is upfront engineering time. If you're sending a handful of emails, use an API. If you're running a pipeline that generates hundreds per batch, owning the model pays for itself within weeks.
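The "pays for itself" arithmetic is short enough to write down. A quick sketch using the per-email figures quoted above (the volume and rates are this section's illustrative numbers, not measurements):

```python
def monthly_api_cost(emails_per_day: int, cost_per_email: float, days: int = 30) -> float:
    """Recurring API spend for a steady daily send volume."""
    return emails_per_day * cost_per_email * days

# 500 emails/day at the quoted $0.01-0.03 per-email range:
low = monthly_api_cost(500, 0.01)    # ~$150/month
high = monthly_api_cost(500, 0.03)   # ~$450/month
```

Against a machine you already own, that recurring spend is the entire saving; the upfront engineering time is the only real cost.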
Architecture: Three Models, Three Jobs
The obvious question: why not just call GPT-5.4 for everything?
Three reasons. Latency — a cloud round-trip per company adds seconds to a pipeline that processes thousands of records. Cost — at scale, API fees compound fast. Privacy — company data and contact details stay local, never leave the machine.
The less obvious answer: specialization beats generalism at small scale. A 1.7B model fine-tuned for one narrow task outperforms a 70B model asked to do everything. Smaller means faster inference, smaller memory footprint, and the ability to run two models simultaneously on the same M1 chip.
| Model | Port | Task |
|---|---|---|
| Qwen2.5-3B-Instruct-4bit | 8080 | Company classification (category, AI tier) |
| Qwen3-1.7B-4bit + LoRA | 8080 | Email drafting (subject + body + personalization score) |
| sgai-qwen3-1.7b-gguf | 8081 | HTML to structured data extraction |
Classification needs the most reasoning capacity — it must weigh company signals and assign a confidence-weighted AI tier — so it gets the 3B model. Email drafting and HTML extraction are structurally simpler (follow-a-pattern tasks), so 1.7B with a LoRA adapter is sufficient.
The serving layer is mlx_lm.server, Apple's MLX inference server. It keeps model weights in unified memory and dispatches directly to Metal GPU with zero-copy transfers — no CPU-to-GPU memcpy, no PCIe bottleneck. The two 1.7B models weigh roughly 1GB each at 4-bit quantization; the 3B model lands near 2GB. Total footprint stays well under 16GB unified memory, leaving the Rust orchestrator and database layer room to breathe.
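The footprint figures follow from simple quantization arithmetic: parameter count times bits per parameter, plus some slack for embeddings and quantization scales. A rough sketch (the 15% overhead factor is an assumption, not a measurement):

```python
def quantized_size_gb(params_billion: float, bits: float, overhead: float = 1.15) -> float:
    """Approximate in-memory size of a quantized model.

    `overhead` loosely covers embeddings, quantization scale tensors, and
    any layers kept at higher precision (assumed factor).
    """
    return params_billion * 1e9 * bits / 8 / 1e9 * overhead

print(quantized_size_gb(1.7, 4))  # ~1 GB for a 1.7B model at 4-bit
print(quantized_size_gb(3.0, 4))  # ~1.7 GB for the 3B classifier
```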
The Rust Orchestration Layer
The email pipeline lives in Rust, not Python. The choice is deliberate: Python is fast to prototype, but production pipelines need predictable memory use, real async concurrency without the GIL, and type safety that catches mistakes at compile time rather than 2 AM on a Monday. Rust gives all three. Tokio's async runtime lets the pipeline saturate the local LLM server with concurrent requests without spawning threads per call, and the type system enforces that a ChatRequest is always well-formed before it ever hits the wire.
The LLM client itself is intentionally generic. It speaks the OpenAI /chat/completions protocol — the same HTTP contract that OpenAI defined and that every major local inference server (llama.cpp, vLLM, mlx-lm) has since adopted as a de facto standard. That means the client below works unchanged whether it is talking to a fine-tuned Qwen model running on Apple Silicon via mlx_lm.server, a quantized Mistral in llama.cpp, or a remote OpenAI endpoint. No vendor lock-in, no abstraction tax.
#[derive(Debug, Serialize)]
struct Message<'a> { role: &'a str, content: &'a str }

#[derive(Debug, Serialize)]
struct ChatRequest<'a> {
    model: &'a str,
    messages: Vec<Message<'a>>,
    #[serde(skip_serializing_if = "Option::is_none")]
    temperature: Option<f32>,
    #[serde(skip_serializing_if = "Option::is_none")]
    max_tokens: Option<u32>,
}

// Response shapes for the OpenAI-compatible /chat/completions reply.
#[derive(Debug, Deserialize)]
struct ChatResponse { choices: Vec<Choice> }
#[derive(Debug, Deserialize)]
struct Choice { message: AssistantMessage }
#[derive(Debug, Deserialize)]
struct AssistantMessage { content: String }

pub async fn chat(
    client: &reqwest::Client,
    base_url: &str,
    api_key: Option<&str>,
    model: &str,
    system: &str,
    user: &str,
    temperature: Option<f32>,
) -> Result<String> {
    let req = ChatRequest {
        model,
        messages: vec![
            Message { role: "system", content: system },
            Message { role: "user", content: user },
        ],
        temperature,
        max_tokens: Some(1024),
    };
    let mut builder = client.post(format!("{base_url}/chat/completions"));
    if let Some(key) = api_key {
        builder = builder.bearer_auth(key);
    }
    let resp: ChatResponse =
        builder.json(&req).send().await?.error_for_status()?.json().await?;
    resp.choices
        .into_iter()
        .next()
        .map(|c| c.message.content)
        .ok_or_else(|| anyhow::anyhow!("LLM returned no choices"))
}
The email drafting function builds on that generic client with a tightly constrained system prompt. The constraint that matters most is JSON-only output: no markdown fences, no preamble, no apology text — a raw JSON object with subject, body, and personalization_score, or nothing. Why JSON-only? Because every downstream consumer — the database writer, the deliverability checker, the campaign scheduler — deserves a parse that either succeeds cleanly or fails loudly. Ambiguous free-text responses silently corrupt pipelines.
pub async fn draft_email(
    client: &reqwest::Client,
    base_url: &str,
    api_key: Option<&str>,
    model: &str,
    contact_name: &str,
    contact_title: &str,
    company_name: &str,
    company_domain: &str,
    tech_stack: &str,
) -> Result<EmailDraft> {
    let system = "You draft B2B outreach emails. \
        Respond ONLY with JSON, no markdown fences.\n\
        Schema: {\"subject\":\"string\",\"body\":\"string\",\
        \"personalization_score\":0.0-1.0}\n\
        Rules:\n\
        - Subject: < 60 chars, no spam triggers, no ALL CAPS\n\
        - Body: 100-250 words, professional but human\n\
        - Opening: personal connection point\n\
        - Value prop: specific to their tech/challenges\n\
        - CTA: single, clear, low-friction (15-min call)\n\
        - No generic flattery. Reference specific tech they use.";
    // Illustrative user-prompt layout; any structured rendering
    // of the contact/company fields works here.
    let user = format!(
        "Contact: {contact_name} ({contact_title})\n\
         Company: {company_name} ({company_domain})\n\
         Tech stack: {tech_stack}"
    );
    let raw = chat(
        client, base_url, api_key, model,
        system, &user, Some(0.7)
    ).await?;
    // Strip fences the model occasionally emits despite the prompt.
    let json_str = raw
        .trim()
        .trim_start_matches("```json")
        .trim_start_matches("```")
        .trim_end_matches("```")
        .trim();
    let draft: EmailDraft = serde_json::from_str(json_str)?;
    Ok(draft)
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EmailDraft {
pub subject: String,
pub body: String,
#[serde(default)]
pub personalization_score: f64,
}
The rules in the system prompt are specific because vague instructions produce vague compliance. A subject line under 60 characters is not a style preference — it is the threshold where major email clients stop truncating. A body between 100 and 250 words targets the sweet spot for cold outreach response rates. One CTA eliminates decision paralysis. Temperature 0.7 adds enough variation to avoid identical emails across a batch while keeping the model focused enough to stay inside the JSON envelope.
The base model respects these constraints roughly 85% of the time — acceptable for experimentation, unacceptable for production. LoRA fine-tuning closes that gap: by training on hundreds of examples that always satisfy the format, the model internalizes the constraints rather than following them when convenient. The result is near-100% JSON parse rate, which means the pipeline runs without a retry loop and without a human babysitting the output queue.
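That parse-rate figure is cheap to measure on any batch of raw model outputs. A minimal sketch (the sample outputs are illustrative, not real model responses):

```python
import json

def json_parse_rate(raw_outputs: list[str]) -> float:
    """Fraction of outputs that parse as a JSON object with the expected keys."""
    ok = 0
    for raw in raw_outputs:
        try:
            obj = json.loads(raw)
            if isinstance(obj, dict) and "subject" in obj and "body" in obj:
                ok += 1
        except json.JSONDecodeError:
            pass
    return ok / len(raw_outputs) if raw_outputs else 0.0

outputs = [
    '{"subject": "Quick question", "body": "...", "personalization_score": 0.8}',
    '```json\n{"subject": "Hi"}\n```',  # fenced output fails the strict parse
]
print(json_parse_rate(outputs))  # 0.5
```

Running this over a few hundred base-model and fine-tuned outputs is how a claim like "85% vs near-100%" gets grounded.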
Quality Gates: Catching Bad Emails Before Humans See Them
Even a fine-tuned language model cannot be trusted to produce send-ready emails without a filter in between. The model has learned patterns from training data, but it has no concept of Gmail's spam classifier, no awareness of your sender reputation, and no understanding of CAN-SPAM. It will occasionally generate a subject line in ALL CAPS, include a phrase like "act now" that triggers spam filters, or produce a body that ends without asking for anything. Left unchecked, these drafts erode your domain reputation one send at a time. Quality gates exist to catch failures that are predictable in category even if they are unpredictable in timing.
The defense runs three layers deep. First, every draft passes through automated structural checks before a human ever reads it. Second, a weighted spam score estimates the probability that Gmail's classifier will route the email to junk. Third, a mandatory approval gate ensures no message reaches a recipient without a human making a deliberate decision.
The outreach orchestrator iterates over target contacts, drafts emails, then runs automated quality validation before presenting anything for human review:
for contact in targets {
    let tech_stack = company_map
        .get(contact.domain.as_str())
        .map(|c| c.tech_stack.join(", "))
        .unwrap_or_default();
    // Fall back to the local part of the address when no name is on file.
    let contact_name = contact.email.split('@').next().unwrap_or("");
    match llm::draft_email(
        &ctx.http, &ctx.llm_base_url,
        ctx.llm_api_key.as_deref(),
        &ctx.llm_model,
        contact_name,
        "", // title unknown at this stage
        &contact.company_name,
        &contact.domain, &tech_stack,
    ).await {
        Ok(draft) => {
            let quality = check_quality(
                &draft.subject, &draft.body
            );
            report.drafts.push(OutreachDraft { /* ... */ });
        }
        Err(e) => report.errors.push(
            format!("{}: {e}", contact.email)
        ),
    }
}
Each check in the quality function maps directly to a deliverability failure mode. Subject length over 60 characters gets truncated by most mobile email clients, breaking the impression before the body loads. A subject where more than half the characters are uppercase matches a pattern that spam classifiers have flagged for years — it signals aggression to filters and readers alike. Spam trigger words like "free," "urgent," and "act now" each carry a 0.15 penalty against the spam score because they appear disproportionately in bulk unsolicited mail and Bayesian filters are trained on exactly those distributions. Body word count matters at both ends: under 100 words reads as a template blast, over 250 words loses most readers before they reach the ask. The CTA check is the simplest signal but arguably the most important — an email without a clear call to action wastes the reader's attention and produces zero pipeline outcomes regardless of how personalized the opening is.
fn check_quality(subject: &str, body: &str) -> QualityChecks {
    let lower_subj = subject.to_lowercase();
    let lower_body = body.to_lowercase();
    let subject_len = subject.chars().count();
    let word_count = body.split_whitespace().count();
    let mut spam = 0.0;
    let spam_triggers = [
        "free", "urgent", "act now", "limited time",
        "winner", "click here", "buy now", "!!!",
    ];
    for trigger in &spam_triggers {
        if lower_subj.contains(trigger) { spam += 0.15; }
    }
    // Mostly-uppercase subject: classic spam pattern.
    if subject.chars()
        .filter(|c| c.is_uppercase()).count()
        > subject_len / 2
    {
        spam += 0.2;
    }
    let has_cta = lower_body.contains("call")
        || lower_body.contains("chat")
        || lower_body.contains("meet")
        || lower_body.contains("schedule");
    QualityChecks {
        subject_length_ok: subject_len > 0
            && subject_len <= 60,
        body_word_count: word_count,
        body_length_ok: (100..=250).contains(&word_count),
        has_cta,
        spam_score: f64::min(spam, 1.0),
    }
}
Then the mandatory approval gate. No email ever gets sent without explicit human confirmation:
eprintln!(
" APPROVAL REQUIRED: Type 'approve' \
to mark ready, or 'reject':"
);
let mut input = String::new();
std::io::stdin().read_line(&mut input)?;
if input.trim().to_lowercase() == "approve" {
OutreachStatus::Approved
} else {
OutreachStatus::Rejected
}
This gate is not optional. CAN-SPAM and GDPR both place legal accountability on the sender, not the model. Beyond compliance, email providers build sender reputation from aggregate engagement signals — bounce rates, spam reports, unsubscribes. A single bad batch from an automated pipeline can suppress your domain's deliverability for weeks and land you on shared blacklists that are expensive to escape. The automation drafts, the human decides. That boundary is where the system earns the right to exist.
Fine-Tuning Qwen3 with MLX LoRA
Think of fine-tuning like teaching a smart assistant your specific writing style. Qwen3-1.7B already knows English, grammar, and the general shape of a business email — that's the "smart assistant" part. Fine-tuning is just you sitting down with it and saying: here are 500 cold emails I've actually sent, here's what a good one looks like, here's what a bad one looks like — now write like me. The model updates its weights to internalize those patterns, not just follow instructions.
That distinction matters. Prompt engineering tells the model what to do on every single call. Fine-tuning bakes the behavior in. For cold outreach this means three concrete wins: consistency (every email follows your format without you describing it each time), no per-call token overhead from multi-paragraph system prompts, and hard-learned constraints like "never exceed 120 words" that the model absorbed from examples rather than being reminded of at inference time. I used Apple's MLX framework for training — it runs natively on Apple Silicon with zero-copy Metal GPU access, making LoRA fine-tuning fast and practical on a MacBook.
Training Data: Real Emails + Synthetic Augmentation
Data comes from two sources. First, real sent emails exported from Neon PostgreSQL and the Resend API, joining the contact_emails, contacts, and companies tables. Quality signals include reply_received (strong positive) and opened (weak positive). Each exported example is paired with the same fixed system prompt used at inference time, so the training and serving distributions match:
SYSTEM_PROMPT = (
"You write B2B outreach emails for Vadim Nicolai, "
"Senior Software Engineer "
"(10+ years: React, TypeScript, AI/ML, Rust, "
"Node.js, GraphQL). "
'Output ONLY valid JSON: '
'{"subject": "...", "body": "..."}'
)
The signal hierarchy matters here. An opened event can be a tracking pixel firing in a corporate mail scanner, a bot, or a preview pane — the recipient may never have read a word. A reply means a human read it, understood it, and chose to respond. Training on reply-correlated examples teaches the model what actually earns attention, not just what clears the spam filter.
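One way to encode that hierarchy at export time is a per-example sample weight. This is a sketch of how the weighting could look; the specific weights are assumptions, not the project's exact export logic:

```python
def sample_weight(reply_received: bool, opened: bool) -> float:
    """Weight training examples by engagement signal strength.

    Replies are the strong positive; opens are weak (pixel fires in
    scanners and preview panes); ignored sends are kept but down-weighted.
    """
    if reply_received:
        return 3.0
    if opened:
        return 1.0
    return 0.5
```

The exact numbers matter less than the ordering: the model should see reply-earning emails several times more often than ignored ones.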
Real data alone, however, is sparse and skewed. You have more initial outreach than follow-ups, more rejections than replies, and coverage gaps for certain industries or personas. That's where synthetic augmentation earns its place. DeepSeek acts as a teacher model, generating plausible emails by combining company profiles, recipient personas, email types (initial, followup_1, followup_2, followup_3), and style variants — filling those gaps with controlled diversity. The synthetic pipeline also generates negative examples for contrastive training: emails that are too long, too generic, or too salesy, paired with labels so the model learns the failure modes explicitly:
python3 generate_synthetic_emails.py \
--count 1000 --neg-ratio 0.15
LoRA Configuration for M1 16GB
To understand the configuration choices here, it helps to first understand what LoRA actually does. A language model like this one contains 1.7 billion individual numbers — weights — that encode everything it knows about language. Fine-tuning all of them would require storing gradients and optimizer states for every single weight, which on a 16GB M1 is simply not possible. LoRA sidesteps this entirely. Instead of retraining all 1.7 billion numbers, it freezes the original model and trains a tiny set of new numbers that get added on top. Think of it as placing a thin lens over the model's existing knowledge rather than rebuilding its brain. The lens is small, cheap to train, and can be swapped out — but it meaningfully shifts how the model behaves.
The training config is carefully tuned for the M1's 16GB unified memory constraint:
"outreach-email": TrainConfig(
model="mlx-community/Qwen3-1.7B-4bit",
max_seq_length=512,
batch_size=2,
grad_accumulation_steps=8, # effective batch = 16
learning_rate=1e-5,
epochs=5,
lora=LoRAConfig(rank=8, alpha=32.0, dropout=0.1),
warmup_steps=30,
)
The central LoRA parameter is rank, which controls the size of that lens. A rank of 1 is almost no lens at all — the model can barely shift its behavior. A rank of 64 is a much larger lens, capable of capturing complex patterns, but it also has far more parameters to train. With a small fine-tuning dataset like a set of outreach emails, high rank is a liability: the model starts memorising the training examples instead of learning the underlying style. Rank 8 sits in the sweet spot — enough capacity to learn tone, structure, and personalisation patterns, not so much that it collapses onto the training set.
Alpha 32 sets the scaling factor that determines how strongly the LoRA updates influence the frozen base weights. With rank 8, that gives a scale factor of 4.0 — a strong signal that ensures the fine-tuned behaviour actually surfaces in outputs rather than being washed out by the base model's momentum.
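Mechanically, the adapter's contribution to a frozen weight matrix W is (alpha / rank) * B @ A. A toy numpy sketch (dimensions are arbitrary; real adapters sit inside the attention projections):

```python
import numpy as np

d, r, alpha = 16, 8, 32.0
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # frozen base weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-init
                                     # so training starts exactly at W

scale = alpha / r                    # 32 / 8 = 4.0, the factor cited above
W_effective = W + scale * (B @ A)    # what the forward pass actually uses
```

Zero-initializing B is the standard LoRA trick: the adapter is a no-op at step zero, and the scale factor amplifies whatever B @ A learns afterwards.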
Sequence length and batch size are where the memory arithmetic gets interesting. Emails run roughly 400 tokens, so capping at 512 is accurate without waste — and every token saved per sample directly translates to room for more samples per batch. But rather than setting batch_size=16 directly, the config uses batch_size=2 with grad_accumulation=8, yielding the same effective batch of 16. The reason is physical: in a true batch of 16, all 16 samples live in GPU memory simultaneously during the forward pass. With gradient accumulation, only 2 samples occupy memory at once, and the gradients are summed across 8 sequential steps before the weight update fires. The model sees the same effective signal; the GPU sees a fraction of the memory pressure.
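The equivalence is easy to verify numerically: averaging gradients over 8 micro-batches of 2 gives the same update direction as one true batch of 16. A toy check with synthetic per-sample gradients:

```python
import numpy as np

rng = np.random.default_rng(1)
per_sample_grads = rng.normal(size=(16, 4))  # one gradient vector per sample

full_batch = per_sample_grads.mean(axis=0)   # true batch_size=16 gradient

accum = np.zeros(4)
for step in range(8):                        # 8 micro-batches of size 2
    micro = per_sample_grads[step * 2:(step + 1) * 2]
    accum += micro.mean(axis=0)              # accumulate, no weight update yet
accum /= 8                                   # average across accumulation steps
```

Same signal, one-eighth of the peak activation memory, which is exactly the trade the config makes.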
Learning rate 1e-5 is deliberately gentler than the 2e-5 used for classification tasks. Classification has a clear binary signal — right or wrong. Email quality is continuous and subjective: rhythm, warmth, specificity. A higher learning rate would chase that noisy signal too aggressively and destabilise training. Cosine decay with a short warmup period compounds this: the warmup prevents the optimiser from making large destructive updates before it has seen enough data to orient itself, and the cosine schedule then decays the rate smoothly rather than dropping it off a cliff.
To validate that rank 8 and 1e-5 were actually the right choices rather than reasonable guesses, I ran a 3x3 hyperparameter sweep across three rank values and three learning rates:
for _rank in [4, 8, 16]:
for _lr in [5e-6, 1e-5, 2e-5]:
_key = f"outreach-email-sweep/r{_rank}_lr{_lr:.0e}"
A sweep like this trains a separate model for each combination and measures validation loss — how well each model predicts held-out examples it never trained on. Low training loss but high validation loss means the model memorised the training data; low validation loss means it generalised. Rank 8 with 1e-5 learning rate produced the lowest validation loss across runs, which is the confirmation that these numbers are not just memory-safe but genuinely optimal for this dataset size and task.
Evaluation: Beyond Loss Curves
Training loss going down is not the same thing as the model getting better at writing cold emails. Loss measures one thing: how well the model predicts the next token in a sequence. A model can reach near-perfect loss on your training set by learning to reproduce the statistical shape of your examples — common word patterns, sentence lengths, punctuation habits — without ever producing an email that a human would read and reply to. Low loss means the model is fluent. It says nothing about whether the subject line fits on a mobile screen, whether the email triggers a spam filter, or whether the recipient has any idea what you want from them.
That gap is why I built a rule-based evaluation pipeline that scores each generated email on properties that directly affect deliverability and conversion. Every check has a concrete failure mode attached to it.
def score_email(
    parsed: dict | None,
    email_type: str = "initial"
) -> dict:
    scores = {
        "json_valid": False,
        "subject_ok": False,
        "word_count_ok": False,
        "has_placeholder": False,
        "no_spam": False,
        "has_cta": False,
        "has_sign_off": False,
    }
    if parsed is None:
        return scores  # unparseable output fails every check
    scores["json_valid"] = (
        isinstance(parsed.get("subject"), str)
        and isinstance(parsed.get("body"), str)
    )
    # ... remaining checks populate the other flags ...
    return scores
json_valid is the gating check: the pipeline expects structured output with a subject key and a body key. If the model produces malformed JSON — or wraps the output in markdown fences it forgot to strip — the entire downstream pipeline crashes. No parse, no send. subject_ok enforces a 60-character hard limit because Gmail truncates subject lines on mobile at roughly that length; a truncated subject reads as unprofessional and suppresses open rates. no_spam scans the body against a blocklist of high-signal spam words — "free", "act now", "guaranteed" — because spam filters use keyword matching as a fast first pass, and a single match can send the email straight to junk before any human ever sees it. has_cta checks that the email contains at least one action verb tied to a meeting or conversation; an email that ends without a clear ask wastes the recipient's attention and leaves the sender with no signal. has_sign_off verifies the final line contains a closing word, because emails that end abruptly after the pitch read as machine-generated and lower reply rates.
Word count limits are type-aware because follow-up emails should get progressively shorter:
| Email Type | Word Count Range |
|---|---|
| Initial | 80-220 words |
| First follow-up | 60-150 words |
| Second follow-up | 50-130 words |
| Final follow-up | 35-100 words |
The reasoning is simple: each touchpoint in a sequence earns less patience than the last. The first email can afford to establish context and make a case. By the fourth email, the recipient has already seen your name three times and decided not to reply — a long message signals you haven't noticed. Three focused sentences is the right budget at that point; anything more gets deleted.
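The table translates directly into a type-aware check. A sketch, using the email-type names from the synthetic data pipeline:

```python
# Word-count budgets per email type, straight from the table above.
WORD_COUNT_RANGES = {
    "initial": (80, 220),
    "followup_1": (60, 150),
    "followup_2": (50, 130),
    "followup_3": (35, 100),
}

def word_count_ok(body: str, email_type: str = "initial") -> bool:
    lo, hi = WORD_COUNT_RANGES[email_type]
    return lo <= len(body.split()) <= hi
```

A 100-word body passes as an initial email but fails as a final follow-up, which is the whole point of making the check type-aware.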
Beyond structural checks, the eval pipeline computes two semantic metrics. TF-IDF cosine similarity measures how closely the generated emails resemble the reference emails in the training set — a similarity score that drops sharply is an early warning that the model has drifted away from the vocabulary and tone that produced good results. Diversity scoring inverts that: it computes one minus the mean pairwise similarity across all generated emails. A diversity score near zero means the model has collapsed onto a single template, substituting different company names into the same skeleton. Both numbers need to stay in range; a model that is too similar to references is overfitting, and a model that is too dissimilar has gone off-distribution.
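The diversity metric reduces to a few lines. This sketch uses raw term-frequency vectors instead of full TF-IDF to stay dependency-free, but the shape of the computation (one minus mean pairwise cosine similarity) is the same:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def diversity(emails: list[str]) -> float:
    """1 - mean pairwise similarity; needs at least two emails.

    A score near 0 means the batch collapsed onto a single template.
    """
    vecs = [Counter(e.lower().split()) for e in emails]
    pairs = [(i, j) for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    mean_sim = sum(cosine(vecs[i], vecs[j]) for i, j in pairs) / len(pairs)
    return 1.0 - mean_sim

print(diversity(["hi there team"] * 3))  # 0.0, i.e. template collapse
```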
Apple Silicon Constraints and Optimizations
Apple Silicon — the M-series chips powering modern MacBooks — uses a design called unified memory architecture. Unlike a traditional desktop setup where the CPU has its own RAM and the GPU has its own dedicated VRAM, Apple Silicon puts everything on one shared pool. The GPU and CPU read from and write to the same physical memory. There is no PCIe bus to copy data across, no transfer latency, no duplication. For machine learning inference, this is significant: the model weights live once in memory and the GPU can reach them directly.
But fast access alone does not make inference instant. The real bottleneck on M1 is memory bandwidth — 68.25 GB/s — not raw compute. The GPU can multiply matrices far faster than it can read the numbers to multiply. So the speed limit is not how much math the chip can do; it is how quickly the model's weights can be streamed from memory into the compute units. This is true of most modern inference workloads, and it is why quantization matters so much here.
A standard model parameter is stored as a 32-bit or 16-bit floating-point number. 4-bit quantization compresses each parameter down to just 4 bits — four to eight times smaller. The model reads from memory faster because there is simply less to read. Quality does drop slightly, but for email generation the difference is imperceptible in practice.
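The bandwidth argument can be put in numbers. Each decode step must stream essentially all of the weights once per generated token, so bandwidth divided by model size gives a hard ceiling on tokens per second. A rough sketch using the M1 figure above (the fp16 size is an assumed ~2 bytes/parameter):

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed for a memory-bandwidth-bound model."""
    return bandwidth_gb_s / model_size_gb

print(decode_ceiling_tok_s(68.25, 1.0))  # ~68 tok/s for the ~1 GB 4-bit model
print(decode_ceiling_tok_s(68.25, 3.4))  # ~20 tok/s for a hypothetical fp16 copy
```

Shrinking the weights 3-4x raises the ceiling by the same factor, which is why 4-bit quantization is the single biggest speed lever on this hardware.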
The M1 also has an 8MB System-Level Cache (SLC), a fast on-chip buffer sitting between main memory and the compute cores. A 512-token attention KV cache fits comfortably within this 8MB boundary, which means the most frequently accessed state during a generation pass stays in the fastest possible layer of the memory hierarchy.
The memory math then becomes straightforward: the 1.7B-4bit email model consumes roughly 1GB; the 3B-4bit classification model takes about 2GB. Both run simultaneously well within the 16GB ceiling. The preflight check exists to confirm both inference servers are actually healthy before any pipeline work begins — catching a crashed server early avoids silent failures mid-batch.
make leads-preflight # Verifies LLM servers on :8080 and :8081
Results After 1,100+ Training Steps
| Metric | Base Model | Fine-Tuned |
|---|---|---|
| JSON parse rate | ~85% | ~100% |
| Subject length compliance | ~70% | 95%+ |
| Spam score | Occasional triggers | Consistently < 0.1 |
| Generation speed | ~50ms | ~50ms (no degradation) |
| Personalization | Generic platitudes | References actual tech stack |
These numbers matter more than they first appear, so it's worth unpacking each one.
The JSON parse rate jumping from ~85% to ~100% is the difference between a pipeline that runs and one that doesn't. At 85%, 15 out of every 100 emails throw a parse error — the draft is lost, the contact is skipped, and you have silent gaps in your outreach batch. At ~100%, parse failures all but vanish. Every contact gets a draft. The pipeline becomes boring in the best way.
Subject line compliance going from 70% to 95% has two practical consequences: mobile truncation and spam filtering. Most email clients cut subject lines off around 50-60 characters. Before fine-tuning, roughly 30% of generated subjects were being clipped mid-sentence on recipients' phones — or worse, tripping spam heuristics from excessive punctuation and vague phrasing. At 95% compliance, subjects land as intended.
A spam score below 0.1 means Gmail and Outlook route these to the inbox, not the junk folder. That's the whole game in cold outreach. You can write the best email in the world and it doesn't matter if it never gets seen.
The personalization improvement is the one that changes how the emails read. Before fine-tuning, the model would write "I noticed your innovative approach to the industry" — a phrase that could apply to any company in any sector. After training on real examples, the model reads the tech_stack field, sees that the company runs Kubernetes on GCP with a Go backend, and writes specifically about that. It's not simulated personalization; it's actual context use.
The entire pipeline — from contact list to reviewed drafts — runs offline on a laptop with no API keys needed for inference. The only cloud dependency is the Neon PostgreSQL database where contacts and companies live.
What I'd Build Differently
Structured output mode. Right now the model generates text and a post-processing step strips markdown fences, fixes stray quotes, and retries on parse failures. There is a cleaner way: grammar-constrained decoding restricts which tokens the model can generate at each step, so it is physically incapable of outputting malformed JSON. mlx_lm now supports this. The parsing dance goes away entirely.
DPO over SFT. Supervised fine-tuning shows the model good examples and says "write like this." Direct Preference Optimization also shows bad examples and says "don't write like this." The synthetic data pipeline already generates negative examples — low-quality drafts used as rejection signal. Wiring those into DPO training would teach the model to actively avoid weak patterns, not just imitate strong ones.
Engagement feedback loop. Right now training is static. But Resend already tracks reply_received and opened per campaign. Those signals could automatically reweight the training data — emails that got replies become higher-weight positive examples, emails that were ignored become negatives. The model would gradually optimize for replies rather than just format compliance.
FAQ
Q: What is LoRA in AI fine-tuning? A: LoRA (Low-Rank Adaptation) trains small rank-decomposition matrices injected into a frozen pre-trained model, rather than updating all the weights. In practice this reduces trainable parameters to roughly 0.2% of the full model — which is why fine-tuning runs on a laptop at all. Without LoRA you'd need a data center; with it, you need a MacBook and a few hours.
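The ~0.2% figure is straightforward to estimate. With rank-8 adapters on the four attention projections, the parameter count is layers x projections x rank x (d_in + d_out). The hidden size and layer count below are approximate assumptions for Qwen3-1.7B:

```python
hidden, layers, rank, projections = 2048, 28, 8, 4  # assumed architecture figures

# Each adapted projection adds an (rank x hidden) A and (hidden x rank) B matrix.
lora_params = layers * projections * rank * (hidden + hidden)
total_params = 1.7e9

ratio = lora_params / total_params
print(f"{ratio:.2%}")  # ~0.2%, matching the figure in the answer above
```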
Q: Can I use a fine-tuned model for fully automated email sending? A: No — and this isn't just a liability hedge. Automated outreach without human review is how you get your domain blacklisted, your email provider account suspended, and your company's reputation damaged with prospects before a conversation even starts. This system has a mandatory approval gate: every draft must be explicitly accepted or rejected before anything leaves your server. The model drafts; a human decides.
Q: How much data is needed to fine-tune a model like Qwen for emails? A: Here, a combination of real sent emails exported from Resend with engagement signals, plus synthetic data generated by DeepSeek as a teacher model, totaling roughly 1,000 examples. Quality matters more than volume — a few hundred high-signal real examples anchors the style, and synthetic augmentation fills the coverage gaps for edge cases like missing tech stack fields or uncommon verticals.
Q: Is running local inference actually faster than a cloud API? A: For batch processing, decisively yes. The fine-tuned Qwen3-1.7B generates drafts at around 50ms each on M1, versus 400-800ms round-trips to cloud APIs. For a batch of 200 contacts that's roughly 10 seconds locally against 80-160 seconds via API — and you save $2-6 per batch in token costs. The latency advantage compounds across repeated runs during a campaign cycle.
The promise of AI cold email automation is real, but it is realized through specialization, not generality. A general-purpose LLM writes for everyone, which means it writes compellingly for no one in particular. By fine-tuning Qwen3 with LoRA on real outreach data, adding Rust quality gates that enforce structural and deliverability constraints, and keeping a human in the loop before anything sends, you stop using a generic tool and start operating a dedicated engine — one that runs on your own hardware, speaks your prospect's actual stack and context, improves as your campaign data accumulates, and costs nothing per email to run once it's trained.
