Blog | Vadim's blog

Observable AI Memory: mem0, LangGraph, and Qdrant with Enterprise-Grade Telemetry

June 2, 2026 · 13 min read

Vadim Nicolai

Senior Software Engineer

Most "AI memory" demos stop at memory.add() and memory.search(). That works on a laptop. It does not survive contact with production. The real questions are: When this recall is slow, which store is to blame? When a graph's spend triples overnight, which feature caused it? When a customer asks "what did your agent remember about me, and when?", can you answer from an audit log instead of a shrug?

TL;DR — This field report shows how to build an agent memory layer where every operation honors a contract: fail-open, PII-safe, and fully instrumented. Three stores (mem0, Qdrant, LangGraph) are funneled through single chokepoints, and each chokepoint fans out to five telemetry sinks. The result is a stack that answers the hard production questions without guesswork.
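As a rough illustration of what a fail-open, instrumented chokepoint could look like, here is a minimal sketch. The function name `through_chokepoint`, the event fields, and the sink signature are all hypothetical and are not taken from the article; the sketch only shows the shape of the contract: the store call never raises into the agent, and every sink receives one event that carries metadata but no payload text.

```python
import time
from typing import Any, Callable, List

# Hypothetical sink type: any callable that accepts one event dict.
Sink = Callable[[dict], None]


def through_chokepoint(store: str, op: str, fn: Callable[[], Any],
                       sinks: List[Sink], default: Any = None) -> Any:
    """Run one memory operation, emit one event per sink, and never raise."""
    start = time.perf_counter()
    result, error = default, None
    try:
        result = fn()
    except Exception as exc:
        # Fail-open: a broken store degrades recall instead of crashing the agent.
        error = type(exc).__name__

    # PII-safe by construction in this sketch: only metadata, no memory content.
    event = {
        "store": store,
        "op": op,
        "ok": error is None,
        "error": error,
        "latency_ms": (time.perf_counter() - start) * 1000.0,
    }
    for sink in sinks:
        try:
            sink(event)
        except Exception:
            pass  # A failing telemetry sink must not break the call either.
    return result
```

Because every store call passes through one function, a question like "which store is slow?" becomes a filter on `store` and `latency_ms` rather than a debugging session.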

Modern neural TTS

May 28, 2026 · 12 min read

Vadim Nicolai

Senior Software Engineer

Semantic caching for LLMs

May 26, 2026 · 12 min read

Vadim Nicolai

Senior Software Engineer

Multi-Probe Bayesian Spam Gating: Filtering Junk Before Spending Compute

April 7, 2026 · 40 min read

Vadim Nicolai

Senior Software Engineer

In a B2B lead generation pipeline, every email that arrives costs compute. Scoring it for buyer intent, extracting entities, predicting reply probability, matching it against your ideal customer profile — each module is a DeBERTa forward pass. If 40% of inbound email is template spam, AI-generated slop, or mass-sent campaigns, you are burning 40% of your GPU budget on garbage.

The solution is a gating module: a spam classifier that sits at stage 2 of the pipeline and filters junk before anything else runs. But a binary spam/not-spam classifier is too blunt. You need to know why something is spam (template? AI-generated? role account?), how confident you are (is it ambiguous, or have you never seen this pattern before?), and which provider will block it (Gmail is stricter than Yahoo on link density).

This article documents a hierarchical Bayesian spam gating system with 4 aspect-specific attention probes, information-theoretic AI detection features, uncertainty decomposition, and a full Rust distillation path. The Python model trains on DeBERTa-v3-base. The Rust classifier runs at batch speed with 24 features and zero ML dependencies.
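One common way to separate "this email is ambiguous" from "this pattern is unfamiliar" is the entropy-based split of predictive uncertainty over several stochastic predictions. The sketch below assumes that approach; the article's own decomposition may differ, and the function names are illustrative. Given several sampled spam probabilities for one email, total uncertainty is the entropy of the mean prediction, the aleatoric part is the mean entropy of the individual predictions, and the epistemic part is the difference.

```python
import math
from typing import List, Tuple


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) prediction."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def decompose_uncertainty(samples: List[float]) -> Tuple[float, float, float]:
    """Split predictive uncertainty into (total, aleatoric, epistemic).

    `samples` holds spam probabilities from several stochastic passes
    (for example ensemble members); this input format is an assumption.
    """
    mean_p = sum(samples) / len(samples)
    total = binary_entropy(mean_p)
    aleatoric = sum(binary_entropy(p) for p in samples) / len(samples)
    epistemic = total - aleatoric  # disagreement between the samples
    return total, aleatoric, epistemic
```

If every sample says 0.5, the email is genuinely ambiguous and the uncertainty is all aleatoric. If half the samples say 0.0 and half say 1.0, the samples disagree and the uncertainty is all epistemic, which is the "never seen this pattern" case.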

Building a ZoomInfo Alternative with Qwen and MLX: Local Buyer Intent Detection

April 1, 2026 · 11 min read

Vadim Nicolai

Senior Software Engineer

ZoomInfo charges $300+ per user per month for intent data — buying signals that tell sales teams which companies are actively in-market. It is the platform's number one feature and the reason enterprises pay six figures annually for access. But the underlying technology — classifying company content into intent categories — is a text classification problem. One that a 3-billion-parameter open-source model can solve on a single laptop.

Fine-Tune Qwen3 with LoRA for AI Cold Email Outreach

March 31, 2026 · 26 min read

Vadim Nicolai

Senior Software Engineer

An AI cold email engine does one thing: it reads what you know about a company and writes a personalized outreach email — automatically, at scale. If you've ever spent an afternoon manually tweaking 50 nearly-identical emails, you understand the problem. If you've paid for Instantly, Smartlead, or Apollo, you've already solved it — just not on your own terms.

Those SaaS tools charge $30-200/month, send your prospect list to their servers, and give you a black-box model you can't touch. You can't train it on your best-performing emails. You can't add custom quality gates. You can't run it offline. For engineers and technical founders, that's a bad deal.

This system is the alternative: a locally-run pipeline where you own every layer — model weights, scoring logic, and approval gates. The core is Qwen3-1.7B, fine-tuned with LoRA adapters on MLX (Apple's framework for M1/M2 Metal acceleration). A Rust orchestration layer drives the full batch loop: pulling company records, invoking the model, running quality filters, and surfacing emails for human review before anything sends.

The result is not a toy. On a single M1 MacBook Pro, the pipeline generates 200+ personalized emails per batch in under 10 seconds — no GPU cloud, no API latency, no per-email cost. Fine-tuning converges in under 30 minutes on the same machine.

TurboQuant: 3-Bit KV Caches with Zero Accuracy Loss

March 29, 2026 · 10 min read

Vadim Nicolai

Senior Software Engineer

Every token your LLM generates forces it to reread its entire conversational history. That history -- the Key-Value cache -- is the single largest memory bottleneck during inference. A Llama-3.1-70B serving a 128K-token context in FP16 burns through ~40 GB of VRAM on KV cache alone, leaving almost nothing for weights on a single 80 GB H100. The standard remedies -- eviction (SnapKV, PyramidKV) and sparse attention -- trade accuracy for memory. They throw tokens away.
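The ~40 GB figure can be checked with a few lines of arithmetic. The sketch below assumes the published Llama-3.1-70B configuration (80 layers, 8 key-value heads under grouped-query attention, head dimension 128), a batch of one sequence, and 128K meaning 131,072 tokens; the function name is illustrative.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bytes_per_value: int) -> int:
    """Size of the KV cache for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


# Assumed Llama-3.1-70B shape, FP16 (2 bytes per value), 128K-token context.
size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                      tokens=128 * 1024, bytes_per_value=2)
print(size / 2**30)  # size in GiB
```

Each token costs 320 KiB of cache under these assumptions, so the total grows linearly with context length and with the number of concurrent sequences.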

TurboQuant, published at ICLR 2026 by Zandieh, Daliri, Hadian, and Mirrokni from Google Research, takes the opposite approach: keep every token, compress every value. At 3 bits per coordinate it delivers 6x memory reduction. At 4 bits it delivers up to 8x speedup in computing attention logits on H100 GPUs. The headline result: on LongBench with Llama-3.1-8B-Instruct, the 3.5-bit configuration scores 50.06 -- identical to the 16-bit baseline. No retraining. No fine-tuning. No calibration data.

ScrapeGraphAI Qwen3-1.7B: Fine-Tuned Web Extraction Model and 100k Dataset

March 28, 2026 · 51 min read

Vadim Nicolai

Senior Software Engineer

The models behind leading cloud extraction APIs are orders of magnitude larger than the model that just beat them at structured web extraction. This isn't a marginal win: it's a 3.4 percentage point lead on the de facto standard SWDE benchmark. The secret isn't a novel architecture; it's domain-specific fine-tuning on a 100,000-example dataset of real scraping trajectories. The ScrapeGraphAI team's release of a fine-tuned Qwen3-1.7B model turns the conventional bigger-is-better scaling assumption on its head and delivers a complete open-source stack (model and dataset under Apache 2.0, library under MIT) for production. This is a blueprint for how narrow, expert models will outperform generalist giants, provided you have the right data.

How Novelty Drives an RL Web Crawler

March 26, 2026 · 14 min read

Vadim Nicolai

Senior Software Engineer

The most dangerous assumption in applied Reinforcement Learning (RL) is that useful exploration requires massive scale: cloud GPU clusters, terabytes of experience, and billion-parameter models. I built a system that proves the opposite. The core innovation of this production-grade B2B lead generation web crawler isn't its performance but its location: it runs entirely on an Apple M1 MacBook, with zero cloud dependencies. Its ability to navigate the sparse-reward desert of the web emerges not from brute force, but from a meticulously orchestrated multi-timescale novelty engine. This architecture, where intrinsic curiosity, predictive uncertainty, and a self-adjusting curriculum interlock, provides a general blueprint for building autonomous agents that must find needles in the world's largest haystacks.

Multi-Modal Evaluation for AI-Generated LEGO Parts: A Production DeepEval Pipeline

March 23, 2026 · 19 min read

Vadim Nicolai

Senior Software Engineer

Your AI pipeline generates a parts list for a LEGO castle MOC. It says you need 12x "Brick 2 x 4" in Light Bluish Gray, 8x "Arch 1 x 4" in Dark Tan, and 4x "Slope 45 2 x 1" in Sand Green. The text looks plausible. But does the part image next to "Arch 1 x 4" actually show an arch? Does the quantity make sense for a castle build? Would this list genuinely help someone source bricks for the build?

These are multi-modal evaluation questions — they span text accuracy, image-text coherence, and practical usefulness. Standard unit tests cannot answer them. This article walks through a production evaluation pipeline built with DeepEval that evaluates AI-generated LEGO parts lists across five axes, using image metrics that most teams haven't touched yet.

The system is real. It runs in Bricks, a LEGO MOC discovery platform built with Next.js 19, LangGraph, and Neon PostgreSQL. The evaluation judge is DeepSeek — not GPT-4o — because you don't need a frontier model to grade your outputs.