From Zero-Copy Infrastructure to Intelligent Crawling: Building a Lead Generation Pipeline in Rust

· 10 min read
Vadim Nicolai
Senior Software Engineer

The most expensive failure in B2B lead generation isn't a broken script — it's a crawl that works perfectly but harvests nothing of value. Traditional crawlers, armed with static URL lists like /about and /team, achieve a dismal ~15% harvest rate because they lack the intelligence to adapt. They waste cycles on barren domains and miss rich pages hiding under unconventional paths. The breakthrough isn't a smarter algorithm in a slow pipeline; it's building a pipeline where intelligence is free. By combining zero-copy data infrastructure with adaptive crawling logic in Rust, you can build a system that discovers 2-3x more high-value leads while consuming a fraction of the resources of a Python-based equivalent.

How Novelty Drives an RL Web Crawler

· 14 min read
Vadim Nicolai
Senior Software Engineer

The most dangerous assumption in applied Reinforcement Learning (RL) is that useful exploration requires massive scale—cloud GPU clusters, terabytes of experience, and billion-parameter models. I built a system that proves the opposite. The core innovation of a production-grade, B2B lead generation web crawler isn't its performance, but its location: it runs entirely on an Apple M1 MacBook, with zero cloud dependencies. Its ability to navigate the sparse-reward desert of the web emerges not from brute force, but from a meticulously orchestrated multi-timescale novelty engine. This architecture, where intrinsic curiosity, predictive uncertainty, and a self-adjusting curriculum interlock, provides a general blueprint for building autonomous agents that must find needles in the world's largest haystacks.

Multi-Modal Evaluation for AI-Generated LEGO Parts: A Production DeepEval Pipeline

· 19 min read
Vadim Nicolai
Senior Software Engineer

Your AI pipeline generates a parts list for a LEGO castle MOC. It says you need 12x "Brick 2 x 4" in Light Bluish Gray, 8x "Arch 1 x 4" in Dark Tan, and 4x "Slope 45 2 x 1" in Sand Green. The text looks plausible. But does the part image next to "Arch 1 x 4" actually show an arch? Does the quantity make sense for a castle build? Would this list genuinely help someone source bricks for the build?

These are multi-modal evaluation questions — they span text accuracy, image-text coherence, and practical usefulness. Standard unit tests cannot answer them. This article walks through a production evaluation pipeline built with DeepEval that evaluates AI-generated LEGO parts lists across five axes, using image metrics that most teams haven't touched yet.

The system is real. It runs in Bricks, a LEGO MOC discovery platform built with Next.js 19, LangGraph, and Neon PostgreSQL. The evaluation judge is DeepSeek — not GPT-4o — because you don't need a frontier model to grade your outputs.

Synthetic Evaluation with DeepEval: A Production RAG Testing Framework

· 13 min read
Vadim Nicolai
Senior Software Engineer

Your RAG pipeline passes all 20 of your hand-written test questions. It retrieves the right context, generates grounded answers, and the demo looks great. Then it goes to production, and users start asking the 21st question — the one that exposes a retrieval gap, a hallucinated citation, or a context window that silently truncated the most relevant chunk. You had 20 tests for a knowledge base with 55 documents. That's 0.4% coverage. The other 99.6% was untested surface area.

Red Teaming LLM Applications with DeepTeam: A Production Implementation Guide

· 21 min read
Vadim Nicolai
Senior Software Engineer

Your LLM application passed all its unit tests. It's still dangerously vulnerable. This isn't a bug; it's a fundamental misunderstanding of risk in autonomous systems. Consider this: an AI agent with a seemingly robust 85% per-step accuracy has only a ~20% chance of successfully completing a 10-step task. That's the brutal math of compound probability in agentic workflows. The gap between functional correctness and adversarial safety is where silent, catastrophic failures live — failures that manifest as cost-burning "Tool Storms" or logic-degrading "Context Bloat".
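The compound-probability arithmetic behind that ~20% figure is easy to verify directly:

```python
# Probability that an agent completes all n steps when each
# independent step succeeds with probability p_step:
#   P(task) = p_step ** n
p_step = 0.85

for n in (1, 5, 10, 20):
    print(f"{n:2d} steps -> {p_step ** n:.1%}")

# At 10 steps, 0.85 ** 10 is about 19.7%, roughly the ~20% quoted above.
```

Note how quickly reliability collapses: doubling the task length from 10 to 20 steps drops the success rate below 4%.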

CrewAI's Genuinely Unique Features: An Honest Technical Deep-Dive

· 14 min read
Vadim Nicolai
Senior Software Engineer

TL;DR — CrewAI's real uniqueness is that it models problems as "build a team of people" rather than "build a graph of nodes" (LangGraph) or "build a conversation" (AutoGen). The Crews + Flows dual-layer architecture is the core differentiator. The role-playing persona system and autonomous delegation are ergonomic wins, not technical breakthroughs. The hierarchical manager is conceptually appealing but broken in practice. This post separates what's genuinely novel from what's marketing.

DeepEval for Healthcare AI: Eval-Driven Compliance That Actually Catches PII Leakage Before the FDA Does

· 20 min read
Vadim Nicolai
Senior Software Engineer

The most dangerous failure mode for a healthcare AI isn't inaccuracy—it's a compliance breach you didn't test for. A model can generate a perfect clinical summary and still violate HIPAA by hallucinating a patient's name that never existed. Under the Breach Notification Rule, that fabricated yet plausible Protected Health Information (PHI) constitutes a reportable incident. Most teams discover these gaps during an audit or, worse, after a breach. The alternative is to treat compliance not as a post-hoc checklist, but as an integrated, automated evaluation layer that fails your CI pipeline before bad code ships. This is eval-driven compliance, and it's the only way to build healthcare AI that doesn't gamble with regulatory extinction.

Reference implementation: Every code example in this article is drawn from Agentic Healthcare, an open-source blood test intelligence app that tracks 7 clinical ratios over time using velocity-based trajectory analysis. The full eval suite, compliance architecture, and production code are available in the GitHub repository.
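As a toy illustration of that failure mode (not code from the repository), a hallucinated-name check can compare name-like tokens in a generated summary against the source record. A production compliance layer would use a clinical NER model or an LLM-based eval metric instead of this crude regex:

```python
import re

# Crude stand-in for clinical NER: capitalized words are treated as
# candidate name tokens. Real pipelines would use a proper NER model.
NAME_TOKEN = re.compile(r"\b[A-Z][a-z]+\b")

def hallucinated_names(source: str, summary: str) -> set[str]:
    """Return name-like tokens in `summary` that never appear in `source`."""
    return set(NAME_TOKEN.findall(summary)) - set(NAME_TOKEN.findall(source))

record = "Patient Jane Doe, age 47. LDL 130 mg/dL, HDL 52 mg/dL."
good = "Jane Doe shows a borderline LDL of 130 mg/dL."
bad = "John Smith shows a borderline LDL of 130 mg/dL."

print(hallucinated_names(record, good))          # -> set()
print(sorted(hallucinated_names(record, bad)))   # -> ['John', 'Smith']
```

The point is the wiring, not the regex: a check like this runs as an eval in CI, so a summary that invents a patient name fails the build instead of becoming a reportable incident.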

The Case Against Mandatory In-Person Work for AI Startups

· 8 min read
Vadim Nicolai
Senior Software Engineer

The argument for an "office-first" culture is compelling on its face. It speaks to a romantic ideal of innovation: chance encounters, whiteboard epiphanies, and a shared mission forged over lunch. For a company building AI, this narrative feels intuitively correct. As a senior engineer who has worked in both colocated and globally distributed teams, I understand the appeal.

But intuition is not a strategy, and anecdotes are not data. When we examine the evidence and the unique constraints of an AI startup, a mandatory in-person policy looks like a self-imposed bottleneck. It limits access to the most critical resource—talent—and misunderstands how modern technical collaboration scales.

LLM as Judge: What AI Engineers Get Wrong About Automated Evaluation

· 20 min read
Vadim Nicolai
Senior Software Engineer

Claude 3.5 Sonnet rates its own outputs approximately 25% higher than a human panel would. GPT-4 gives itself a 10% boost. Swap the order of two candidate responses in a pairwise comparison, and the verdict flips in 10-30% of cases — not because the quality changed, but because the judge has a position preference it cannot override.

These are not edge cases. They are the default behavior of every LLM-as-judge pipeline that ships without explicit mitigation. And most ship without it.
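A standard mitigation for position preference, sketched here with a stand-in `judge` callable rather than a real model call, is to run each pairwise comparison in both orders and only count verdicts that survive the swap:

```python
def debiased_verdict(judge, prompt, a, b):
    """Run a pairwise judge in both orders; accept only consistent verdicts.

    `judge(prompt, first, second)` is an assumed interface returning
    "first" or "second"; any real LLM call would sit behind it.
    """
    v1 = judge(prompt, a, b)   # a shown in the first position
    v2 = judge(prompt, b, a)   # order swapped
    if v1 == "first" and v2 == "second":
        return "a"             # a wins regardless of position
    if v1 == "second" and v2 == "first":
        return "b"
    return "tie"               # verdict flipped with position: inconclusive

# Demo with a mock judge that always prefers whatever is shown first:
position_biased = lambda prompt, first, second: "first"
print(debiased_verdict(position_biased, "Which answer is better?", "A", "B"))
# -> tie: the position preference is exposed instead of silently counted
```

The cost is two judge calls per comparison; the benefit is that pure position bias surfaces as a tie rather than a systematically wrong verdict.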

LLM-as-judge — the practice of using a capable large language model to score or compare outputs from another LLM — has become the dominant evaluation method for production AI systems. 53.3% of teams with deployed AI agents now use it, according to LangChain's 2025 State of AI Agents survey. The economics are compelling: 80% agreement with human preferences at 500x-5,000x lower cost. But agreement rates and cost savings obscure a deeper problem. Most teams adopt the method, measure the savings, and never measure the biases. The result is evaluation infrastructure that looks automated but is quietly wrong in systematic, reproducible ways.

This article covers the mechanism, the research, and the biases that break LLM judges in production.

What is LLM as a judge? LLM-as-a-Judge is an evaluation methodology where a capable large language model scores or compares outputs from another LLM application against defined criteria — such as helpfulness, factual accuracy, and relevance — using structured prompts that request chain-of-thought reasoning before a final score. The method achieves approximately 80% agreement with human evaluators, matching human-to-human consistency, at 500x-5,000x lower cost than manual review.
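The structured-prompt pattern that definition describes can be sketched as a template plus a score parser. The template text and `SCORE:` convention here are illustrative, not a prompt from any particular library:

```python
import re

# Illustrative judge prompt: chain-of-thought reasoning first,
# then a machine-parseable score on the final line.
JUDGE_PROMPT = """You are evaluating an AI assistant's answer.

Criteria: helpfulness, factual accuracy, relevance.

Question: {question}
Answer: {answer}

Think step by step about how the answer meets each criterion,
then end with a line of the form: SCORE: <1-5>"""

def parse_score(judge_reply: str) -> int:
    """Pull the final integer score out of the judge's reply."""
    match = re.search(r"SCORE:\s*([1-5])\s*$", judge_reply.strip())
    if not match:
        raise ValueError("judge reply missing final SCORE line")
    return int(match.group(1))

prompt = JUDGE_PROMPT.format(question="What is HIPAA?", answer="A US health privacy law.")
# The model call is elided; parsing a mock reply instead:
reply = "The answer is accurate and relevant but terse.\nSCORE: 4"
print(parse_score(reply))  # -> 4
```

Forcing the reasoning to come before the score, and rejecting replies without a parseable final line, is what makes the judge's output usable as pipeline data rather than free text.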