The Two-Layer Model That Separates AI Teams That Ship from Those That Demo
This article documents a production-grade architecture for generating research-grounded therapeutic content. The system prioritizes verifiable artifacts (papers → structured extracts → scored outputs → claim cards) over unstructured text.
You can treat this as a “trust pipeline”: retrieve → normalize → extract → score → repair → persist → generate.
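The pipeline stages can be sketched as a chain of small functions, each one taking a verifiable artifact and returning an enriched version of it. This is a minimal illustration only: every stage name, the `Artifact` shape, and the placeholder logic inside each function are assumptions for the sketch, not the system's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Artifact:
    """A unit of content flowing through the trust pipeline (hypothetical shape)."""
    text: str
    meta: dict = field(default_factory=dict)

def retrieve(query):
    # Stage 1: fetch raw source documents (stubbed here)
    return [Artifact(text=f"paper about {query}")]

def normalize(a):
    # Stage 2: clean whitespace/encoding so later stages see uniform text
    a.text = " ".join(a.text.split())
    return a

def extract(a):
    # Stage 3: pull a structured claim out of the raw text
    a.meta["claim"] = a.text[:40]
    return a

def score(a):
    # Stage 4: attach a confidence score to the extract
    a.meta["score"] = 0.9 if a.meta.get("claim") else 0.0
    return a

def repair(a):
    # Stage 5: patch or flag low-scoring extracts instead of dropping them
    if a.meta["score"] < 0.5:
        a.meta["claim"] = "(needs review)"
    return a

def persist(a, store):
    # Stage 6: only artifacts that survive scoring/repair are written
    store.append(a)
    return a

def generate(store):
    # Stage 7: compose final output exclusively from persisted claims
    return " | ".join(a.meta["claim"] for a in store)

store = []
for art in retrieve("sleep hygiene"):
    persist(repair(score(extract(normalize(art)))), store)

print(generate(store))
```

The key design point the sketch illustrates is that `generate` never touches raw retrieved text; it reads only from the persisted, scored store, which is what makes the output auditable.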