The Two-Layer Model That Separates AI Teams That Ship from Those That Demo
In February 2024, a Canadian tribunal ruled that Air Canada was liable for a refund policy its chatbot had invented. The policy did not exist in any document. The bot generated it from parametric memory, presented it as fact, a passenger relied on it, and the airline refused to honor it. The tribunal concluded it did not matter whether the policy came from a static page or a chatbot — it was on Air Canada's website and Air Canada was responsible. The chatbot was removed. Total cost: legal proceedings, compensation, reputational damage, and the permanent loss of customer trust in a support channel the company had invested in building.
This was not a model failure. GPT-class models producing plausible-sounding but false information is a known, documented behavior. It was a process failure: the team built a customer-facing system without a grounding policy, without an abstain path, and without any mechanism to verify that the bot's outputs corresponded to real company policy. Every one of those gaps maps directly to a meta approach this article covers.
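What an abstain path looks like in practice: a check between the model's draft answer and the retrieved source material, with escalation to a human when the answer cannot be grounded. The sketch below is illustrative only — every name in it (`grounded_answer`, the word-overlap heuristic, the abstain message) is a hypothetical placeholder, not any specific framework's API; production systems typically use an entailment or citation-verification model instead of lexical overlap.

```python
# Minimal sketch of an abstain path: return the model's answer only when
# every sentence is grounded in retrieved policy text; otherwise escalate.
# All names here are hypothetical placeholders, not a framework API.

ABSTAIN_MESSAGE = ("I can't confirm that from our published policies. "
                   "Connecting you with an agent.")

def grounded_answer(question: str, passages: list[str], answer: str) -> str:
    """Return `answer` only if it overlaps with the retrieved passages."""
    if not passages:
        return ABSTAIN_MESSAGE  # nothing to ground against: abstain
    # Crude grounding check: each sentence must share enough vocabulary
    # with at least one retrieved passage. Real systems would use an
    # entailment model here, not word overlap.
    for sentence in filter(None, (s.strip() for s in answer.split("."))):
        words = set(sentence.lower().split())
        supported = any(
            len(words & set(p.lower().split())) >= max(2, len(words) // 3)
            for p in passages
        )
        if not supported:
            return ABSTAIN_MESSAGE  # unsupported claim: abstain
    return answer
```

The design point is not the heuristic — it is that the abstain branch exists at all, so an ungrounded policy claim never reaches the passenger.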
In 2025, a multi-agent LangChain setup entered a recursive loop and made 47,000 API calls in six hours. Cost: $47,000+. There were no rate limits, no cost alerts, no circuit breakers. The team discovered the problem by checking their billing dashboard.
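The missing safeguards are a few dozen lines of code. Below is a minimal sketch of a spend circuit breaker, assuming a rolling window with a hard call budget and dollar budget; the class, limits, and the `call_model` stub are illustrative inventions, not part of LangChain or any provider SDK.

```python
# Minimal sketch of a spend circuit breaker for LLM/API calls: a hard
# per-window call budget and dollar budget that trip before a runaway
# loop can run for hours. Names and limits are illustrative.
import time

class BudgetExceeded(RuntimeError):
    pass

class CircuitBreaker:
    def __init__(self, max_calls: int, max_cost_usd: float,
                 window_s: float = 3600.0):
        self.max_calls = max_calls
        self.max_cost_usd = max_cost_usd
        self.window_s = window_s
        self.calls: list[tuple[float, float]] = []  # (timestamp, cost)

    def check(self, est_cost_usd: float) -> None:
        now = time.monotonic()
        # Keep only calls inside the rolling window.
        self.calls = [(t, c) for t, c in self.calls
                      if now - t < self.window_s]
        if len(self.calls) + 1 > self.max_calls:
            raise BudgetExceeded(f"call limit {self.max_calls}/window hit")
        if sum(c for _, c in self.calls) + est_cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"cost limit ${self.max_cost_usd}/window hit")
        self.calls.append((now, est_cost_usd))

def call_model(prompt: str) -> str:
    """Stub standing in for a real model API call."""
    return "ok"

breaker = CircuitBreaker(max_calls=500, max_cost_usd=50.0)

def guarded_call(prompt: str) -> str:
    breaker.check(est_cost_usd=0.02)  # trips BEFORE the spend happens
    return call_model(prompt)
```

With limits like these, the recursive loop above would have been cut off within minutes at a bounded cost, and the exception itself is the alert the team never received.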
These are not edge cases. A January 2025 Mount Sinai study found leading AI chatbots hallucinated on 50–82.7% of fictional medical scenarios — GPT-4o's best-case error rate was 53%. Forty-seven percent of enterprise AI users admitted making at least one major business decision based on hallucinated content in 2024. MIT research estimates only 5% of GenAI pilots achieve rapid revenue acceleration, and puts the fraction of enterprise AI demos that reach production-grade reliability at roughly the same level. The average prototype-to-production gap: eight months of engineering effort that often ends in rollback or permanent demo-mode operation.
The gap between a working demo and a production-grade AI system is not a technical gap. It is a strategic one. Teams that ship adopt a coherent set of meta approaches — architectural postures that define what the system fundamentally guarantees — before they choose frameworks, models, or methods. Teams that demo have the methods without the meta approaches.
This article gives you both layers: how they map to each other, the real-world failures that follow when each is ignored, and exactly how to start applying each approach, beginning with eval-first development, in your own system today.
McKinsey reports 78% of organizations now use AI in at least one business function — up from 55% twelve months prior. Databricks found organizations put 11x more models into production year-over-year. Yet MIT research finds only 5% of GenAI pilots achieve rapid revenue acceleration. The gap is almost always strategic, not technical. Enterprise LLM spend reached $8.4 billion in H1 2025 alone, with approximately 40% of enterprises now spending $250,000+ per year on LLM infrastructure.
