The Two-Layer Model That Separates AI Teams That Ship from Those That Demo

Vadim Nicolai
Senior Software Engineer

Building AI systems is not one problem—it is a set of problems layered on top of each other. To navigate them, you need two different levels of thinking:

  • Meta approaches — the umbrella mindset that determines what you optimize for (grounding, observability, correctness, human oversight, routing). These exist because AI systems are probabilistic, data-dependent, and can hallucinate, drift, and fail silently.
  • X-driven methods — the concrete artifact you iterate against in a given phase (prompt, schema, eval suite, traces). These are the hands-on loops inside the meta strategy.

Most teams get the X-driven methods right and the meta approach wrong. They write evals but don't adopt an eval-first posture. They instrument traces but don't build observability-first. They add RAG but don't commit to a grounding-first policy. The result is a system that works in demos and fails in production.

This article gives you both layers, how they map to each other, and real-world adoption data on where the industry has landed.

Industry Context (2025)

McKinsey reports 78% of organizations now use AI in at least one business function—up from 55% twelve months prior. Databricks found organizations put 11× more models into production year-over-year. Yet MIT research finds only 5% of GenAI pilots achieve rapid revenue acceleration. The gap is almost always strategic, not technical.


The Two-Layer Model

Every AI system decision lives on one of two levels. Layer 1 decides what you optimize for—the strategic guarantee your system makes. Layer 2 decides how you iterate—the concrete artifact you build and loop against day-to-day.

Layer 1 — Meta Approaches (what you optimize for)

| Meta Approach | Core Guarantee |
| --- | --- |
| 🟢 Grounding-First | Answers are backed by evidence or the system abstains |
| 🟣 Eval-First | Every change is tested against a defined correctness bar |
| 🔵 Observability-First | Every production failure is reproducible from traces |
| 🟡 Multi-Model / Routing-First | Tasks route to the right model by difficulty, cost, and capability |
| 🩷 Human-Validation-First | High-stakes outputs require human sign-off before reaching users |
| Spec-Driven (cross-cutting) | Target behavior is explicit, checkable, and enforceable at every phase |

Layer 2 — X-Driven Methods (how you iterate)

| Method | Primary Artifact | SDLC Phase |
| --- | --- | --- |
| Prompt-Driven / Example-Driven | System prompt · golden set | Discover |
| Schema / Contract-Driven · Tool / API-Driven · Workflow / Graph-Driven | Typed schemas · tool contracts · step graphs | Build |
| Retrieval-Driven (RAG) · Data & Training-Driven | Retriever config · dataset versions | Build |
| Evaluation-Driven (EDD) · Metric-Driven | Eval suite · KPI dashboards | Verify |
| Trace / Observability-Driven | Trace schema · replay harness | Operate |

How the layers connect

Read the diagram top-down: pick your meta approach (Layer 1) based on your biggest risk, then follow the arrow to the methods (Layer 2) it activates.


Part I — The Meta Approaches

These are the meta approaches with the strongest industry adoption evidence for 2024–2025: the first five are covered in this part, and the cross-cutting sixth, Spec-Driven, gets Part II to itself. They answer the question: before you pick a method, what is your system fundamentally trying to guarantee?


1. Grounding-First

"Answers must be grounded in evidence—or the system must abstain."

This is the most widely adopted AI-native meta approach. It is a posture that says: the model's parametric knowledge is not trustworthy enough on its own—every answer must be tied to a verifiable source, or the system should say nothing.

Grounding-first shapes your architecture from day one: retrieval is not a feature you add later, it is a foundational constraint. Your grounding policy (what sources are allowed, when to cite, when to abstain) is written before you write prompts.
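
In code, that policy can be a single gate in front of generation. Below is a minimal sketch in Python; the relevance threshold, field names, and sources are illustrative assumptions, not anything prescribed by a particular tool:

```python
from dataclasses import dataclass

@dataclass
class Passage:
    source: str   # document ID or URL, cited back to the user
    text: str
    score: float  # retriever relevance score in [0, 1]

MIN_EVIDENCE_SCORE = 0.75  # illustrative threshold; calibrate against your own eval set

def grounding_decision(passages: list[Passage]) -> dict:
    """Grounding-first gate: admit evidence above the bar, or abstain entirely."""
    evidence = [p for p in passages if p.score >= MIN_EVIDENCE_SCORE]
    if not evidence:
        # No qualifying evidence: the system says nothing rather than guessing.
        return {"proceed": False, "evidence": [], "citations": []}
    return {
        "proceed": True,
        "evidence": [p.text for p in evidence],             # goes into the prompt context
        "citations": sorted({p.source for p in evidence}),  # surfaced alongside the answer
    }

# Usage: run the gate before any generation call; with no qualifying evidence, abstain.
weak = [Passage("kb/blog-17", "Unrelated marketing copy...", 0.22)]
if not grounding_decision(weak)["proceed"]:
    print("I don't have enough information to answer that.")  # abstention path
```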

Why it is AI-native: Traditional software either has data or it doesn't. LLMs will confidently produce an answer whether or not they have the data—grounding-first is the only way to control this.

Adoption evidence

Menlo Ventures' 2024 State of Enterprise AI found RAG at 51% enterprise adoption, up from 31% the prior year. A survey of 300 enterprises found 86% augmenting their LLMs via RAG or similar retrieval. Databricks reported vector databases supporting RAG grew 377% year-over-year. The RAG market is projected to grow from $1.2B (2024) to $11B by 2030 at a 49.1% CAGR (Grand View Research).

X-driven methods it activates: Retrieval-Driven (RAG) · Schema-Driven (grounding schema) · Eval-Driven (groundedness eval suite)

Key tools: Pinecone · Weaviate · Chroma · Qdrant · LlamaIndex · LangChain RAG · Vectara · Cohere Rerank

Adopt when: Accuracy and citations matter; proprietary data; regulated domains; hallucination has real cost
Traps: Treating retrieval as an add-on after prompts fail; not measuring retrieval quality independently

2. Eval-First

"Define what 'correct' means as a test before you build the system that must pass it."

Eval-first is the AI equivalent of test-driven development. The distinction from simply "having evals" is that the eval suite exists before the implementation and acts as the acceptance criterion: unlike a linter you run over code after the fact, the eval is written first and the system is built to pass it. The CI gate is your quality contract.
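
A minimal, deliberately framework-free sketch of that posture: a golden set defined up front, a pass bar, and a non-zero exit code so CI blocks the change. `generate_answer` is a hypothetical stand-in for your real pipeline, and the cases and bar are illustrative:

```python
import sys

# Golden set written before the implementation: each case is an acceptance criterion.
GOLDEN_SET = [
    {"input": "What is our refund window?", "must_contain": "30 days"},
    {"input": "Do we ship to Canada?", "must_contain": "yes"},
]
PASS_BAR = 0.9  # illustrative correctness bar the change must clear

def generate_answer(prompt: str) -> str:
    """Hypothetical stand-in: replace with your real prompt + model + retrieval pipeline."""
    return "Our refund window is 30 days."

def run_eval() -> float:
    """Score the current system against the golden set."""
    passed = sum(
        1 for case in GOLDEN_SET
        if case["must_contain"].lower() in generate_answer(case["input"]).lower()
    )
    return passed / len(GOLDEN_SET)

if __name__ == "__main__":
    score = run_eval()
    print(f"eval pass rate: {score:.0%} (bar: {PASS_BAR:.0%})")
    sys.exit(0 if score >= PASS_BAR else 1)  # non-zero exit fails the CI gate
```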

This is the most structurally important meta approach for teams that want to iterate safely. Without it, every prompt change is a gamble.

Why it is AI-native: Traditional software tests are deterministic. AI evals are probabilistic, multi-dimensional, and context-dependent. They require rubrics, human calibration, slice analysis, and LLM-as-judge scaffolding that has no equivalent in traditional QA.

Adoption evidence

A 2025 arXiv paper (Xia et al., 2411.13768) formalized EDDOps (Evaluation-Driven Development and Operations) from a multivocal literature review of industrial practices. DeepEval reports nearly 500,000 monthly downloads. Andrej Karpathy flagged an "evaluation crisis" in March 2025, noting standard benchmarks have saturated—the field is shifting to task-specific, production-grounded eval suites. Leading platforms: Braintrust, Langfuse, Arize Phoenix, LangSmith, Confident AI, MLflow 3.0.

X-driven methods it activates: Evaluation-Driven (EDD) · Example-Driven (golden set) · Metric-Driven (production signal)

Key tools: DeepEval · Braintrust · Arize Phoenix · LangSmith · Confident AI · Galileo · MLflow 3.0

Adopt when: You're making changes that could silently regress behavior; swapping models or providers; multiple engineers on the same system
Traps: Using off-the-shelf benchmarks instead of production-grounded evals; "teaching to the test" without refreshing the eval set
The Saturation Problem

Standard benchmarks (MMLU, etc.) now show >90% saturation on leading models—they no longer discriminate. The only evals that matter are the ones you write yourself, from your real production failures.


3. Observability-First

"Instrument before you scale. Every production failure must be reproducible."

Observability-first means you cannot ship a system to production without the ability to capture, replay, and diff what happened. Traces are not a debugging convenience—they are the mechanism by which production failures convert into future eval cases and system improvements.

The "first" is the important part: teams that add observability after scaling discover that the most important failures happened before they started logging.

Why it is AI-native: Deterministic software can be unit-tested into confidence. AI systems are probabilistic—the same input can produce different outputs on different runs. You cannot reason about production correctness without capturing the actual inputs, model versions, retrieved documents, and tool calls that produced each output.
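
A minimal sketch of what a trace record might capture, using plain dataclasses; the field names are assumptions for illustration, and real platforms (Langfuse, LangSmith, and the others listed below) define their own schemas:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class TraceRecord:
    """One generation, captured with everything needed to replay and diff it later."""
    trace_id: str
    timestamp: str
    user_input: str                 # redact PII before persisting (see Privacy Rule below)
    model: str                      # exact model/version string used for this call
    prompt_version: str             # which system prompt produced this output
    retrieved_docs: list[str] = field(default_factory=list)  # doc IDs fed into context
    tool_calls: list[dict] = field(default_factory=list)     # name + arguments of each call
    output: str = ""
    latency_ms: float = 0.0
    cost_usd: float = 0.0

record = TraceRecord(
    trace_id="tr_0001",
    timestamp=datetime.now(timezone.utc).isoformat(),
    user_input="What is our refund window?",
    model="example-model-2025-01",   # illustrative version string
    prompt_version="support-v3",
    retrieved_docs=["kb/policy-42"],
    output="Refunds are accepted within 30 days.",
)
```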

Adoption evidence

In agent-building organizations, approximately 89% report implementing observability as a baseline practice. Langfuse (open-source, MIT) has reached over 19,000 GitHub stars and 6 million SDK installs per month as of late 2025. LangSmith serves over 100,000 members. Enterprise APM vendors Datadog and New Relic both launched dedicated LLM observability modules in 2024–2025, signaling market maturity.

X-driven methods it activates: Trace/Observability-Driven (LLMOps) · Metric-Driven (cost/latency) · Eval-Driven (replay → eval)

Key tools: Langfuse · LangSmith · Arize Phoenix · Braintrust · Helicone · W&B Weave · MLflow 3.0 · Datadog LLM Obs

Adopt when: You have agents taking actions; multi-step workflows; "why did it do that?" questions you can't answer; cost visibility needed
Traps: Logging too little (no insight); too much (privacy/overhead); tracing without replay = data without utility
Privacy Rule #1

Define a PII redaction policy before enabling production traces. User inputs regularly contain sensitive data. Redact or hash at the application boundary—never log raw user input to a third-party platform without a data processing agreement.
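
A minimal sketch of boundary redaction with stable hash tokens, so traces stay joinable without exposing the raw values; the regex patterns are illustrative only, and a real deployment would layer a dedicated PII detector on top:

```python
import hashlib
import re

# Illustrative patterns: real coverage (names, addresses, account IDs) needs a proper detector.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    """Replace PII with stable hash tokens before the text leaves the application boundary."""
    for label, pattern in PII_PATTERNS.items():
        def _token(match: re.Match) -> str:
            digest = hashlib.sha256(match.group(0).encode()).hexdigest()[:8]
            return f"<{label}:{digest}>"
        text = pattern.sub(_token, text)
    return text

print(redact("Contact me at jane@example.com or +1 415 555 0100"))
# -> Contact me at <email:...> or <phone:...>
```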


4. Multi-Model / Routing-First

"No single model is optimal for all tasks. Route dynamically based on difficulty, cost, and capability."

Multi-model / routing-first is the recognition that the LLM layer is a fleet, not a single engine. Different tasks warrant different models: simple classification → small cheap model; complex reasoning → frontier model; code generation → code-specialized model. Routing policy is a first-class product decision, not an implementation detail.
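
A minimal sketch of a routing table as a first-class artifact; the task labels, model names, and cost ceilings are placeholders, and hosted routers such as the tools listed below typically express the same decision as configuration:

```python
from dataclasses import dataclass

@dataclass
class Route:
    model: str               # placeholder names; substitute whatever your fleet actually runs
    max_cost_per_1k: float   # budget ceiling per 1k tokens for this tier

# The routing table itself is the product decision, versioned and reviewed like code.
ROUTES = {
    "classification": Route(model="small-cheap-model", max_cost_per_1k=0.001),
    "reasoning":      Route(model="frontier-model", max_cost_per_1k=0.05),
    "code":           Route(model="code-specialized-model", max_cost_per_1k=0.02),
}
FALLBACK = Route(model="frontier-model", max_cost_per_1k=0.05)

def route(task_type: str) -> Route:
    """Pick a model tier by task type; unknown tasks fall back to the strongest tier."""
    # Mis-routing hard tasks to weak models is the classic trap, so default up, not down.
    return ROUTES.get(task_type, FALLBACK)

print(route("classification").model)   # -> small-cheap-model
print(route("legal-analysis").model)   # -> frontier-model (fallback)
```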

Why it is AI-native: Traditional software modules do not have variable "capability levels" that you trade off against cost and latency. LLMs do—and that trade-off space is large enough to be a competitive moat.

Adoption evidence

Menlo Ventures found enterprises "typically deploy 3+ foundation models" and route per use case—described as "the pragmatic norm" for 2024. 37% of enterprises use 5+ models in production environments. The Anthropic/OpenAI/Google competition has driven model quality differentiation, making routing economically significant. Model API spending more than doubled to $8.4B in 2025.

X-driven methods it activates: Schema/Contract-Driven (router contract) · Metric-Driven (cost/quality per route) · Observability-Driven (route distribution monitoring)

Key tools: OpenRouter · Portkey · LiteLLM · AWS Bedrock multi-model · Azure AI Foundry routing

Adopt when: Cost is a constraint; tasks vary significantly in complexity; you want model-swap resilience
Traps: Router mis-classification sends complex tasks to weak models; routing adds latency without quality-gating

5. Human-Validation-First (HITL)

"Define explicitly which outputs require human validation before they reach users—before you build."

Human-in-the-loop first is not about slowing AI down. It is about deciding, architecturally, where the human is in the loop—before you design the feedback pipeline, the review queue, the annotation tooling, or the escalation SLA. Teams that add HITL after deployment discover they've built pipelines with no natural review points.
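
A minimal sketch of that architectural decision: a release gate that diverts defined risk categories into a review queue instead of returning them to the user. The category labels and the in-memory queue are illustrative stand-ins for a real review system with an SLA:

```python
from dataclasses import dataclass
from queue import Queue

# Decided up front, not bolted on: which output categories require human sign-off.
REQUIRES_REVIEW = {"medical", "legal", "financial", "irreversible_action"}
review_queue: Queue = Queue()  # stand-in for a real review/annotation queue

@dataclass
class Draft:
    output_id: str
    category: str   # illustrative risk label assigned upstream (classifier or rules)
    text: str

def release(draft: Draft) -> str:
    """Route high-stakes drafts to human review; release the rest directly."""
    if draft.category in REQUIRES_REVIEW:
        review_queue.put(draft)  # reviewer decisions should also be logged as training signal
        return "pending_human_review"
    return "released_to_user"

print(release(Draft("d1", "financial", "Your loan is approved...")))  # -> pending_human_review
print(release(Draft("d2", "smalltalk", "Happy to help!")))            # -> released_to_user
```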

Why it is AI-native: Traditional software is deterministic—outputs are either correct or they have a bug. AI outputs are probabilistic and can be plausibly wrong—meaning incorrect outputs pass automated checks and only surface via human review or user complaints.

Adoption evidence

~89% of organizations with LLMs in production agree that having a human in the loop is important to some degree (Unisphere Research, 2024). Top-performing AI teams are significantly more likely to have explicit processes defining when model outputs require human validation before reaching end users (common in healthcare, legal, finance).

X-driven methods it activates: Metric-Driven (review rate, agreement rate) · Observability-Driven (escalation traces) · Data-Driven (human decisions as training signal)

Key tools: Scale AI · Humanloop · Labelbox · Argilla · Prodigy · Custom annotation queues

Adopt when: High-stakes outputs (medical, legal, financial); irreversible actions; compliance or audit requirements
Traps: Review queue becomes a bottleneck with no SLA; reviewers' decisions aren't logged as training signal; HITL added as afterthought with no natural integration point

Part II — The Cross-Cutting Meta Approach: Spec-Driven

Spec-driven development is not one phase—it is a progression that runs through all four phases of the AI SDLC. It is the answer to the question: how do we make our target behavior explicit, checkable, and enforceable?

The key move is: make specs executable. A spec that cannot be checked is just a hope.
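
A minimal sketch of an executable spec, assuming Pydantic v2; the `RefundDecision` fields and limits are invented for illustration. The point is that the spec rejects output rather than merely describing it:

```python
from pydantic import BaseModel, Field, ValidationError

class RefundDecision(BaseModel):
    """Executable spec: the model's output must parse into this, or it is rejected."""
    approved: bool
    amount_usd: float = Field(ge=0, le=500)  # spec: refunds are capped at $500
    reason: str = Field(min_length=10)       # spec: every decision must be justified

raw = '{"approved": true, "amount_usd": 120.0, "reason": "Within the 30-day window."}'

try:
    decision = RefundDecision.model_validate_json(raw)
    print("spec satisfied:", decision)
except ValidationError as err:
    print("spec violated, rejecting output:", err)
```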

The Spec Ladder

How Spec Maps to the X-Driven Methods

One-sentence summary: Spec starts as intent in Discover, hardens into contracts in Build, becomes enforceable tests in Verify, and becomes "what must remain true" in Operate.


Part III — The 10 X-Driven Methods: Implementation Layer

The meta approaches define what you're optimizing for. The X-driven methods are how you iterate within each phase. Each method names the primary artifact you build and loop against.

| # | Method | Phase | Primary Artifact | Meta Approach it Serves |
| --- | --- | --- | --- | --- |
| 1 | Prompt-Driven | Discover | System prompt + registry | Spec-Driven (narrative) |
| 2 | Example-Driven | Discover | Golden input/output set | Spec-Driven (behavioral) · Eval-First |
| 3 | Contract/Schema-Driven | Build | JSON Schema / typed model | Spec-Driven (formal) · Multi-Model |
| 4 | Tool/API-Driven | Build | Function schemas + contracts | Spec-Driven (formal) · HITL |
| 5 | Workflow/Graph-Driven | Build | Step graph + state schema | Spec-Driven (formal) · HITL |
| 6 | Retrieval-Driven (RAG) | Build | Retriever config + corpus | Grounding-First |
| 7 | Data & Training-Driven | Build | Dataset versions + training spec | Spec-Driven (behavioral) |
| 8 | Evaluation-Driven (EDD) | Verify | Eval suite + CI gate | Eval-First · Spec-Driven (executable) |
| 9 | Metric-Driven | Verify | KPI dashboards + A/B tests | Eval-First · Multi-Model routing |
| 10 | Trace/Observability-Driven | Operate | Trace schema + replay harness | Observability-First · HITL |

Choosing Your Stack

Start with the meta approach that matches your biggest failure mode, then activate the X-driven methods it requires.


Popularity & Maturity at a Glance

Based on 2024–2025 industry surveys, arXiv literature, and tool adoption data:


Ecosystem Map


Summary

The most important insight from the 2024–2025 data is that meta approaches and X-driven methods are not the same thing—and conflating them is the most common source of strategic confusion in AI teams.

An AI team that adopts Grounding-First knows why they're building a retrieval pipeline (accuracy guarantees) and can reason about trade-offs (cost vs. latency vs. freshness). A team that just "adds RAG" does not.

An AI team that adopts Eval-First treats evals as a product requirement, not a debugging tool. A team that just "has some evals" does not.

The path to production-grade AI is:

  1. Choose your meta approach based on your biggest failure mode
  2. Activate the X-driven methods it requires
  3. Harden your specs from narrative → formal → executable as you move through the lifecycle
  4. Let production traces feed back into your eval suite—closing the loop

Further Reading

  • EDDOps (Eval-First): Xia et al., Evaluation-Driven Development and Operations of LLM Agents, arXiv:2411.13768, Nov 2024 / Nov 2025
  • Enterprise RAG (Grounding-First): Menlo Ventures, 2024 State of Generative AI in the Enterprise
  • Vector DB Growth: Databricks, State of AI: Enterprise Adoption & Growth Trends
  • RAG Market Forecast: Grand View Research, Retrieval Augmented Generation Market Report, 2024
  • Multi-Model Routing: Kong Enterprise Survey, LLM Adoption Statistics, 2025
  • HITL + Governance: Unisphere Research / Graphwise, State of Play on LLM and RAG, Dec 2024
  • Security: OWASP, Top 10 for Large Language Model Applications
  • Prompting: Anthropic, Prompt Engineering Overview