Schema-First RAG with Eval-Gated Grounding and Claim-Card Provenance

7 min read
Vadim Nicolai
Senior Software Engineer

This article documents a production-grade architecture for generating research-grounded therapeutic content. The system prioritizes verifiable artifacts (papers → structured extracts → scored outputs → claim cards) over unstructured text.

You can treat this as a “trust pipeline”: retrieve → normalize → extract → score → repair → persist → generate.
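A minimal TypeScript sketch of those stage boundaries, assuming illustrative type and function names rather than the repository's actual ones:

```ts
// Hypothetical stage signatures for the trust pipeline; names and fields are illustrative.
type RawHit = { source: string; title: string; doi?: string };
type Paper = { id: string; title: string; doi: string | null; abstract: string };
type Extract = { paperId: string; keyFindings: string[]; relevanceScore: number };
type ScoredExtract = Extract & {
  faithfulness: number;
  grounding: number;
  passed: boolean;
  feedback?: string; // scorer explanation, reused by the repair step
};

declare function retrieve(query: string): Promise<RawHit[]>;              // query sources in parallel
declare function normalize(hits: RawHit[]): Promise<Paper[]>;             // dedupe + fetch full details
declare function extract(paper: Paper): Promise<Extract>;                 // schema-constrained LLM call
declare function score(e: Extract, paper: Paper): Promise<ScoredExtract>; // eval gates
declare function repair(failed: ScoredExtract): Promise<Extract>;         // single feedback-driven retry
declare function persist(artifact: ScoredExtract): Promise<void>;         // save with eval traces
declare function generate(artifacts: ScoredExtract[]): Promise<string>;   // content from artifacts only
```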


System map

Core idea: Mastra coordinates agents and workflows. The workflow produces validated research artifacts. The agent generates content from those artifacts, not from raw model guesses.
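A rough sketch of that coordination, assuming Mastra's `Mastra` and `Agent` exports plus the Vercel AI SDK model helper; the agent name, instructions, model choice, and import paths below are placeholders, not the project's actual configuration:

```ts
import { Mastra } from "@mastra/core";
import { Agent } from "@mastra/core/agent";
import { openai } from "@ai-sdk/openai";
// The research workflow is assumed to be defined elsewhere (see the pipeline sketches below).
import { researchWorkflow } from "./workflows/research";

// The agent generates content strictly from validated artifacts passed into its context.
const contentAgent = new Agent({
  name: "content-agent",
  instructions:
    "Write research-grounded therapeutic content using only the validated artifacts provided.",
  model: openai("gpt-4o"),
});

// Mastra wires the two halves together: the workflow produces artifacts, the agent consumes them.
export const mastra = new Mastra({
  workflows: { researchWorkflow },
  agents: { contentAgent },
});
```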


Top-down runtime flows

A) Research artifact production

Flow: App → Research Workflow → Sources → LLM → Eval Gates → Storage

Steps:

  1. Search & Retrieve - Query multiple research sources in parallel
  2. Normalize - Deduplicate and fetch full details
  3. Extract - LLM generates structured data via schema
  4. Score - Eval gates check faithfulness and grounding
  5. Repair - If score fails, repair with feedback and re-score
  6. Persist - Save validated artifacts with eval traces
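Reusing the stage signatures sketched earlier, the whole flow is roughly this (illustrative, not the repository's actual workflow code):

```ts
async function produceResearchArtifacts(query: string): Promise<ScoredExtract[]> {
  // 1–2. Search sources in parallel, then normalize and dedupe.
  const hits = await retrieve(query);
  const papers = await normalize(hits);

  const accepted: ScoredExtract[] = [];
  for (const paper of papers) {
    // 3–4. Schema-constrained extraction, then eval scoring.
    let candidate = await extract(paper);
    let scored = await score(candidate, paper);

    // 5. If the gates fail, run exactly one repair pass with scorer feedback and re-score.
    if (!scored.passed) {
      candidate = await repair(scored);
      scored = await score(candidate, paper);
    }

    // 6. Persist only artifacts that pass; eval traces travel with them.
    if (scored.passed) {
      await persist(scored);
      accepted.push(scored);
    }
  }
  return accepted;
}
```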

B) Content generation from validated artifacts

Guarantees:

  • The agent retrieves only accepted artifacts (those that passed the eval gates).
  • Every output can attach provenance: artifact_ids_used[], scorer_versions, model_id, timestamp.
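A sketch of that provenance envelope; the field names follow the list above, while the surrounding types are illustrative:

```ts
// Provenance attached to every generated output.
interface OutputProvenance {
  artifact_ids_used: string[];             // accepted artifacts the generation drew on
  scorer_versions: Record<string, string>; // scorer name → version used at persist time
  model_id: string;                        // model that produced the content
  timestamp: string;                       // ISO-8601 generation time
}

interface GeneratedContent {
  text: string;
  provenance: OutputProvenance;
}
```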

Architecture layers


Why this pipeline works

1) Schema-first extraction creates controllable artifacts

You treat every extraction as a typed object with invariants:

  • bounded arrays (keyFindings length constraints)
  • numeric ranges (relevanceScore, extractionConfidence)
  • explicit nullability for missing fields

This prevents “string soup” from leaking into persistence and gives the eval gates a fixed, typed surface to score against.
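A minimal Zod sketch of such a schema; the field names and bounds are illustrative, not the repository's actual contract:

```ts
import { z } from "zod";

// Typed extraction contract: bounded arrays, numeric ranges, explicit nullability.
const paperExtractSchema = z.object({
  title: z.string().min(1),
  doi: z.string().nullable(),                      // explicitly nullable when the source has no DOI
  keyFindings: z.array(z.string()).min(1).max(5),  // bounded array
  relevanceScore: z.number().min(0).max(1),        // numeric range
  extractionConfidence: z.number().min(0).max(1),
});

type PaperExtract = z.infer<typeof paperExtractSchema>;

// Anything that fails parsing never reaches persistence or the eval gates.
function validateExtract(raw: unknown): PaperExtract {
  return paperExtractSchema.parse(raw);
}
```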

2) Multi-source + dedupe optimizes coverage and spend

Retrieval stays cheap; judgment stays expensive. So you:

  1. search multiple sources
  2. normalize identity (DOI/title fingerprint)
  3. dedupe
  4. only then pay tokens for extraction + scoring
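A sketch of the identity-normalization and dedupe step; the fingerprint heuristic here is illustrative:

```ts
// Prefer the DOI as identity; fall back to a normalized title fingerprint.
function fingerprint(p: { doi: string | null; title: string }): string {
  if (p.doi) return `doi:${p.doi.trim().toLowerCase()}`;
  const title = p.title
    .toLowerCase()
    .replace(/[^a-z0-9 ]/g, "") // strip punctuation
    .replace(/\s+/g, " ")
    .trim();
  return `title:${title}`;
}

// Dedupe before any tokens are spent on extraction or scoring.
function dedupe<T extends { doi: string | null; title: string }>(papers: T[]): T[] {
  const byKey = new Map<string, T>();
  for (const p of papers) {
    const key = fingerprint(p);
    if (!byKey.has(key)) byKey.set(key, p);
  }
  return [...byKey.values()];
}
```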

3) Eval gates + single repair pass keep trust high

You treat extraction as an untrusted build artifact:

  • run tests (scorers)
  • if failing: run a single repair step with feedback
  • re-test
  • persist only on pass
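In code, the gate is just a threshold check over scorer output; a failed check triggers the single repair pass shown in the orchestration sketch above. The thresholds here are illustrative placeholders:

```ts
// Illustrative gate thresholds; real cutoffs live with the scorer configuration.
const GATES = { faithfulness: 0.8, grounding: 0.75 } as const;

interface ScoreReport {
  faithfulness: number; // does the extract stay true to the paper?
  grounding: number;    // is every statement traceable to retrieved text?
  feedback: string;     // scorer explanation, reused as repair input
}

function passesGates(report: ScoreReport): boolean {
  return (
    report.faithfulness >= GATES.faithfulness &&
    report.grounding >= GATES.grounding
  );
}
```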

Claim cards: auditable statement-level evidence

Claim cards attach evidence to atomic claims and preserve provenance.

Operational outcome: you can enforce product rules like:

  • “insufficient evidence” → soften language + add uncertainty label
  • “contradicted/mixed” → present tradeoffs or avoid recommendation
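A sketch of a claim card and the status-driven rendering rules; the statuses and wording are illustrative:

```ts
type EvidenceStatus = "supported" | "insufficient" | "mixed" | "contradicted";

// One atomic claim, the evidence behind it, and the provenance needed to audit it.
interface ClaimCard {
  claim: string; // atomic, independently checkable statement
  status: EvidenceStatus;
  evidence: Array<{
    artifactId: string; // validated artifact that supports or disputes the claim
    quote: string;      // the span the claim rests on
  }>;
  scorerVersions: Record<string, string>;
}

// Product rules keyed on evidence status, mirroring the examples above.
function renderClaim(card: ClaimCard): string {
  switch (card.status) {
    case "supported":
      return card.claim;
    case "insufficient":
      return `Preliminary evidence suggests: ${card.claim} (evidence is limited).`; // soften + uncertainty label
    case "mixed":
    case "contradicted":
      return `Findings are mixed on this point: ${card.claim}. Consider the tradeoffs before acting.`; // no recommendation
  }
}
```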

Notes ingestion as first-class input

The system can treat a curated note (example: “state-of-remote-work”) as:

  • an input context object (goal framing, assumptions, topic scope)
  • a retrieval seed (keywords for paper search)
  • an artifact to index for later retrieval
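A sketch of those three roles; the shape and helper names are illustrative (“state-of-remote-work” is the article's example slug):

```ts
interface CuratedNote {
  slug: string;          // e.g. "state-of-remote-work"
  goal: string;          // goal framing for generation
  assumptions: string[];
  topicScope: string[];
  body: string;
}

// Role 1: context object handed to the content agent alongside validated artifacts.
function asContext(note: CuratedNote) {
  return { goal: note.goal, assumptions: note.assumptions, scope: note.topicScope };
}

// Role 2: retrieval seed, keywords fed into the paper search step.
function asSearchSeeds(note: CuratedNote): string[] {
  return note.topicScope;
}

// Role 3: artifact indexed for later retrieval in its own right.
function asIndexableArtifact(note: CuratedNote) {
  return { id: `note:${note.slug}`, kind: "note" as const, text: note.body };
}
```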

Reference implementation

Use the “research-thera” repository as the canonical layout for:

  • app runtime (client + server boundaries)
  • persistence (LibSQL/Turso + migrations)
  • research pipeline wiring (workflow steps + tools)
  • artifact schema + eval traces + indexing strategy

The repository layout separates these responsibilities clearly:

  • app/ and src/ for runtime surfaces
  • schema/ and migrations tooling for storage contracts
  • scripts/ for ingestion/backfills
  • cached HTTP responses for repeatable research runs (when enabled)
