
LangSmith Prompt Management

· 13 min read
Vadim Nicolai
Senior Software Engineer

In the rapidly evolving landscape of Large Language Model (LLM) applications, prompt engineering has emerged as a critical discipline. As teams scale their AI applications, managing prompts across different versions, environments, and use cases becomes increasingly complex. This is where LangSmith's prompt management capabilities shine.

LangSmith, developed by LangChain, provides a comprehensive platform for managing, versioning, and collaborating on prompts—effectively bringing software engineering best practices to the world of prompt engineering.


The Challenge of Prompt Management

Why Prompt Management Matters

Prompts are the primary interface between your application and LLMs. As your AI application grows, you'll face several challenges:

  1. Version Control: Tracking changes to prompts over time and understanding what worked and what didn't
  2. Collaboration: Multiple team members working on prompts simultaneously
  3. Environment Management: Different prompts for development, staging, and production
  4. Performance Tracking: Understanding which prompt variations perform best
  5. Reusability: Sharing successful prompts across different projects and teams
  6. Rollback Safety: Ability to quickly revert to previous versions when something breaks

Without proper prompt management, these challenges can lead to:

  • Lost productivity from recreating prompts that were deleted or modified
  • Difficulty debugging when prompt changes cause unexpected behavior
  • Lack of visibility into what prompts are being used in production
  • Inability to A/B test different prompt variations systematically

LangSmith Prompt Management Features

1. Git-Like Version Control

LangSmith treats prompts as versioned artifacts, similar to how Git manages code. Each change to a prompt creates a new commit with a unique hash.

Key Features:

  • Commit History: Every modification is tracked with metadata
  • Diff Viewing: Compare versions to see exactly what changed
  • Rollback: Revert to any previous version instantly
  • Branching: Work on experimental prompts without affecting production
// Example from the codebase
export async function fetchLangSmithPromptCommit(
  promptIdentifier: string,
  options?: { includeModel?: boolean },
): Promise<LangSmithPromptCommit> {
  const client = getLangSmithClient();

  const commit = await client.pullPromptCommit(promptIdentifier, options);

  return {
    owner: commit.owner,
    promptName: commit.repo,
    commitHash: commit.commit_hash,
    manifest: commit.manifest,
    examples: commit.examples,
  };
}

2. Collaborative Prompt Repositories

LangSmith organizes prompts into repositories, enabling team collaboration:

  • Public Prompts: Share prompts with the broader community
  • Private Prompts: Keep proprietary prompts within your organization
  • Social Features: Like, download, and fork popular prompts
  • Discovery: Search and explore prompts created by others
// Listing prompts with filters
export async function listLangSmithPrompts(options?: {
  isPublic?: boolean;
  isArchived?: boolean;
  query?: string;
}): Promise<LangSmithPrompt[]> {
  const client = getLangSmithClient();
  const prompts: LangSmithPrompt[] = [];

  for await (const prompt of client.listPrompts(options)) {
    prompts.push({
      id: prompt.id,
      fullName: prompt.full_name,
      isPublic: prompt.is_public,
      tags: prompt.tags,
      numLikes: prompt.num_likes,
      numDownloads: prompt.num_downloads,
      // ... other metadata
    });
  }

  return prompts;
}

3. Rich Metadata and Organization

Prompts in LangSmith can be enriched with metadata:

  • Tags: Categorize prompts by use case, domain, or team
  • Descriptions: Document the purpose and usage
  • README: Provide comprehensive documentation
  • Examples: Include sample inputs and outputs
type LangSmithPrompt {
  id: String!
  promptHandle: String!
  fullName: String!
  description: String
  readme: String
  tags: [String!]!
  numCommits: Int!
  lastCommitHash: String
  createdAt: String!
  updatedAt: String!
}

4. User Ownership and Access Control

LangSmith implements robust ownership and permission models:

export function ensureUserPromptIdentifier(
  promptIdentifier: string,
  userEmail: string,
): string {
  const userHandle = toUserHandle(userEmail);

  // Prompts are namespaced by owner
  if (promptIdentifier.includes("/")) {
    const [owner] = promptIdentifier.split("/");

    // Validate owner matches user
    if (owner !== userHandle) {
      throw new Error(`Prompt identifier owner does not match your handle`);
    }
    return promptIdentifier;
  }

  // Add user's handle as prefix
  return `${userHandle}/${promptIdentifier}`;
}

5. Integration with LLM Workflows

LangSmith prompts integrate seamlessly into your application:

// Create a prompt
const prompt = await createLangSmithPrompt("user/customer-support-classifier", {
  description: "Classifies customer support tickets",
  tags: ["support", "classification", "production"],
  isPublic: false,
});

// Push a new version
await pushLangSmithPrompt("user/customer-support-classifier", {
  object: {
    template: "Classify this ticket: {ticket_text}",
    model: "gpt-4",
    temperature: 0.2,
  },
  description: "Improved classification accuracy",
});

Best Practices for Prompt Management

1. Version Your Prompts Deliberately

Treat prompt changes with the same rigor as code changes:

  • Semantic Versioning: Use meaningful version identifiers
  • Commit Messages: Write clear descriptions of what changed and why
  • Test Before Merging: Validate prompts in development before promoting to production
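
As a concrete illustration, a deliberate version bump might look like the sketch below. It reuses the pushLangSmithPrompt wrapper shown earlier; the version tag and "commit message" conventions are suggestions rather than anything LangSmith requires.

// Hypothetical sketch: promote a tested prompt revision with a clear "commit message"
await pushLangSmithPrompt("user/customer-support-classifier", {
  object: {
    template:
      "Classify this support ticket as billing, technical, or other: {ticket_text}",
    model: "gpt-4",
    temperature: 0.1,
  },
  // Acts like a commit message: say what changed and why
  description: "v1.3.0: added explicit category list to reduce 'other' misclassifications",
  tags: ["v1.3.0", "env:staging", "status:experimental"],
});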

2. Use Tags Strategically

Tags are powerful for organization and filtering:

const productionTags = [
  "env:production",
  "use-case:classification",
  "model:gpt-4",
  "team:support",
  "user:alice@example.com",
];

Recommended Tag Categories:

  • Environment: env:dev, env:staging, env:prod
  • Use Case: use-case:summarization, use-case:extraction
  • Model: model:gpt-4, model:claude-3
  • Owner: owner:team-name, user:email@domain.com
  • Status: status:experimental, status:stable

3. Document Your Prompts

Good documentation is essential for team collaboration:

await createLangSmithPrompt("user/email-generator", {
  description: "Generates professional email responses",
  readme: `
# Email Generator Prompt

## Purpose
Generates professional, context-aware email responses for customer inquiries.

## Input Format
- customer_name: String
- inquiry_type: "support" | "sales" | "billing"
- context: String (previous conversation)

## Output Format
Professional email in plain text

## Performance Notes
- Works best with GPT-4
- Temperature: 0.7 for natural variation
- Max tokens: 500

## Examples
See attached examples for common scenarios.
`,
  tags: ["communication", "email", "production"],
});

4. Implement User-Based Filtering

When building multi-tenant applications, filter prompts by user:

// From the GraphQL resolver
const userPrompts = allPrompts.filter((prompt) => {
  // Check user tags
  const hasUserTag = prompt.tags.some(
    (tag) =>
      tag.includes(`user:${context.userEmail}`) ||
      tag.includes(`owner:${context.userEmail}`),
  );

  // Check if owner matches
  const isOwner = prompt.owner === context.userEmail;

  return hasUserTag || isOwner;
});

5. Handle Permissions Gracefully

LangSmith requires specific API permissions. Handle errors clearly:

export async function createLangSmithPrompt(
  promptIdentifier: string,
  input?: CreateLangSmithPromptInput,
): Promise<LangSmithPrompt> {
  const client = getLangSmithClient();

  try {
    return await client.createPrompt(promptIdentifier, input);
  } catch (error: any) {
    if (
      error?.message?.includes("403") ||
      error?.message?.includes("Forbidden")
    ) {
      throw new Error(
        "LangSmith API key lacks 'Prompt Engineering' permissions. " +
          "Please generate a new API key with Read, Write, AND " +
          "Prompt Engineering scopes.",
      );
    }
    throw error;
  }
}

6. Use the Hub for Reusable Prompts

LangSmith Hub allows sharing prompts across projects:

  • Fork Popular Prompts: Start with community-tested templates
  • Share Successful Patterns: Contribute back to the community
  • Cross-Project Reuse: Reference the same prompt from multiple applications

Architectural Patterns

Pattern 1: GraphQL Wrapper for Type Safety

Wrap the LangSmith SDK with a GraphQL layer for type-safe client access:

extend type Mutation {
  pushLangSmithPrompt(
    promptIdentifier: String!
    input: PushLangSmithPromptInput
  ): String!
}

input PushLangSmithPromptInput {
  object: JSON
  parentCommitHash: String
  description: String
  tags: [String!]
  isPublic: Boolean
}

Benefits:

  • Type safety across frontend and backend
  • Centralized permission checking
  • Consistent error handling
  • Easy to mock for testing
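
A resolver behind such a schema might look roughly like the sketch below. It assumes a standard GraphQL resolver signature and the listLangSmithPrompts helper from earlier; the field and context names are illustrative.

// Minimal resolver sketch (illustrative; reuses the listLangSmithPrompts helper above)
export const resolvers = {
  Query: {
    langSmithPrompts: async (
      _parent: unknown,
      args: { query?: string; isPublic?: boolean },
      context: { userEmail: string },
    ) => {
      // Centralized permission checking: only return prompts tagged for this user
      const prompts = await listLangSmithPrompts(args);
      return prompts.filter((p) =>
        p.tags.some((tag) => tag === `user:${context.userEmail}`),
      );
    },
  },
};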

Pattern 2: Singleton Client Pattern

Use a singleton to manage the LangSmith client:

let singleton: Client | null = null;

export function getLangSmithClient(): Client {
  if (!singleton) {
    const apiKey = process.env.LANGSMITH_API_KEY;
    if (!apiKey) {
      throw new Error("LANGSMITH_API_KEY required");
    }
    singleton = new Client({ apiKey });
  }
  return singleton;
}

Benefits:

  • Single point of configuration
  • Connection pooling
  • Consistent client instance across requests

Pattern 3: User Namespace Enforcement

Automatically namespace prompts by user to prevent conflicts:

export function toUserHandle(userEmail: string): string {
  return userEmail
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9@._-]+/g, "-")
    .replace(/@/g, "-at-")
    .replace(/\./g, "-");
}

// alice@company.com -> alice-at-company-com/my-prompt
const identifier = `${toUserHandle(userEmail)}/${promptName}`;

Integration with Your Application

Step 1: Environment Setup

# .env.local
LANGSMITH_API_KEY=lsv2_pt_...
LANGSMITH_API_URL=https://api.smith.langchain.com

Step 2: Client Initialization

import { Client } from "langsmith";

const client = new Client({
  apiKey: process.env.LANGSMITH_API_KEY,
  apiUrl: process.env.LANGSMITH_API_URL,
});

Step 3: Create and Version Prompts

// Initial creation
await client.pushPrompt("my-org/summarizer", {
  object: {
    _type: "prompt",
    input_variables: ["text"],
    template: "Summarize: {text}",
  },
  tags: ["v1", "production"],
  isPublic: false,
});

// Update with new version
await client.pushPrompt("my-org/summarizer", {
  object: {
    _type: "prompt",
    input_variables: ["text", "style"],
    template: "Summarize in {style} style: {text}",
  },
  tags: ["v2", "production"],
  isPublic: false,
});

Step 4: Retrieve and Use Prompts

// Get latest version
const commit = await client.pullPromptCommit("my-org/summarizer");

// Use the prompt
const template = commit.manifest.template;
const rendered = template
  .replace("{text}", article)
  .replace("{style}", "concise");

Monitoring and Analytics

LangSmith provides built-in analytics for prompts:

  • Usage Tracking: See how often each prompt is used
  • Performance Metrics: Track latency and success rates
  • Version Comparison: Compare metrics across versions
  • Cost Analysis: Monitor token usage per prompt
interface LangSmithPrompt {
  numViews: number;
  numDownloads: number;
  numLikes: number;
  numCommits: number;
  lastUsedAt?: string;
}

Common Pitfalls and Solutions

Pitfall 1: Missing Permissions

Problem: API key doesn't have "Prompt Engineering" scope

Solution:

// Always check permissions and provide clear error messages
if (error?.message?.includes("403")) {
  throw new Error(
    "Generate API key with Prompt Engineering permissions at " +
      "https://smith.langchain.com/settings",
  );
}

Pitfall 2: Unbounded Listing

Problem: Listing all prompts can be slow or timeout

Solution:

// Limit results and provide pagination
const MAX_PROMPTS = 100;
for await (const prompt of client.listPrompts(options)) {
  prompts.push(prompt);
  if (prompts.length >= MAX_PROMPTS) break;
}

Pitfall 3: Naming Conflicts

Problem: Multiple users trying to create prompts with same name

Solution:

// Always namespace by user/organization
const namespaced = `${organization}/${promptName}`;
await client.createPrompt(namespaced, options);

Advanced Features

Prompt Playground Integration

LangSmith provides a visual playground for testing prompts:

  1. Edit prompt templates interactively
  2. Test with sample inputs
  3. Compare outputs across models
  4. Iterate quickly without code changes
  5. Save successful variations as new commits

Example Sets

Attach example inputs/outputs to prompts:

await client.pushPrompt("classifier", {
  object: promptTemplate,
  examples: [
    {
      inputs: { text: "I love this product!" },
      outputs: { sentiment: "positive" },
    },
    {
      inputs: { text: "Terrible experience" },
      outputs: { sentiment: "negative" },
    },
  ],
});

Labels for Deployment Stages

Use labels to mark deployment stages:

// Tag specific commits for each environment
await client.updatePrompt("my-prompt", {
  tags: ["production", "v2.1.0"],
});

Migration Strategy

Migrating from Hardcoded Prompts

  1. Audit: Identify all prompts in your codebase
  2. Extract: Move prompts to LangSmith
  3. Refactor: Replace hardcoded strings with LangSmith fetches
  4. Test: Validate behavior matches
  5. Deploy: Roll out gradually with feature flags

Example migration:

// Before
const prompt = "Summarize this text: {text}";

// After
const commit = await client.pullPromptCommit("my-org/summarizer");
const prompt = commit.manifest.template;

Migrating from Custom Storage

  1. Export: Extract prompts from your current system
  2. Bulk Create: Use LangSmith API to create prompts
  3. Preserve History: Import version history as commits
  4. Update References: Point code to LangSmith
  5. Deprecate: Phase out old system
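
A bulk-import step (point 2 above) could look roughly like this sketch. The ExportedPrompt shape and importPrompts helper are assumptions about your legacy export format; the calls reuse the createLangSmithPrompt and pushLangSmithPrompt wrappers from earlier.

// Hypothetical bulk-import sketch; the exported record shape is an assumption
type ExportedPrompt = {
  name: string;
  template: string;
  description?: string;
  tags?: string[];
};

async function importPrompts(exportedPrompts: ExportedPrompt[], owner: string) {
  for (const p of exportedPrompts) {
    const identifier = `${owner}/${p.name}`;
    // Create the prompt, then push its current template as the first commit
    await createLangSmithPrompt(identifier, {
      description: p.description,
      tags: p.tags,
      isPublic: false,
    });
    await pushLangSmithPrompt(identifier, {
      object: { _type: "prompt", input_variables: [], template: p.template },
      description: "Imported from legacy prompt store",
    });
  }
}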

Conclusion

LangSmith's prompt management capabilities bring professional software engineering practices to the world of LLM applications. By treating prompts as versioned, collaborative artifacts, teams can:

  • Move Faster: Test and iterate on prompts without fear
  • Collaborate Better: Work together on prompts with clear ownership
  • Deploy Safely: Roll back problematic changes instantly
  • Scale Confidently: Manage hundreds of prompts across projects
  • Share Knowledge: Learn from community-tested patterns

As AI applications grow in complexity, proper prompt management becomes not just a nice-to-have but a necessity. LangSmith provides the infrastructure to manage this critical aspect of your AI stack.


Additional Resources

Langfuse Features: Prompts, Tracing, Scores, Usage

· 11 min read
Vadim Nicolai
Senior Software Engineer

A comprehensive guide to implementing Langfuse features for production-ready AI applications, covering prompt management, tracing, evaluation, and observability.

Overview

This guide covers:

  • Prompt management with caching and versioning
  • Distributed tracing with OpenTelemetry
  • User feedback and scoring
  • Usage tracking and analytics
  • A/B testing and experimentation

OpenRouter Integration with DeepSeek

· 9 min read
Vadim Nicolai
Senior Software Engineer

This article documents the complete OpenRouter integration implemented in Nomadically.work, using DeepSeek models exclusively through a unified API.

Architecture Overview

Module Structure

Core Features

1. Provider Configuration

The provider layer handles OpenRouter API communication using the OpenAI SDK compatibility layer.

Implementation Details:

  • Uses @ai-sdk/openai package for API compatibility
  • Lazy-loaded provider instance to support testing without API key
  • Configurable reasoning tokens (default: 10,000 max_tokens)
  • Custom headers for analytics and tracking
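
A provider of this shape could be built roughly as follows, using the OpenAI-compatible layer of @ai-sdk/openai pointed at OpenRouter. The helper name is illustrative (the repo's createOpenRouter may differ), the reasoning-token configuration is omitted here, and the attribution headers follow OpenRouter's documented conventions.

import { createOpenAI } from "@ai-sdk/openai";

// Sketch: OpenRouter provider via the OpenAI SDK compatibility layer
export function createOpenRouterProvider() {
  const apiKey = process.env.OPENROUTER_API_KEY;
  if (!apiKey) throw new Error("OPENROUTER_API_KEY is required");

  return createOpenAI({
    apiKey,
    baseURL: "https://openrouter.ai/api/v1",
    headers: {
      // Optional attribution headers for OpenRouter analytics
      "HTTP-Referer": process.env.OPENROUTER_SITE_URL ?? "",
      "X-Title": process.env.OPENROUTER_SITE_NAME ?? "",
    },
  });
}

// Usage: const model = createOpenRouterProvider()("deepseek/deepseek-chat");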

2. DeepSeek Model Access

Five DeepSeek models are available through the integration:

Model Selection Guide:

  • DeepSeek Chat: General-purpose conversations, Q&A, text generation
  • DeepSeek R1: Complex reasoning, multi-step analysis, decision-making
  • DeepSeek Coder: Code generation, debugging, technical documentation
  • R1 Distill Qwen 32B: Faster inference for reasoning tasks
  • R1 Distill Llama 70B: High-quality reasoning with better performance

3. Agent Creation Patterns

Three patterns for creating agents with different levels of abstraction:

Pattern Comparison:

| Pattern   | Use Case                           | Flexibility | Setup Time |
|-----------|------------------------------------|-------------|------------|
| Templates | Quick prototyping, demos           | Low         | Seconds    |
| Helpers   | Standard agents with custom config | Medium      | Minutes    |
| Direct    | Advanced use cases, full control   | High        | Minutes    |

4. Agent Template Flow

5. Configuration System

Usage Examples

Basic Agent Creation

import { agentTemplates } from "@/openrouter";

// Quick start with template
const assistant = agentTemplates.assistant();

const response = await assistant.generate([
  { role: "user", content: "What are remote work benefits?" },
]);

Custom Agent with Specific Model

import { createChatAgent, deepseekModels } from "@/openrouter";
import { Agent } from "@mastra/core/agent";

// Using helper function
const jobClassifier = createChatAgent({
  id: "job-classifier",
  name: "Job Classifier",
  instructions: "You are an expert at classifying job postings.",
  model: "chat",
});

// Or using a model directly
const reasoningAgent = new Agent({
  model: deepseekModels.r1(),
  name: "Reasoning Agent",
  instructions: "Think step by step about complex problems.",
});

Advanced Configuration

import { createOpenRouter, DEEPSEEK_MODELS } from "@/openrouter";

const customProvider = createOpenRouter({
  reasoning: {
    max_tokens: 15000,
  },
  headers: {
    "HTTP-Referer": "https://nomadically.work",
    "X-Title": "Job Platform AI",
  },
});

const model = customProvider(DEEPSEEK_MODELS.R1);

Data Flow

Request Flow

Error Handling Flow

Integration Points

Mastra Agent Integration

Environment Configuration

Required Variables

# Core configuration
OPENROUTER_API_KEY=sk-or-v1-your-api-key-here

# Optional configuration
OPENROUTER_SITE_NAME="Nomadically.work"
OPENROUTER_SITE_URL="https://nomadically.work"

Deployment Flow

Performance Characteristics

Model Comparison

Benefits

OpenRouter Advantages

Testing Strategy

Test Coverage

Run tests with:

pnpm test:openrouter

Type Safety

TypeScript Types

Migration Path

From Direct DeepSeek SDK

Resources

Summary

This OpenRouter integration provides:

  • Unified API Access - Single interface for all DeepSeek models
  • Type-Safe - Full TypeScript support with compile-time validation
  • Flexible - Three levels of abstraction for different use cases
  • Production-Ready - Error handling, fallbacks, and monitoring
  • Well-Tested - Comprehensive test suite with live API validation
  • Well-Documented - Complete examples and migration guides

The module is designed for scalability, maintainability, and developer experience while providing reliable access to state-of-the-art AI models through OpenRouter's infrastructure.

AI-Driven Company Enrichment with DeepSeek via Cloudflare Browser Rendering

· 4 min read
Vadim Nicolai
Senior Software Engineer

This page documents an AI-first enrichment pipeline that turns a company website into a clean, structured company profile you can safely persist into your database and expose through GraphQL.

The core idea is simple:

  • Use Cloudflare Browser Rendering /json to load a real rendered page (including JavaScript-heavy sites).
  • Use DeepSeek to convert the rendered page into a strict JSON-only object (no markdown, no prose).

High-level architecture

This pipeline has five clear layers, each with a single responsibility:

  • Entry: GraphQL mutation identifies the target company.
  • Acquisition: Browser Rendering fetches a fully rendered page.
  • Extraction: DeepSeek converts HTML into JSON-only structure.
  • Governance: validation, normalization, and audit snapshot.
  • Persistence: upserts for company + ATS boards, then return.

Classification

A single enum-like category so downstream logic can branch cleanly:

  • company.category is one of:
    • CONSULTANCY | AGENCY | STAFFING | DIRECTORY | PRODUCT | OTHER | UNKNOWN

UNKNOWN is intentionally allowed to prevent “forced certainty”.

Two links that unlock most automation:

  • company.careers_url — best official careers entrypoint (prefer internal)
  • company.linkedin_url — best LinkedIn company page (/company/...)

Hiring infrastructure

Detect ATS/job boards (useful for job syncing, vendor analytics, integrations):

  • ats_boards[] entries containing:
    • url
    • vendor
    • board_type (ats | careers_page | jobs_board)
    • confidence (0..1)
    • is_active

Provenance and uncertainty

To keep AI outputs accountable:

  • evidence — where it came from (URL) + any known fetch metadata
  • notes[] — uncertainty/caveats without polluting structured fields
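
Pulling the fields above together, the extraction result could be typed roughly like this. The field names mirror the article; the exact schema in the codebase may differ.

// Illustrative shape of the extraction result (names taken from the article)
type AtsBoard = {
  url: string;
  vendor: string | null;
  board_type: "ats" | "careers_page" | "jobs_board";
  confidence: number; // 0..1
  is_active: boolean;
};

type ExtractionResult = {
  company: {
    name: string;
    category:
      | "CONSULTANCY" | "AGENCY" | "STAFFING" | "DIRECTORY"
      | "PRODUCT" | "OTHER" | "UNKNOWN";
    careers_url: string | null; // best official careers entrypoint
    linkedin_url: string | null; // best LinkedIn company page
  };
  ats_boards: AtsBoard[];
  evidence: { source_url: string }; // provenance
  notes: string[]; // uncertainty/caveats
};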

Top-down architecture


Why Cloudflare Browser Rendering /json is the right AI boundary

Many company websites are JS-heavy (SPAs), and the key links you want (Careers, LinkedIn, ATS) often live in:

  • global navigation/header
  • footer “social” section
  • content that only appears after JS renders

The /json endpoint is designed to extract structured JSON from the rendered page, using:

  • url (or html)
  • a prompt (and optionally response_format for JSON Schema depending on provider support)
  • custom_ai to route extraction through your chosen model

For JS-heavy pages, waiting for rendering to finish matters. This is why the extractor uses:

  • gotoOptions.waitUntil = "networkidle0"
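
As a rough sketch, a call to the /json endpoint using the fields named above might look like the following. The endpoint path, env var names, and payload keys should be checked against Cloudflare's Browser Rendering documentation; they are assumptions here.

// Sketch only: verify path and payload against Cloudflare Browser Rendering docs
async function renderToJson(targetUrl: string, prompt: string) {
  const accountId = process.env.CLOUDFLARE_ACCOUNT_ID;
  const res = await fetch(
    `https://api.cloudflare.com/client/v4/accounts/${accountId}/browser-rendering/json`,
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.CLOUDFLARE_API_TOKEN}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        url: targetUrl,
        prompt, // the JSON-only extraction contract
        gotoOptions: { waitUntil: "networkidle0" }, // wait for JS-heavy pages
      }),
    },
  );
  if (!res.ok) throw new Error(`Browser Rendering failed: ${res.status}`);
  return res.json();
}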

AI contract: JSON-only output

When you route through custom_ai with BYO providers, schema-enforced responses can be provider-dependent. The safest universal strategy is:

  • treat the prompt as a strict contract
  • demand ONLY valid JSON
  • define the expected shape explicitly
  • instruct null/[] for unknown values
  • push uncertainty into notes[]

This turns an LLM into a bounded parser.


Implementation: Cloudflare-first with a direct DeepSeek fallback

Below is the same flow, expressed as architecture instead of code:

  • Inputs: company id/key and target URL.
  • Acquisition: Browser Rendering /json fetches a rendered page.
  • Extraction: DeepSeek produces a JSON-only record.
  • Governance: validate, normalize, and snapshot the output.
  • Persistence: upsert company + ATS boards, then return result.

Persistence guardrails (keep the AI safe)

Even with JSON-only output, the DB write must remain your code’s responsibility.

1) Validate shape before persistence

At minimum, verify:

  • company.name exists and is non-empty
  • any present URLs are absolute (https://...)
  • arrays are arrays
  • category is one of the allowed values

If validation fails, either retry extraction (stricter prompt) or fall back.

2) Canonicalize URLs before upserts

To avoid duplicates, normalize:

  • remove #fragment
  • normalize trailing slash
  • lowercase host
  • optionally strip tracking params
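
A minimal canonicalizer along those lines (the tracking-parameter list is illustrative):

// Minimal URL canonicalizer sketch
function canonicalizeUrl(raw: string): string {
  const url = new URL(raw);
  url.hash = ""; // remove #fragment
  url.hostname = url.hostname.toLowerCase(); // lowercase host (URL parsing already normalizes this)
  // Optionally strip common tracking params (illustrative list)
  for (const param of ["utm_source", "utm_medium", "utm_campaign", "ref"]) {
    url.searchParams.delete(param);
  }
  // Normalize trailing slash on the path
  if (url.pathname.endsWith("/") && url.pathname !== "/") {
    url.pathname = url.pathname.slice(0, -1);
  }
  return url.toString();
}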

3) Treat vendor and board_type as hints

LLMs can emit vendor variants (e.g., Smart Recruiters, smartrecruiters). Normalize before mapping to enums.

4) Always snapshot the raw extraction

Saving the full ExtractionResult into companySnapshots.extracted buys you:

  • debugging (“why did this change?”)
  • regression detection
  • prompt iteration without losing history

References

https://github.com/nicolad/nomadically.work

https://nomadically.work/

Agent Skills spec + Mastra integration

· 9 min read
Vadim Nicolai
Senior Software Engineer

Agent Skills Specification

Source: https://agentskills.io/specification

This document defines the Agent Skills format.

Directory structure

A skill is a directory containing at minimum a SKILL.md file:

skill-name/
└── SKILL.md # Required

Tip: You can optionally include additional directories such as scripts/, references/, and assets/ to support your skill.

SKILL.md format

The SKILL.md file must contain YAML frontmatter followed by Markdown content.

Frontmatter (required)

Minimal example:

---
name: skill-name
description: A description of what this skill does and when to use it.
---

With optional fields:

---
name: pdf-processing
description: Extract text and tables from PDF files, fill forms, merge documents.
license: Apache-2.0
metadata:
  author: example-org
  version: "1.0"
---
| Field         | Required | Notes |
|---------------|----------|-------|
| name          | Yes      | Max 64 characters. Lowercase letters, numbers, and hyphens only. Must not start or end with a hyphen. |
| description   | Yes      | Max 1024 characters. Non-empty. Describes what the skill does and when to use it. |
| license       | No       | License name or reference to a bundled license file. |
| compatibility | No       | Max 500 characters. Indicates environment requirements (intended product, system packages, network access, etc.). |
| metadata      | No       | Arbitrary key-value mapping for additional metadata. |
| allowed-tools | No       | Space-delimited list of pre-approved tools the skill may use. (Experimental) |

name field

The required name field:

  • Must be 1-64 characters
  • May only contain unicode lowercase alphanumeric characters and hyphens (a-z and -)
  • Must not start or end with -
  • Must not contain consecutive hyphens (--)
  • Must match the parent directory name

Valid examples:

name: pdf-processing
name: data-analysis
name: code-review

Invalid examples:

name: PDF-Processing  # uppercase not allowed
name: -pdf  # cannot start with hyphen
name: pdf--processing  # consecutive hyphens not allowed
description field

The required description field:

  • Must be 1-1024 characters
  • Should describe both what the skill does and when to use it
  • Should include specific keywords that help agents identify relevant tasks

Good example:

description: Extracts text and tables from PDF files, fills PDF forms, and merges multiple PDFs. Use when working with PDF documents or when the user mentions PDFs, forms, or document extraction.

Poor example:

description: Helps with PDFs.
license field

The optional license field:

  • Specifies the license applied to the skill
  • We recommend keeping it short (either the name of a license or the name of a bundled license file)

Example:

license: Proprietary. LICENSE.txt has complete terms
compatibility field

The optional compatibility field:

  • Must be 1-500 characters if provided
  • Should only be included if your skill has specific environment requirements
  • Can indicate intended product, required system packages, network access needs, etc.

Examples:

compatibility: Designed for Claude Code (or similar products)
compatibility: Requires git, docker, jq, and access to the internet

Note: Most skills do not need the compatibility field.

metadata field

The optional metadata field:

  • A map from string keys to string values
  • Clients can use this to store additional properties not defined by the Agent Skills spec
  • We recommend making your key names reasonably unique to avoid accidental conflicts

Example:

metadata:
  author: example-org
  version: "1.0"
allowed-tools field

The optional allowed-tools field:

  • A space-delimited list of tools that are pre-approved to run
  • Experimental. Support for this field may vary between agent implementations

Example:

allowed-tools: Bash(git:*) Bash(jq:*) Read

Body content

The Markdown body after the frontmatter contains the skill instructions. There are no format restrictions. Write whatever helps agents perform the task effectively.

Recommended sections:

  • Step-by-step instructions
  • Examples of inputs and outputs
  • Common edge cases

Note: The agent will load this entire file once it's decided to activate a skill. Consider splitting longer SKILL.md content into referenced files.

Optional directories

scripts/

Contains executable code that agents can run. Scripts should:

  • Be self-contained or clearly document dependencies
  • Include helpful error messages
  • Handle edge cases gracefully

Supported languages depend on the agent implementation. Common options include Python, Bash, and JavaScript.

references/

Contains additional documentation that agents can read when needed:

  • REFERENCE.md - Detailed technical reference
  • FORMS.md - Form templates or structured data formats
  • Domain-specific files (finance.md, legal.md, etc.)

Keep individual reference files focused. Agents load these on demand, so smaller files mean less use of context.

assets/

Contains static resources:

  • Templates (document templates, configuration templates)
  • Images (diagrams, examples)
  • Data files (lookup tables, schemas)

Progressive disclosure

Skills should be structured for efficient use of context:

  1. Metadata (~100 tokens): The name and description fields are loaded at startup for all skills
  2. Instructions (< 5000 tokens recommended): The full SKILL.md body is loaded when the skill is activated
  3. Resources (as needed): Files (e.g. those in scripts/, references/, or assets/) are loaded only when required

Keep your main SKILL.md under 500 lines. Move detailed reference material to separate files.

File references

When referencing other files in your skill, use relative paths from the skill root:

See [the reference guide](references/REFERENCE.md) for details.

Run the extraction script:
scripts/extract.py

Keep file references one level deep from SKILL.md. Avoid deeply nested reference chains.

Validation

Use the skills-ref reference library to validate your skills:

skills-ref validate ./my-skill

This checks that your SKILL.md frontmatter is valid and follows all naming conventions.


Documentation index first

The Agent Skills docs are designed to be discovered via a single index file (llms.txt). Use that as the entrypoint whenever you’re exploring the spec surface area.


What are skills?

Agent Skills are a lightweight, file-based format for packaging reusable agent instructions and workflows (plus optional scripts/assets). Agents use progressive disclosure:

  1. Discovery: load only name + description metadata
  2. Activation: load the full SKILL.md body for a matching task
  3. Execution: read references / run scripts as needed

Skill directory structure

Minimum required:

skill-name/
└── SKILL.md

Common optional directories (same convention is used by Mastra workspaces):

skill-name/
├── SKILL.md
├── references/ # extra docs (optional)
├── scripts/ # executable code (optional)
└── assets/ # templates/images/etc. (optional)

SKILL.md specification essentials

Frontmatter requirements

SKILL.md must start with YAML frontmatter with at least:

  • name (strict naming constraints; should match the folder name)
  • description (non-empty; should say what + when; include “trigger keywords”)

Optional fields defined by the spec include license, compatibility, metadata, and experimental allowed-tools.

Body content

After frontmatter: normal Markdown instructions. The spec recommends practical steps, examples, and edge cases (and keeping SKILL.md reasonably small to support progressive disclosure).

A spec-friendly template

---
name: code-review
description: Reviews code for quality, style, and potential issues. Use when asked to review PRs, diffs, TypeScript/Node projects, or linting failures.
license: Apache-2.0
compatibility: Requires node and access to repository files
metadata:
  version: "1.0.0"
  tags: "development review"
---

# Code Review

## When to use this skill
- Trigger phrases: "review this PR", "code review", "lint errors", "style guide"

## Procedure
1. Identify the change scope and risk.
2. Check for correctness, edge cases, and error handling.
3. Verify style rules in references/style-guide.md.
4. If available, run scripts/lint.ts and summarize results.

## Output format
- Summary
- Issues (by severity)
- Suggested diffs
- Follow-ups/tests

Note: Mastra’s docs show version and tags as top-level keys in frontmatter. Depending on your validator/tooling, the safest cross-implementation choice is to store extras under metadata. (mastra.ai)


Mastra integration

Mastra workspaces support skills starting in @mastra/core@1.1.0. (mastra.ai)

1) Place skills under your workspace filesystem basePath

Mastra treats skill paths as relative to the workspace filesystem basePath. (mastra.ai)

In your repo, the main workspace is configured with:

  • basePath: "./src/workspace"
  • skills: ["/skills"]

That means the actual on-disk skills folder should be:

./src/workspace/skills/
└── your-skill-name/
    └── SKILL.md

2) Configure skills on a workspace

Mastra enables discovery by setting skills on the workspace. (mastra.ai)

import { Workspace, LocalFilesystem } from "@mastra/core/workspace";

export const workspace = new Workspace({
  filesystem: new LocalFilesystem({ basePath: "./src/workspace" }),
  skills: ["/skills"],
});

You can provide multiple skill directories (still relative to basePath). (mastra.ai)

skills: [
  "/skills", // Project skills
  "/team-skills", // Shared team skills
],

3) Dynamic skill directories (context-aware)

Mastra also supports a function form for skills, so you can vary skill sets by user role, tenant, environment, etc. (mastra.ai)

skills: (context) => {
  const paths = ["/skills"];
  if (context.user?.role === "developer") paths.push("/dev-skills");
  return paths;
},

4) What Mastra does “under the hood”

When a skill is activated, its instructions are added to the conversation context and the agent can access references/scripts in that skill folder. Mastra describes the runtime flow as: (mastra.ai)

  1. List available skills in the system message
  2. Allow agents to activate skills during conversation
  3. Provide access to skill references and scripts

This maps cleanly onto the Agent Skills “discovery → activation → execution” model. (agentskills.io)

5) Skill search and indexing in Mastra

Mastra workspaces support BM25, vector, and hybrid search. (mastra.ai)

If BM25 or vector search is enabled, Mastra will automatically index skills so agents can search within skill content to find relevant instructions. (mastra.ai)

Example (BM25-only):

const workspace = new Workspace({
  filesystem: new LocalFilesystem({ basePath: "./src/workspace" }),
  skills: ["/skills"],
  bm25: true,
});

If you enable vector or hybrid search, indexing uses your embedder and vector store (and BM25 uses tokenization + term statistics). (mastra.ai)


Repo conventions that work well

  • One skill per folder, folder name matches frontmatter.name.

  • Keep SKILL.md focused on the “operator manual”; push deep theory to references/.

  • Put runnable helpers in scripts/ and make them deterministic (clear inputs/outputs).

  • Treat destructive actions as opt-in:

    • Use workspace tool gating (approval required, delete disabled) for enforcement.
    • Optionally declare allowed-tools in SKILL.md for portability across other skill runtimes. (agentskills.io)

AI-Powered Skill Extraction with Cloudflare Embeddings and a Vector Taxonomy

· 4 min read
Vadim Nicolai
Senior Software Engineer

This bulk processor extracts structured skill tags for job postings using an AI pipeline that combines:

  • Embedding generation via Cloudflare Workers AI (@cf/baai/bge-small-en-v1.5, 384-dim)
  • Vector retrieval over a skills taxonomy (Turso/libSQL index skills_taxonomy) for candidate narrowing
  • Mastra workflow orchestration for LLM-based structured extraction + validation + persistence
  • Production-grade run controls: robust logging, progress metrics, graceful shutdown, and per-item failure isolation

It’s designed for real-world runs where you expect rate limits, transient failures, and safe restarts.


Core constraint: embedding dimension ↔ vector index schema

The taxonomy retrieval layer is backed by a Turso/libSQL vector index:

  • Index name: skills_taxonomy
  • Embedding dimension (required): 384
  • Embedding model: @cf/baai/bge-small-en-v1.5 (384-dim)

If the index dimension isn’t 384, vector search can fail or degrade into meaningless similarity scores.
The script prevents this by validating stats.dimension === 384 before processing.


Architecture overview (pipeline flow)


Retrieval + extraction: what happens per job

1) Retrieval: candidate narrowing via vector search

  • Convert relevant job text to embeddings using Cloudflare Workers AI.
  • Use vector similarity search in skills_taxonomy to retrieve top-N candidate skills.
  • Candidates constrain the downstream LLM step (better precision, lower cost).

2) Extraction: structured inference via Mastra workflow

A cached Mastra workflow (extractJobSkillsWorkflow) performs:

  • prompt + schema-driven extraction
  • normalization (matching to taxonomy terms/ids)
  • validation (reject malformed outputs)
  • persistence into job_skill_tags

On failure, the script logs workflow status and step details for debugging.


Cloudflare Workers AI embeddings

Model contract and hardening

  • Model: @cf/baai/bge-small-en-v1.5
  • Vectors: 384 dimensions
  • Input contract: strict array of strings
  • Timeout: 45s (AbortController)
  • Output contract: explicit response shape checks (fail early on unexpected payloads)

This is important because embedding pipelines can silently drift if the response shape changes or inputs are malformed.
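
A hardened embedding call in that spirit might look like the sketch below. The REST path and response shape follow Cloudflare's Workers AI API as I understand it, but verify both against the current docs; env var names are placeholders.

// Sketch of the embedding call described above (verify endpoint and payload)
async function embed(texts: string[]): Promise<number[][]> {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), 45_000); // 45s timeout
  try {
    const res = await fetch(
      `https://api.cloudflare.com/client/v4/accounts/${process.env.CF_ACCOUNT_ID}/ai/run/@cf/baai/bge-small-en-v1.5`,
      {
        method: "POST",
        headers: {
          Authorization: `Bearer ${process.env.CF_API_TOKEN}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({ text: texts }), // strict array-of-strings contract
        signal: controller.signal,
      },
    );
    const payload = await res.json();
    const vectors = payload?.result?.data;
    // Fail early on unexpected payloads or wrong dimensionality
    if (!Array.isArray(vectors) || vectors.some((v) => v.length !== 384)) {
      throw new Error("Unexpected embedding response shape");
    }
    return vectors;
  } finally {
    clearTimeout(timeout);
  }
}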

Dimension enforcement (non-negotiable)

If skills_taxonomy was created/seeded with a different dimension:

  • similarity search becomes invalid (best case: errors; worst case: plausible-but-wrong matches)

The script enforces stats.dimension === 384 to keep retrieval semantically meaningful.


Turso/libSQL vector taxonomy index

  • Storage: Turso (libSQL)
  • Index: skills_taxonomy
  • Schema dimension: 384
  • Role: retrieval layer for skills ontology/taxonomy

The script also ensures the index is populated (count > 0), otherwise it fails fast and directs you to seed.


Reliability and operational controls

Observability: console + file tee logs

  • tees console.log/warn/error to a timestamped file and the terminal
  • log naming: extract-job-skills-<ISO timestamp>-<pid>.log
  • degrades to console-only logging if file IO fails

Graceful termination

  • SIGINT / SIGTERM sets a shouldStop flag
  • the loop exits after the current job completes
  • avoids interrupting in-flight workflow steps (embedding/LLM/DB writes)
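
In code, the flag-based shutdown described above can be as small as this sketch (jobs and processJob stand in for the script's own loop and per-job pipeline):

// Graceful-shutdown sketch: finish the current job, then stop
let shouldStop = false;
for (const signal of ["SIGINT", "SIGTERM"] as const) {
  process.once(signal, () => {
    console.warn(`Received ${signal}; stopping after the current job...`);
    shouldStop = true;
  });
}

for (const job of jobs) {
  if (shouldStop) break; // checked between jobs, never mid-workflow
  await processJob(job); // embedding + LLM extraction + DB write
  await new Promise((resolve) => setTimeout(resolve, 1000)); // fixed 1s backoff
}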

Idempotency / restart safety

Even after selecting jobs without tags, the script re-checks:

  • jobAlreadyHasSkills(jobId)

This avoids duplicate inference when:

  • you restart mid-run
  • multiple workers run concurrently
  • the initial query snapshot becomes stale

Throughput shaping

  • sequential processing
  • a fixed 1s backoff between jobs (simple, reliable rate-limit mitigation)

Failure modes

Retrieval layer failures (index health)

Triggers:

  • index missing
  • dimension mismatch (not 384)
  • empty index (count === 0)

Behavior: fail fast with actionable logs (recreate index / re-seed / verify DB target).

Embedding timeouts

Symptom: embedding call exceeds 45s and aborts. Behavior: job fails; run continues.

Mitigations:

  • chunk long descriptions upstream
  • add retry/backoff on transient 429/5xx
  • monitor Workers AI service health

Workflow failures

Behavior: job is marked failed; run continues. Logs include step trace and error payload to accelerate debugging.


Quick reference

  • Embeddings: Cloudflare Workers AI @cf/baai/bge-small-en-v1.5 (384-dim)
  • Retrieval: Turso/libSQL vector index skills_taxonomy (384-dim)
  • Orchestration: Mastra workflow extractJobSkillsWorkflow
  • Persistence: job_skill_tags
  • Embedding timeout: 45s
  • Stop behavior: graceful after current job (SIGINT / SIGTERM)

AI Observability for LLM Evals with Langfuse

· 10 min read
Vadim Nicolai
Senior Software Engineer

This article documents an evaluation harness for a Remote EU job classifier—but the real focus is AI observability: how to design traces, spans, metadata, scoring, and run-level grouping so you can debug, compare, and govern LLM behavior over time.

The script runs a batch of curated test cases, loads the latest production prompt from Langfuse (with a safe fallback), executes a structured LLM call, scores results, and publishes metrics back into Langfuse. That gives you:

  • Reproducibility (prompt versions + test set + session IDs)
  • Debuggability (one trace per test case; inspect inputs/outputs)
  • Comparability (run-level aggregation; trend metrics across changes)
  • Operational safety (flush guarantees, CI thresholds, rate-limit control)

Why "observability-first" evals matter

A typical eval script prints expected vs actual and calls it a day. That's not enough once you:

  • iterate prompts weekly,
  • swap models,
  • add guardrails,
  • change schemas,
  • tune scoring rubrics,
  • and need to explain regressions to humans.

Observability-first evals answer questions like:

  • Which prompt version produced the regression?
  • Is accuracy stable but confidence becoming overconfident?
  • Are failures clustered by location phrasing ("EMEA", "EU timezone", "Worldwide")?
  • Did we increase tokens/latency without improving correctness?
  • Can we click from CI logs straight into the trace of the failing example?

Langfuse becomes your "flight recorder": the trace is the unit of truth for what happened.


End-to-end architecture


Observability design: what gets traced and why

Trace strategy: one trace per test case

Principle: if you can't click into an individual example, you can't debug.

Each test case produces a Langfuse trace (think "request-level unit"), tagged with:

  • sessionId: groups a full run (critical for comparisons)
  • testCaseId, description: anchors the trace to your dataset
  • prompt metadata: name/label/version/hash (ideal)
  • model metadata: provider, model name, parameters (ideal)

This makes failures navigable and filterable.

Span strategy: one generation per model call

Inside each trace, you create a generation span for the model call:

  • captures input (prompt + job posting)
  • captures output (structured object + reason)
  • captures usage (token counts)
  • optionally captures latency (recommended)
  • optionally captures model params (temperature, top_p, etc.)

Even if the script is "just evals," treat each example like production traffic. That's how you build a reliable debugging workflow.
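
With the Langfuse TypeScript SDK, the trace-plus-generation structure looks roughly like the sketch below. Exact option names can vary across SDK versions, and sessionId, testCase, promptText, result, and accuracy are placeholders for values described above.

import { Langfuse } from "langfuse";

const langfuse = new Langfuse(); // reads the LANGFUSE_* keys from the environment

// One trace per test case, one generation per model call (sketch)
const trace = langfuse.trace({
  name: "remote-eu-eval",
  sessionId,
  metadata: { testCaseId: testCase.id, promptName: "job-classifier" },
});

const generation = trace.generation({
  name: "classify",
  model: "deepseek-chat",
  input: { prompt: promptText, job: testCase.job },
});

// ...run the structured LLM call, then record the result
generation.end({ output: result });
trace.score({ name: "remote-eu-accuracy", value: accuracy });

await langfuse.flushAsync();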


Prompt governance: Langfuse prompts + fallback behavior

Your harness fetches a prompt by name and label:

  • name: job-classifier
  • label: production

If prompt retrieval fails or is disabled (e.g., SKIP_LANGFUSE_PROMPTS=true), it uses a local fallback prompt.

Observability tip: always record the effective prompt identity

To compare runs, you want "which exact prompt did this use?" in trace metadata. If your prompt fetch returns versioning info, store:

  • promptName
  • promptLabel
  • promptVersion or promptId or promptHash

If it does not return version info, you can compute a stable hash of the prompt text and store that (lightweight, extremely useful).
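
Computing such a hash is a one-liner with Node's crypto module (promptText is whatever prompt string the run actually used):

import { createHash } from "node:crypto";

// Stable identity for a prompt text when no version id is available
const promptHash = createHash("sha256").update(promptText).digest("hex").slice(0, 12);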


Structured output: Zod as an observability contract

The classifier returns:

  • isRemoteEU: boolean
  • confidence: "high" | "medium" | "low"
  • reason: string
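
With Zod, that contract looks roughly like this:

import { z } from "zod";

// Structured output contract for the classifier (mirrors the fields above)
export const classificationSchema = z.object({
  isRemoteEU: z.boolean(),
  confidence: z.enum(["high", "medium", "low"]),
  reason: z.string().min(1),
});

export type Classification = z.infer<typeof classificationSchema>;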

Why structured output is observability, not just "parsing"

A strict schema:

  • removes ambiguity ("was that JSON-ish text or valid data?")
  • enables stable scoring and aggregation
  • prevents downstream drift as prompts change
  • improves triage because the same fields are always present

If you ever add fields like region, countryHints, remotePolicy, do it via schema extension and keep historical compatibility in your scorer.


The full eval lifecycle as a trace model

This is what you want stored per test case:

When a case fails, you should be able to answer in one click:

  • Which prompt version?
  • What input text exactly?
  • What output object exactly?
  • What scoring decision and why?
  • Was the model "confidently wrong"?

Scoring and metrics: accuracy is necessary but insufficient

Your harness logs two scores:

  1. remote-eu-accuracy: a numeric score from your scorer. This can be binary (0/1) or continuous (0..1). Continuous is often better because it supports partial credit and more informative trend analysis.

  2. confidence-match: a binary score (1/0) tracking whether the model's confidence matches the expected confidence.

Observability tip: store scorer metadata as the comment (or trace metadata)

A score without context is hard to debug. For incorrect cases, write comments like:

  • expected vs actual isRemoteEU
  • expected vs actual confidence
  • a short reason ("Predicted EU-only due to 'EMEA' but posting says US time zones")

Also consider storing structured metadata (if your Langfuse SDK supports it) so you can filter/group later.


Run-level grouping: session IDs as your "eval run" primitive

A sessionId = eval-${Date.now()} groups the whole batch. This enables:

  • "show me all traces from the last run"
  • comparisons across runs
  • slicing by prompt version across sessions
  • CI links that land you on the run dashboard

Recommendation: include additional stable tags:

  • gitSha, branch, ciBuildId (if running in CI)
  • model and promptVersion (for quick comparisons)

Even if you don't have them now, design the metadata schema so adding them later doesn't break anything.


Mermaid: evaluation flow, sequence, and data model (together)

1) Flow: control plane of the batch run

2) Sequence: what actually happens per case

3) Data model: eval artifacts


How to run (and make it debuggable in one click)

Environment variables

Required:

  • LANGFUSE_SECRET_KEY
  • LANGFUSE_PUBLIC_KEY
  • LANGFUSE_BASE_URL
  • DEEPSEEK_API_KEY

Optional:

  • SKIP_LANGFUSE_PROMPTS=true (use local prompt fallback)

Run:

pnpm tsx scripts/eval-remote-eu-langfuse.ts

Local prompt fallback:

SKIP_LANGFUSE_PROMPTS=true pnpm tsx scripts/eval-remote-eu-langfuse.ts

Observability tip: print a stable "run header"

In console output (and CI logs), it helps to print:

  • sessionId
  • model name
  • prompt version/hash
  • total test cases

That turns logs into an index into Langfuse.


Debugging workflow: from CI failure to root cause

When accuracy drops below threshold and CI fails, you want a deterministic workflow:

  1. Open the Langfuse session for the run (grouped by sessionId)

  2. Filter traces where remote-eu-accuracy = 0 (or below some threshold)

  3. For each failing trace:

    • check prompt version/hash
    • check job posting input text (location phrasing is often the culprit)
    • inspect structured output (especially confidence)
    • read the reason for the scorer's decision

Practical tips & gotchas (observability edition)

1) Always flush telemetry

If you exit early, you can lose the most important traces. Ensure flushAsync() happens even on errors (e.g., in a finally block) and only exit after flush completes.
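
The usual shape is a try/finally around the whole run (runEvalBatch is a stand-in for your batch entry point, langfuse for the shared client):

try {
  await runEvalBatch();
} finally {
  // Flush even when the run throws, and only exit after flushing completes
  await langfuse.flushAsync();
}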

2) Don't parallelize blindly

Parallel execution improves speed but can:

  • amplify rate limits
  • introduce noisy latency
  • create non-deterministic output ordering in logs

If you do parallelize, use bounded concurrency and capture per-case timing.

3) Track prompt identity, not just prompt text

Prompt text alone is hard to compare across runs. Record version/hash so you can correlate changes with performance.

4) Separate "correctness" from "calibration"

A model can get higher accuracy while becoming confidently wrong on edge cases. Keeping confidence-match (or richer calibration metrics later) prevents hidden regressions.

5) Add slice metrics before you add more test cases

Instead of only "overall accuracy," compute accuracy by category:

  • "EU-only"
  • "Worldwide remote"
  • "EMEA" phrasing
  • "Hybrid" / "On-site"
  • "Contractor / employer-of-record constraints"

This reveals what's actually breaking when a prompt changes.


Suggested next upgrades (high leverage)

A) Add latency and cost proxies

Record:

  • duration per generation span (ms)
  • token totals per case

Then you can chart:

  • cost/latency vs accuracy
  • regressions where prompt got longer but not better

B) Add a "reason quality" score (optional, small rubric)

Create a third score like reason-quality to detect when explanations degrade (too vague, irrelevant, or missing key constraints). Keep it light—don't overfit to phrasing.

C) Prompt A/B within the same run

Evaluate production vs candidate prompts on the same test set:

  • two sessions (or two labels within one session)
  • compare metrics side-by-side in Langfuse

Docusaurus note: Mermaid support

If Mermaid isn't rendering, enable it in Docusaurus:

// docusaurus.config.js
const config = {
  markdown: { mermaid: true },
  themes: ["@docusaurus/theme-mermaid"],
};
module.exports = config;

The takeaway: observability is the eval superpower

A well-instrumented eval harness makes improvements measurable and regressions explainable:

  • traces turn examples into clickable evidence
  • structured outputs stabilize scoring
  • session IDs make runs comparable
  • multiple metrics prevent hidden failure modes

If you treat evals like production requests—with traces, spans, and scores—you'll iterate faster and break less.

Schema-First RAG with Eval-Gated Grounding and Claim-Card Provenance

· 7 min read
Vadim Nicolai
Senior Software Engineer

This article documents a production-grade architecture for generating research-grounded therapeutic content. The system prioritizes verifiable artifacts (papers → structured extracts → scored outputs → claim cards) over unstructured text.

You can treat this as a “trust pipeline”: retrieve → normalize → extract → score → repair → persist → generate.

Evals for Workflow-First Production LLMs: Contracts, Rubrics, Sampling, and Observability

· 12 min read
Vadim Nicolai
Senior Software Engineer

Building Production Evals for LLM Systems

Building LLM systems you can measure, monitor, and improve

Large language models feel like software, but they don’t behave like software.

With conventional programs, behavior is mostly deterministic: if tests pass, you ship, and nothing changes until you change the code. With LLM systems, behavior can drift without touching a line—model updates, prompt edits, temperature changes, tool availability, retrieval results, context truncation, and shifts in real-world inputs all move the output distribution.

So “it seems to work” isn’t a strategy. Evals are how you turn an LLM feature from a demo into an engineered system you can:

  • Measure (quantify quality across dimensions)
  • Monitor (detect drift and regressions early)
  • Improve (pinpoint failure modes and iterate)

This doc builds evals from first principles and anchors everything in a concrete example: a workflow that classifies job postings as Remote EU (or not), outputs a structured JSON contract, and attaches multiple scorers (deterministic + LLM-as-judge) to generate reliable evaluation signals.


1) The core idea: make quality observable

An eval is a function:

Eval(input, output, context?, ground_truth?) → score + reason + metadata

A single scalar score is rarely enough. You want:

  • Score: for trendlines, comparisons, and gating
  • Reason: for debugging and iteration
  • Metadata: to reproduce and slice results (model version, prompt version, retrieval config, toolset, sampling rate, time)

When you do this consistently, evals become the LLM equivalent of:

  • unit tests + integration tests,
  • observability (logs/metrics/traces),
  • QA plus post-release monitoring.

2) “Correct” is multi-dimensional

In LLM systems, quality is a vector.

Even if the final label is right, the output can still be unacceptable if:

  • it invents support in the explanation (hallucination),
  • it violates the rubric (misalignment),
  • it fails formatting constraints (schema noncompliance),
  • it’s unhelpful or vague (low completeness),
  • it includes unsafe content (safety).

So you don’t build one eval. You build a panel of scorers that measure different axes.


3) Deterministic vs model-judged evals

3.1 Deterministic evals (cheap, stable, strict)

No model involved. Examples:

  • schema validation
  • required fields present (e.g., reason non-empty)
  • bounds checks (confidence ∈ {low, medium, high})
  • regex checks (must not include disallowed fields)

Strengths: fast, repeatable, low variance.
Limitations: shallow; can’t grade nuance like “is this reason actually supported?”

3.2 LLM-as-judge evals (powerful, fuzzy, variable)

Use a second model (the judge) to grade output against a rubric and evidence.

Strengths: can evaluate nuanced properties like grounding, rubric adherence, and relevance.
Limitations: cost/latency, judge variance, judge drift, and susceptibility to prompt hacking if unconstrained.

In production, the winning pattern is: deterministic guardrails + rubric-based judge scoring + sampling.


4) The “Remote EU” running example

4.1 Task

Input:

  • title
  • location
  • description

Output contract:

{
  "isRemoteEU": true,
  "confidence": "high",
  "reason": "Short evidence-based justification."
}

4.2 Why this is a great eval example

Job posts are full of ambiguous and misleading phrases:

  • “EMEA” is not EU-only
  • “CET/CEST” is a timezone, not eligibility
  • UK is not in the EU
  • Switzerland/Norway are in Europe but not EU
  • “Hybrid” is not fully remote
  • Multi-location lists can mix EU and non-EU constraints

This creates exactly the kind of environment where “vibes” fail and measurement matters.


5) Workflow-first evaluation architecture

A practical production architecture separates:

  • serving (fast path that returns a result),
  • measurement (scoring and diagnostics, often sampled).

Why this split matters

If your most expensive scorers run inline on every request, your feature inherits their cost and latency. A workflow-first approach gives you options:

  • always-on “must-have” scoring,
  • sampled deep diagnostics,
  • offline golden-set evaluation in CI.

6) Contracts make evaluation reliable: rubric + schema

6.1 Rubric is the spec

If you can’t state what “correct” means, you can’t measure it consistently.

Your rubric should define:

  • positive criteria (what qualifies),
  • explicit negatives (what disqualifies),
  • ambiguous cases and how to resolve them,
  • precedence rules (what overrides what).

6.2 Schema is the contract

Structured output makes evaluation composable:

  • score isRemoteEU separately from reason,
  • validate confidence vocabulary,
  • enforce required fields deterministically.

7) Design the scorer suite as a “sensor panel”

A robust suite typically includes:

7.1 Always-on core

  • Domain correctness judge (rubric-based)
  • Deterministic sanity (schema + hasReason)
  • Optionally: lightweight grounding check (if user-facing)

7.2 Sampled diagnostics

  • Faithfulness / hallucination (judge-based)
  • Prompt alignment
  • Answer relevancy
  • Completeness / keyword coverage (careful: can be gamed)

7.3 Low-rate tail-risk

  • Toxicity
  • Bias
  • (Domain-dependent) policy checks

8) The anchor metric: domain correctness as a strict judge

Generic “relevance” is not enough. You need:

“Is isRemoteEU correct under this rubric for this job text?”

8.1 What a good judge returns

A strong judge returns structured, actionable feedback:

  • score ∈ [0, 1]
  • isCorrect boolean
  • mainIssues[] (typed failure modes)
  • reasoning (short justification)
  • optional evidenceQuotes[] (snippets that support the judgment)

8.2 The “use only evidence” constraint

The most important instruction to judges:

Use ONLY the job text + rubric. Do not infer missing facts.

Without this, your judge will “helpfully” hallucinate implied constraints, and your metric becomes untrustworthy.


9) Deterministic sanity checks: tiny effort, huge payoff

Even with a schema, add simple checks:

  • reason.trim().length > 0
  • confidence in an allowed set
  • optional length bounds for reason (prevents rambling)

These are cheap, stable, and catch silent regressions early.


10) Grounding: the trust layer

In many real products, the worst failure is not “wrong label.” It’s unsupported justification.

A model can guess the right label but invent a reason. Users trust the reason more than the label. When the reason lies, trust is gone.

Useful grounding dimensions:

  • Faithfulness: does the reason match the job text?
  • Non-hallucination: does it avoid adding unsupported claims?
  • Context relevance: does it actually use provided context?

Normalize score direction

If a scorer returns “lower is better” (hallucination/toxicity), invert it so higher is always better:
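
For scorers already normalized to [0, 1], a minimal sketch (hallucinationScore and toxicityScore are placeholder names):

// Convert "lower is better" scores into the shared "higher is better" convention
const nonHallucination = 1 - hallucinationScore; // hallucinationScore in [0, 1]
const nonToxicity = 1 - toxicityScore; // toxicityScore in [0, 1]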

This prevents endless mistakes in dashboards and thresholds.


11) Aggregation: how many metrics become decisions

You typically want three layers:

11.1 Hard gates (binary invariants)

Examples:

  • schema valid
  • hasReason = 1
  • correctness score ≥ threshold
  • non-hallucination ≥ threshold (if user-facing)

11.2 Soft composite score (trend tracking)

A weighted score helps compare versions, but should not hide hard failures.

11.3 Diagnostics (why it failed)

Store mainIssues[] and judge reasons so you can cluster and fix.


12) Slicing: where the real insight lives

A single global average is rarely useful. You want to slice by meaningful features:

For Remote EU:

  • contains “EMEA”
  • contains “CET/CEST”
  • mentions UK
  • mentions hybrid/on-site
  • mentions “Europe” (ambiguous)
  • multi-location list present
  • mentions “EU work authorization”

This turns “accuracy dropped” into “accuracy dropped specifically on CET-only job posts.”
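
A sketch of slice tagging with plain pattern checks (the patterns are illustrative and intentionally loose):

const SLICE_PATTERNS: Record<string, RegExp> = {
  emea: /\bEMEA\b/i,
  cet_cest: /\bCES?T\b/,
  uk: /\b(UK|United Kingdom)\b/,
  hybrid_or_onsite: /\b(hybrid|on-?site)\b/i,
  europe_ambiguous: /\bEurope\b/,
  eu_work_authorization: /EU work authori[sz]ation/i,
};

export function slicesFor(jobText: string): string[] {
  return Object.entries(SLICE_PATTERNS)
    .filter(([, pattern]) => pattern.test(jobText))
    .map(([name]) => name);
}

Group stored eval records by these tags and report correctness per slice, not one global number.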


13) The Remote EU rubric as a decision tree

A rubric becomes much easier to debug when you can visualize precedence rules.

Here’s an example decision tree (adapt to your policy):
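
One illustrative encoding is a set of ordered checks where the first match wins, mirroring the precedence rules; the rule names and patterns below are assumptions, not the original policy:

export type EuDecision = { isRemoteEU: boolean | "ambiguous"; rule: string };

export function decideRemoteEu(jobText: string): EuDecision {
  const text = jobText.toLowerCase();
  // Ordered rules: earlier rules take precedence over later ones
  if (/\b(hybrid|on-?site)\b/.test(text)) return { isRemoteEU: false, rule: "hybrid-or-onsite" };
  if (/\b(uk|united kingdom)\b[^.]*only/.test(text)) return { isRemoteEU: false, rule: "uk-only" };
  if (/remote\s*\(eu\)|\beu[- ]only\b/.test(text)) return { isRemoteEU: true, rule: "explicit-eu" };
  if (/\bemea\b/.test(text)) return { isRemoteEU: "ambiguous", rule: "emea" };
  if (/\b(cet|cest)\b/.test(text)) return { isRemoteEU: "ambiguous", rule: "timezone-only" };
  if (/\beurope\b/.test(text)) return { isRemoteEU: "ambiguous", rule: "europe-generic" };
  return { isRemoteEU: "ambiguous", rule: "no-clear-signal" };
}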

This makes edge cases explicit and makes judge behavior easier to audit.


14) Sampling strategy: cost-aware measurement

A practical scoring policy:

  • Always-on: correctness + sanity
  • 25% sampled: grounding + alignment + completeness
  • 10% sampled: safety canaries
  • 0%: tool-call accuracy until you actually use tools

This gives you statistical visibility with bounded cost.
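
A sketch of that policy as configuration, reusing the Scorer and ScorerTier types from the sensor-panel section (the helper is an assumption):

const SAMPLE_RATES: Record<ScorerTier, number> = {
  "always-on": 1.0,   // correctness + sanity
  "sampled": 0.25,    // grounding + alignment + completeness
  "tail-risk": 0.10,  // safety canaries
};

export function shouldRun(scorer: Scorer): boolean {
  return Math.random() < SAMPLE_RATES[scorer.tier];
}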

If you want deeper rigor:

  • increase sampling on releases,
  • reduce sampling during stable periods,
  • bias sampling toward risky slices (e.g., posts containing “EMEA” or “CET”).

15) Calibration: make “confidence” mean something

If you output confidence: high|medium|low, treat it as a measurable claim.

Track:

  • P(correct | high)
  • P(correct | medium)
  • P(correct | low)

A healthy confidence signal produces a separation like:

  • high ≫ medium ≫ low

If “high” is only marginally better than “medium,” you’re emitting vibes, not confidence.
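
A sketch of the calibration table computed from stored eval records (the record shape is an assumption):

interface EvalRecord {
  confidence: "high" | "medium" | "low";
  isCorrect: boolean;
}

export function calibrationTable(records: EvalRecord[]): Record<string, number> {
  const table: Record<string, number> = {};
  for (const level of ["high", "medium", "low"] as const) {
    const bucket = records.filter((r) => r.confidence === level);
    table[level] = bucket.length === 0
      ? NaN // no data for this level yet
      : bucket.filter((r) => r.isCorrect).length / bucket.length;
  }
  return table; // e.g. { high: 0.95, medium: 0.8, low: 0.55 } shows healthy separation
}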


16) Turning evals into improvement: the feedback loop

Evals are not a report card. They’re a loop.

  1. Collect runs + eval artifacts
  2. Cluster failures by mainIssues[]
  3. Fix prompt/rubric/routing/post-processing
  4. Re-run evals (golden set + sampled prod)
  5. Gate release based on regressions

The key operational shift: you stop debating anecdotes and start shipping changes backed by measured deltas.


17) Golden sets: fast regression detection

A golden set is a curated collection of test cases representing:

  • core behavior,
  • common edge cases,
  • historical failures.

Even 50–200 examples catch a shocking amount of regression.

For Remote EU, include cases mentioning:

  • “Remote EU only”
  • “Remote Europe” (ambiguous)
  • “EMEA only”
  • “CET/CEST only”
  • UK-only
  • Switzerland/Norway-only
  • hybrid-only (single city)
  • multi-location lists mixing EU and non-EU
  • “EU work authorization required” without explicit countries

Run the golden set:

  • on every prompt/model change (CI),
  • nightly as a drift canary.
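
A framework-agnostic sketch of the CI gate; the case shape, the classify function, and the 95% threshold are assumptions to adapt (isValidVerdict and RemoteEuVerdict come from section 6.2):

interface GoldenCase {
  id: string;
  jobText: string;
  expected: boolean; // expected isRemoteEU under the current rubric
}

export async function runGoldenSet(
  cases: GoldenCase[],
  classify: (jobText: string) => Promise<RemoteEuVerdict>,
  minAccuracy = 0.95, // illustrative release threshold
): Promise<void> {
  const failures: string[] = [];
  for (const c of cases) {
    const verdict = await classify(c.jobText);
    if (!isValidVerdict(verdict)) failures.push(`${c.id}: schema invalid`);
    else if (verdict.isRemoteEU !== c.expected) failures.push(`${c.id}: wrong label`);
  }
  const accuracy = 1 - failures.length / cases.length;
  if (accuracy < minAccuracy) {
    throw new Error(`Golden set regression (${(accuracy * 100).toFixed(1)}% accuracy):\n${failures.join("\n")}`);
  }
}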

18) Judge reliability: making LLM-as-judge dependable

Judge scoring is powerful, but you must treat the judge prompt like production code.

18.1 Techniques that reduce variance

  • force structured judge output (JSON schema)
  • use a clear rubric with precedence rules
  • include explicit negative examples
  • constrain the judge: “use only provided evidence”
  • keep judge temperature low
  • store judge prompt version + rubric version
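
A sketch of how these constraints show up in code; callJudge stands in for whatever model client you use and is a hypothetical signature, not a specific SDK:

export interface JudgeConfig {
  judgePromptVersion: string;
  rubricVersion: string;
  temperature: number; // keep this low, e.g. 0
}

export async function judgeRun(
  jobText: string,
  output: RemoteEuVerdict,
  config: JudgeConfig,
  callJudge: (prompt: string, temperature: number) => Promise<string>, // hypothetical model client
): Promise<JudgeVerdict & JudgeConfig> {
  const prompt = [
    `Judge prompt version: ${config.judgePromptVersion} / rubric version: ${config.rubricVersion}`,
    "Use ONLY the job text + rubric. Do not infer missing facts.",
    "Return JSON with: score, isCorrect, mainIssues, reasoning, evidenceQuotes.",
    `Job text:\n${jobText}`,
    `Output under review:\n${JSON.stringify(output)}`,
  ].join("\n\n");

  const raw = await callJudge(prompt, config.temperature);
  const verdict = JSON.parse(raw) as JudgeVerdict; // validate against a JSON schema in practice
  return { ...verdict, ...config }; // store prompt + rubric versions alongside the verdict
}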

18.2 Disagreement as signal

If you run multiple judges or compare a judge against deterministic heuristics, disagreement highlights ambiguous cases that merit:

  • rubric refinement,
  • targeted prompt updates,
  • additional training data,
  • routing policies.

19) Production gating patterns

Not every system should block on evals, but you can safely gate high-risk cases.

Common gates:

  • schema invalid → retry
  • correctness below threshold → rerun with a stronger model or request clarification (if user-facing)
  • low grounding score → regenerate explanation constrained to cite evidence
  • confidence low → route or mark uncertain
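
A sketch that maps gate failures to actions, reusing the aggregation result from section 11 (action names are illustrative):

export type GateAction =
  | { kind: "accept" }
  | { kind: "retry" }                  // schema invalid / sanity failed
  | { kind: "rerun-stronger-model" }   // correctness below threshold
  | { kind: "regenerate-explanation" } // low grounding score
  | { kind: "mark-uncertain" };        // low confidence → route or flag

export function gate(decision: AggregateDecision, verdict: RemoteEuVerdict): GateAction {
  if (decision.failures.includes("sanity")) return { kind: "retry" };
  if (decision.failures.includes("correctness")) return { kind: "rerun-stronger-model" };
  if (decision.failures.includes("grounding")) return { kind: "regenerate-explanation" };
  if (verdict.confidence === "low") return { kind: "mark-uncertain" };
  return { kind: "accept" };
}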

20) Beyond classifiers: evals for tool-using agents

Once your agent calls tools (search, databases, parsers, RAG), evals expand to include:

  • Tool selection correctness: did it call tools when needed?
  • Argument correctness: were tool parameters valid?
  • Faithful tool usage: did the model use tool outputs correctly?
  • Over-calling: did it waste calls?

This is where agentic systems often succeed or fail in production.


21) A practical checklist

Spec & contracts

  • Rubric defines positives, negatives, precedence, ambiguous cases
  • Output schema enforced
  • Prompt and rubric are versioned artifacts

Scorers

  • Always-on: correctness + sanity
  • Sampled: grounding + alignment + completeness
  • Low-rate: safety checks
  • Scores normalized so higher is better

Ops

  • Metrics stored with reasons + metadata
  • Slices defined for high-risk patterns
  • Golden set exists and runs in CI/nightly
  • Feedback loop ties evals directly to prompt/rubric/routing changes

Closing

Without evals, you can demo. With evals, you can ship—and keep shipping.

A workflow-first pattern—rubric + schema + domain correctness judge + grounding diagnostics + sampling + feedback loop—turns an LLM from a “text generator” into an engineered system you can measure, monitor, and improve like any serious production service.


Appendix: Reusable Mermaid snippets

A) System architecture

B) Eval taxonomy

C) Feedback loop

Agentic Job Pre-Screening with LangGraph + DeepSeek: Auto-Reject Fake “Remote” Roles

· 7 min read
Vadim Nicolai
Senior Software Engineer

Introduction

Remote job postings are noisy, inconsistent, and often misleading. A role is labeled “Remote”, but the actual constraints show up in one sentence buried halfway down the description:

  • “Remote (US only)”
  • “Must be authorized to work in the U.S. without sponsorship”
  • “EU/EEA only due to payroll constraints”
  • “Must overlap PST business hours”
  • “Hybrid, 2 days/week in-office”

This article breaks down a LangGraph system that pre-screens job postings using DeepSeek structured extraction, then applies deterministic rules to instantly decide:

✅ Apply
❌ Reject (with reasons + quotes)

The goal is simple: filter out non-viable jobs before you spend time applying.


The Problem: “Remote” Doesn’t Mean “Work From Anywhere”

Why Traditional Filters Fail

Keyword filters (“remote”, “anywhere”) fail because job descriptions are written inconsistently and constraints can be phrased in dozens of ways:

  1. Remote but country-restricted
  2. Remote but timezone-restricted
  3. Remote but payroll-limited
  4. Remote but no visa sponsorship
  5. Remote but actually hybrid

Instead of relying on fragile string matching, we use an LLM to read the description like a human, but output machine-usable constraints.


System Overview

This agent evaluates job postings in two phases:

  1. Analyze job text (DeepSeek + structured schema)
  2. Check eligibility (deterministic rules)

What It Detects

  • Location scope
    • US-only / EU-only / Global / Specific regions / Unknown
  • Remote status
    • fully-remote / remote-with-restrictions / hybrid / on-site / unknown
  • Visa sponsorship
    • explicit yes/no/unknown
  • Work authorization requirements
    • must be authorized in US/EU, or not specified
  • Timezone restrictions
    • PST overlap / CET overlap / etc.

Tech Stack

  • LangGraph: workflow orchestration and state transitions
  • DeepSeek: high-signal extraction from messy job text (deepseek-chat)
  • LangChain structured output: strict schema → stable parsing
  • Deterministic rules engine: eligibility enforcement without “LLM vibes”

Architecture Patterns

1) LangGraph Workflow

Instead of a linear script, the system is a graph-driven workflow: an analysis node feeds a deterministic eligibility-check node.

This shape is production-friendly because the workflow can expand safely:

  • add salary checks
  • add tech stack fit scoring
  • add seniority mismatch detection
  • add contractor vs employee constraints

Typed State + Structured Extraction

State Model (TypedDict)

LangGraph becomes far more reliable when state is explicit:

from typing import List, Optional, TypedDict

class JobScreeningState(TypedDict):
    job_title: str
    company: str
    description: str
    location: str
    url: str

    # Candidate requirements
    candidate_needs_visa_sponsorship: bool
    requires_fully_remote: bool
    requires_worldwide_remote: bool
    candidate_locations: List[str]

    # Output
    is_eligible: bool
    rejection_reasons: List[str]

    # Extracted requirements
    location_requirement: Optional[str]
    specific_regions: List[str]
    excluded_regions: List[str]
    visa_sponsorship_available: Optional[bool]
    work_authorization_required: Optional[str]
    remote_status: str
    timezone_restrictions: List[str]
    confidence: str
    key_phrases: List[str]
    analysis_explanation: Optional[str]

DeepSeek Extraction: Converting Messy Text Into Policy Constraints

Why Structured Output Is Non-Negotiable

Freeform LLM output is fragile. A production system needs predictable extraction. This agent forces DeepSeek into a strict schema:

from typing import List, Literal, Optional, TypedDict

class JobAnalysisSchema(TypedDict):
    location_requirement: Literal["US-only", "EU-only", "Global", "Specific-regions", "Unknown"]
    specific_regions: List[str]
    excluded_regions: List[str]
    remote_status: Literal["fully-remote", "remote-with-restrictions", "hybrid", "on-site", "unknown"]
    visa_sponsorship_available: Optional[bool]
    work_authorization_required: Literal["US-only", "EU-only", "Any", "Unknown"]
    timezone_restrictions: List[str]
    confidence: Literal["high", "medium", "low"]
    key_phrases: List[str]
    explanation: str

With this contract, the agent can safely feed extracted requirements into deterministic logic.


Token Efficiency: Keep Only High-Signal Lines

Job descriptions are long. Constraints are usually short. To reduce tokens and improve extraction precision, the system trims input to keyword-adjacent lines:

KEYWORDS = (
    "remote", "anywhere", "worldwide", "timezone", "sponsor", "visa",
    "authorized", "work authorization", "must be located", "eligible to work",
    "location", "region", "country", "overlap", "hours", "time zone"
)

def _keep_relevant(text: str, window: int = 2) -> str:
    lines = text.splitlines()
    keep = set()
    for i, ln in enumerate(lines):
        if any(k in ln.lower() for k in KEYWORDS):
            for j in range(max(0, i - window), min(len(lines), i + window + 1)):
                keep.add(j)
    return "\n".join(lines[i] for i in sorted(keep)) or text

This improves the system in four ways:

  • lower inference cost
  • faster runtime
  • less noise
  • fewer hallucination opportunities

Heuristics + DeepSeek: Hybrid Extraction That Wins

Before invoking DeepSeek, the system runs a tiny heuristic pre-check:

  • detects obvious “Remote (Worldwide)”
  • detects “Remote (US only)”
  • detects “Hybrid / On-site”
def _fast_heuristic_precheck(state: JobScreeningState) -> Optional[Dict[str, Any]]:
    loc = state.get("location", "") or ""
    desc = state.get("description", "") or ""
    seed: Dict[str, Any] = {}

    if _looks_worldwide(loc) or _looks_worldwide(desc):
        seed["location_requirement"] = "Global"
        seed["remote_status"] = "fully-remote"

    if (_looks_us_only(loc) or _looks_us_only(desc)) and not seed.get("location_requirement"):
        seed["location_requirement"] = "US-only"

    if _looks_hybrid_or_onsite(loc):
        seed["remote_status"] = "hybrid"

    return seed if seed else None

DeepSeek still performs the full extraction, but seeding improves resilience against incomplete metadata.


Eligibility Rules: Enforcing Worldwide Remote Strictly

The most valuable mode is strict worldwide remote filtering:

If requires_worldwide_remote=True, the job must satisfy ALL of the following:

  • remote_status == "fully-remote"
  • location_requirement == "Global"
  • no specific_regions
  • no timezone_restrictions
if state["requires_worldwide_remote"]:
if state["remote_status"] != "fully-remote":
rejection_reasons.append(
f"Not worldwide-remote: remote status is '{state['remote_status']}'"
)
if state["location_requirement"] != "Global":
rejection_reasons.append(
f"Not worldwide-remote: location requirement is '{state['location_requirement']}'"
)
if state["specific_regions"]:
rejection_reasons.append(
f"Not worldwide-remote: restricted to {state['specific_regions']}"
)
if state["timezone_restrictions"]:
rejection_reasons.append(
f"Not worldwide-remote: timezone restrictions {state['timezone_restrictions']}"
)

This instantly rejects listings that are “remote” in marketing only, such as:

  • “Remote, EU only”
  • “Remote, US/Canada preferred”
  • “Remote, PST overlap required”

Visa Sponsorship Semantics: Correct and Safe

Sponsorship logic is easy to get wrong. The correct behavior:

  • reject only when sponsorship is explicitly not available (False)
  • do not reject on unknown (None)
if state["candidate_needs_visa_sponsorship"]:
if state["visa_sponsorship_available"] is False:
rejection_reasons.append(
"Job does not offer visa sponsorship, but candidate needs sponsorship"
)

This avoids dropping jobs that simply don’t mention sponsorship.


Explainability: Rejection Reasons + Key Phrases

Trust requires receipts. The system stores:

  • rejection_reasons (deterministic outcomes)
  • key_phrases (quotes that triggered the decision)
  • analysis_explanation (LLM summary for debugging)

That produces outputs like:

  • “Job requires US location; candidate is not in US”
  • “Not worldwide-remote: timezone restrictions ['US Pacific business hours']”
  • key phrases like “Must be authorized to work in the U.S. without sponsorship”

Real-World Test Scenarios

The included test suite covers the most common job board traps:

  1. US-only remote + no sponsorship
  2. Remote worldwide (work from anywhere)
  3. EU-only remote
  4. Remote with timezone overlap requirement

This validates both extraction quality and deterministic enforcement.


Production Enhancements

1) Add a Match Score (Not Only Pass/Fail)

Binary decisions are clean, but scoring improves ranking:

  • 100 = perfect match
  • 70 = acceptable
  • 30 = not worth it
  • 0 = reject

2) Cache Results by URL Hash

You already compute a stable thread_id from the job URL. Persist results keyed by:

  • url_hash
  • model version
  • rule version

This prevents re-analyzing duplicate postings.

3) Detect Payroll Constraints Explicitly

Add signals for:

  • “We can only hire in countries where we have an entity”
  • “Deel/Remote.com limited coverage”
  • “W2 only / no contractors”

This is one of the highest ROI improvements for global applicants.


Conclusion

This LangGraph system turns job descriptions into enforceable constraints:

  • DeepSeek extracts remote reality, location scope, and sponsorship signals
  • Structured output makes extraction stable and machine-safe
  • Deterministic rules enforce candidate requirements precisely
  • Worldwide-remote mode filters out fake “remote” listings instantly
  • Decisions are explainable with reasons and quotes

This is how you scale job hunting without wasting time: automate rejection early, apply only where it can actually work.

References