
Building a Job Search Agent with Google ADK and Search Grounding

· 8 min read
Vadim Nicolai
Senior Software Engineer

Introduction

This article documents the implementation of an AI-powered job search agent using Google’s Agent Development Kit (ADK) with Google Search grounding. The agent demonstrates how to build intelligent search applications that combine the reasoning capabilities of large language models with real-time web search data.

What is Google ADK?

Google’s Agent Development Kit (ADK) is a TypeScript framework for building AI agents powered by Gemini models. Released in late 2025, ADK provides:

  • Agent orchestration — Define agents with specific tools and instructions
  • Tool integration — Built-in tools like GOOGLE_SEARCH for grounding
  • Session management — Persistent conversation state across interactions
  • Event streaming — Real-time access to agent execution events
  • Grounding metadata — Access to search sources, citations, and web queries

Key features

import { LlmAgent, GOOGLE_SEARCH } from '@google/adk';

const agent = new LlmAgent({
  name: 'searchAgent',
  model: 'gemini-2.5-flash',
  tools: [GOOGLE_SEARCH],
  instruction: 'Your detailed instructions here...'
});

The ADK’s GOOGLE_SEARCH tool enables grounding—anchoring LLM responses in real web search results rather than relying solely on training data.

The search agent architecture

High-level flow (event stream + grounding)

Top-down diagrams

Top-down architecture (layered)

Top-down data flow (what moves where)

Agent design

Our job search agent is built with a clear purpose: find remote AI consulting opportunities using structured search strategies.

File: /src/google/search-agent.ts

export const searchAgent = new LlmAgent({
  name: 'jobSearchAgent',
  model: 'gemini-2.5-flash',
  tools: [GOOGLE_SEARCH],
  instruction: `You are an expert job search assistant specializing in AI/GenAI roles...

ALWAYS perform Google Search first to get real-time job listings.
Extract actual job data from search results...

Return structured JSON with job listings...`
});

Agent instructions

The agent is instructed to:

  1. Execute targeted searches — Use specific boolean search operators
  2. Extract structured data — Parse job listings into JSON format
  3. Apply filters — Focus on last 7 days, remote-only, consultancy roles
  4. Include metadata — Company details, remote scope, EU eligibility

Search strategy

The agent generates six complementary search queries:

const queries = [
  'fully remote "AI consultant" agency OR consultancy "client-facing" RAG OR LLM (last 7 days)',
  'fully remote "generative AI engineer" agency "client delivery" RAG OR agents (last 7 days)',
  'fully remote "LLM architect" consultancy "client engagement" RAG OR agents (last 7 days)',
  'fully remote "AI solutions architect" consultancy "client facing" (last 7 days)',
  'remote "AI delivery manager" consultancy LLM OR GenAI (last 7 days)',
  'fully remote "AI specialist" consulting "client projects" EU (last 7 days)'
];

This multi-query approach captures:

  • Different role titles (consultant, engineer, architect, specialist)
  • Various technical focuses (RAG, agents, LLM)
  • Client-facing delivery emphasis
  • Recent postings only (7-day window)

Implementation details

Runner architecture

File: /src/google/runner.ts

The runner manages agent execution and extracts grounding metadata:

export async function runSearchAgent(
  prompt: string,
  options: { saveToFile?: boolean; outputDir?: string } = {}
): Promise<void> {
  const sessionService = new InMemorySessionService();
  const sessionId = `session-${Date.now()}`;

  const runner = new Runner(rootAgent, sessionService);

  for await (const event of runner.runGenerateContent(sessionId, { text: prompt })) {
    if (isFinalResponse(event)) {
      const content = stringifyContent(event.content);
      const grounding = getGroundingArrays(event);

      // Process and save results
    }
  }
}

Grounding metadata extraction

The critical feature is extracting grounding metadata—the search sources and citation mappings:

function getGroundingArrays(event: any): GroundingData {
  const metadata = event.groundingMetadata;

  return {
    groundingChunks: metadata?.groundingChunks || [],
    groundingSupports: metadata?.groundingSupports || [],
    webSearchQueries: metadata?.webSearchQueries || [],
    searchEntryPoint: metadata?.searchEntryPoint
  };
}

This provides:

  • groundingChunks — Source URLs and titles
  • groundingSupports — Citation mappings linking response segments to sources
  • webSearchQueries — The actual search queries executed by Google
  • searchEntryPoint — HTML content with Google Search suggestions
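The GroundingData type returned by getGroundingArrays isn't reproduced in this excerpt; a minimal sketch consistent with the fields listed above (the nested field names mirror Gemini's grounding metadata shape and are assumptions here):

// Hypothetical shape for the extracted grounding data (field names assumed).
interface GroundingChunk {
  web?: { uri?: string; title?: string };
}

interface GroundingSupport {
  segment?: { startIndex?: number; endIndex?: number; text?: string };
  groundingChunkIndices?: number[];
}

interface GroundingData {
  groundingChunks: GroundingChunk[];
  groundingSupports: GroundingSupport[];
  webSearchQueries: string[];
  searchEntryPoint?: { renderedContent?: string };
}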

Citation integration

The runner adds citation markers [1,2,3] to the agent’s response:

function addCitationMarkersToSegment(
  text: string,
  supports: any[]
): string {
  const indices = supports
    .flatMap(s => s.groundingChunkIndices || [])
    .filter((v, i, a) => a.indexOf(v) === i)
    .map(i => i + 1);

  return indices.length > 0
    ? `${text}[${indices.join(',')}]`
    : text;
}
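A sketch of how this helper might be applied across the response, one supported segment at a time (the wiring shown is an assumption; the actual runner code is not included above):

// Illustrative only: annotate each supported segment of the response text.
function annotateResponse(text: string, supports: GroundingSupport[]): string {
  let annotated = text;
  for (const support of supports) {
    const segment = support.segment?.text;
    if (!segment) continue;
    annotated = annotated.replace(
      segment,
      addCitationMarkersToSegment(segment, [support])
    );
  }
  return annotated;
}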

JSON output

Results are saved to timestamped JSON files:

const output = {
  timestamp: new Date().toISOString(),
  query: prompt,
  rawResponse: content,
  data: parsedData,
  summary: summary,
  webSearchQueries: grounding.webSearchQueries,
  sources: sources,
  renderedContent: grounding.searchEntryPoint?.renderedContent
};

fs.writeFileSync(
  path.join(outputDir, `job-search-results-${timestamp}.json`),
  JSON.stringify(output, null, 2)
);

Running the agent

Execution script

File: /scripts/run-google-search-agent.ts

import dotenv from 'dotenv';
import { runSearchAgent } from '../src/google/index';

// Load environment variables
dotenv.config({ path: '.env' });
dotenv.config({ path: '.env.local', override: true });

// Validate API key
if (!process.env.GEMINI_API_KEY) {
  console.error('❌ Error: GEMINI_API_KEY not found');
  process.exit(1);
}

// Execute search
const prompt = `Find 10 fully-remote AI / GenAI roles at agencies or consultancies.
Prefer client-facing delivery roles, especially those involving RAG, agents, or LLM implementation.
Include EU eligibility information.`;

runSearchAgent(prompt, { saveToFile: true, outputDir: './results' });

Run the script with:

npx tsx scripts/run-google-search-agent.ts

Results and grounding analysis

Search execution

The agent executed 6 web search queries and returned 10 grounding sources spread across 8 domains:

  1. remoterocketship.com — Remote-first job platform
  2. pinpointhq.com — Tech recruitment
  3. yutori.com — Remote opportunities
  4. djinni.co — Tech talent marketplace
  5. talent.com — Global job aggregator
  6. shine.com — Job search platform
  7. shiza.ai — AI-specialized job board
  8. studysmarter.co.uk — Career resources (3 matches)

Grounding metadata

The response included rich grounding data:

Web Search Queries Executed:

[
  "fully remote \"AI consultant\" agency OR consultancy OR \"professional services\" \"client-facing\" RAG OR LLM OR \"generative AI\" (last 7 days)",
  "fully remote \"generative AI engineer\" agency OR consultancy OR \"professional services\" \"client delivery\" RAG OR LLM OR agents (last 7 days)",
  "fully remote \"LLM architect\" consultancy \"client engagement\" RAG OR agents (last 7 days)",
  "fully remote \"AI solutions architect\" consultancy \"client facing\" (last 7 days)",
  "remote \"AI delivery manager\" consultancy LLM OR GenAI (last 7 days)",
  "fully remote \"AI specialist\" consulting \"client projects\" EU (last 7 days)"
]

Source Citations:

[
  {
    "index": 1,
    "title": "remoterocketship.com",
    "url": "https://vertexaisearch.cloud.google.com/grounding-api-redirect/..."
  }
]

Search Entry Point: The renderedContent field contains Google Search’s HTML with interactive search chips, providing users with clickable query refinements.

Technical challenges and lessons

Challenge 1: Empty responses

Problem: Initial implementation returned content: null despite successful execution.

Root cause: Using AgentTool to wrap another agent with GOOGLE_SEARCH created conflicts.

Solution: Use GOOGLE_SEARCH directly as the sole tool in the agent—ADK requires grounding tools to be the only tool configured.

Challenge 2: Structured data extraction

Problem: Agent returned Google Search HTML suggestions instead of extracted job data.

Solution: Enhanced instructions with explicit output format requirements and processing rules:

instruction: `
Rules:
1. ALWAYS perform Google Search first
2. Extract actual job data from search results
3. Format as JSON with fields: title, company, remote_scope, url, posted_date, etc.
4. Search for jobs posted within the LAST 7 DAYS only
`

Challenge 3: API key configuration

Problem: GEMINI_API_KEY not found in environment.

Solution: Support multiple .env files with precedence:

dotenv.config({ path: '.env' });
dotenv.config({ path: '.env.local', override: true });

Best practices

1. Single tool limitation

When using GOOGLE_SEARCH, it must be the only tool in the agent:

// ✅ Correct
const agent = new LlmAgent({
  tools: [GOOGLE_SEARCH]
});

// ❌ Incorrect - will cause issues
const agent = new LlmAgent({
  tools: [GOOGLE_SEARCH, CODE_EXECUTION, someCustomTool]
});

2. Explicit instructions

Provide clear, structured instructions for the LLM:

instruction: `
Purpose: [What the agent does]

Process:
1. [Step 1]
2. [Step 2]

Output Format:
{
"field": "description"
}

Rules:
- [Rule 1]
- [Rule 2]
`

3. Grounding metadata access

Access grounding data from the final response event:

for await (const event of runner.runGenerateContent(sessionId, prompt)) {
  if (isFinalResponse(event)) {
    const grounding = event.groundingMetadata;
    const chunks = grounding?.groundingChunks || [];
  }
}

4. Event stream debugging

Log event types to understand agent execution flow:

for await (const event of runner.runGenerateContent(...)) {
  console.log('Event:', {
    finishReason: event.finishReason,
    hasContent: !!event.content,
    hasGrounding: !!event.groundingMetadata
  });
}

Conclusion

Google’s ADK provides a powerful framework for building grounded AI agents. Key takeaways:

  • Grounding is essential — Real-time search data makes agents more accurate and current
  • Citations add credibility — Extracted source URLs enable verification
  • Event streams enable debugging — Real-time event monitoring aids development
  • Instructions matter — Clear, structured prompts yield better results
  • Multi-query strategies work — Complementary searches capture diverse results

The job search agent demonstrates how ADK can transform a simple search task into an intelligent, automated workflow with structured outputs and verifiable sources.

Technical stack

  • @google/adk v0.3.0 — Agent Development Kit
  • gemini-2.5-flash — LLM model with grounding support
  • TypeScript — Type-safe agent development
  • Node.js 24+ — Runtime environment
  • tsx — TypeScript execution
  • dotenv — Environment configuration

Repository structure

/src/google/
  search-agent.ts              # Agent definition with GOOGLE_SEARCH
  runner.ts                    # Execution logic and grounding extraction
  index.ts                     # Public exports
  README.md                    # Setup and usage documentation

/scripts/
  run-google-search-agent.ts   # Executable runner script

/results/
  job-search-results-{timestamp}.json   # Output files

Streaming OpenAI TTS to Cloudflare R2

· 11 min read
Vadim Nicolai
Senior Software Engineer

This article documents a production implementation of OpenAI's Text-to-Speech (TTS) API with automatic chunking for long-form content and seamless upload to Cloudflare R2 storage.

Architecture Overview

The system provides two API entrypoints for audio generation:

  1. GraphQL Mutation (generateOpenAIAudio) — used by the main app for story audio
  2. REST API (/api/tts) — provides flexible streaming options and direct upload

Both endpoints support:

  • Automatic text chunking for content exceeding 4000 characters
  • Audio merging for seamless playback of long content
  • Cloudflare R2 upload with public CDN URLs
  • Base64 fallback for immediate playback while uploading
  • Metadata tracking (duration, voice, model, etc.)

OpenAI TTS Integration

Voice Selection

Defaults to onyx but supports all OpenAI TTS voices:

  • alloy, ash, ballad, coral, echo, fable
  • onyx (default), nova, sage, shimmer
  • verse, marin, cedar

Model Selection

Supports three models:

  • gpt-4o-mini-tts (default) — fast, efficient, high quality
  • tts-1 — standard quality
  • tts-1-hd — high definition audio

Audio Formats

Supports multiple output formats:

  • mp3 (default) — best compatibility
  • opus, aac, flac, wav, pcm
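As a baseline before the chunked pipeline below, a minimal single-request call exercising these options might look like this (a sketch using the defaults described above):

import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// One short request with the default model, voice, and format described above.
const response = await openai.audio.speech.create({
  model: "gpt-4o-mini-tts",
  voice: "onyx",
  input: "Hello from the TTS pipeline.",
  response_format: "mp3",
});

const audio = Buffer.from(await response.arrayBuffer());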

Text Chunking for Long Content

OpenAI TTS has a 4096 character limit. This implementation uses Mastra RAG’s recursive chunking strategy to intelligently split long text.

import { MDocument } from "@mastra/rag";

const MAX_CHARS = 4000; // Buffer below OpenAI's 4096 limit

async function chunkTextForSpeech(text: string): Promise<string[]> {
  const doc = MDocument.fromText(text);

  const chunks = await doc.chunk({
    strategy: "recursive",
    maxSize: MAX_CHARS,
    overlap: 50,
    separators: ["\n\n", "\n", ". ", "! ", "? "],
  });

  return chunks.map((chunk) => chunk.text);
}

Key features:

  • Respects paragraph breaks (\n\n)
  • Falls back to sentence boundaries (., !, ?)
  • 50-character overlap prevents awkward breaks
  • Maintains narrative flow across chunks

Audio Merging

When text is chunked, each piece is converted to audio separately, then merged into a single file:

// Generate audio for each chunk
const audioChunks: Buffer[] = [];

for (const chunk of chunks) {
  const response = await openai.audio.speech.create({
    model: "gpt-4o-mini-tts",
    voice: "onyx",
    input: chunk,
    response_format: "mp3",
    speed: 0.9,
  });

  const buffer = Buffer.from(await response.arrayBuffer());
  audioChunks.push(buffer);
}

// Combine into single audio file
const combined = Buffer.concat(audioChunks);

Why merge?

  • Single file = simpler playback
  • No gaps between chunks
  • Easier to upload and share
  • Better browser compatibility

Cloudflare R2 Upload

R2 Client Setup

Uses AWS SDK v3 with Cloudflare R2 endpoints:

import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const r2Client = new S3Client({
  region: "auto",
  endpoint: `https://${R2_ACCOUNT_ID}.r2.cloudflarestorage.com`,
  credentials: {
    accessKeyId: R2_ACCESS_KEY_ID,
    secretAccessKey: R2_SECRET_ACCESS_KEY,
  },
});

Upload Function

export async function uploadToR2(options: {
  key: string;
  body: Buffer;
  contentType?: string;
  metadata?: Record<string, string>;
}): Promise<{
  key: string;
  publicUrl: string | null;
  bucket: string;
  sizeBytes: number;
}> {
  const { key, body, contentType = "audio/mpeg", metadata = {} } = options;

  await r2Client.send(
    new PutObjectCommand({
      Bucket: R2_BUCKET_NAME,
      Key: key,
      Body: body,
      ContentType: contentType,
      Metadata: metadata,
    }),
  );

  const publicUrl = R2_PUBLIC_DOMAIN ? `${R2_PUBLIC_DOMAIN}/${key}` : null;

  return {
    key,
    publicUrl,
    bucket: R2_BUCKET_NAME,
    sizeBytes: body.length,
  };
}

Key Generation

Unique keys with timestamps prevent collisions:

export function generateAudioKey(prefix?: string): string {
  const timestamp = Date.now();
  const random = Math.random().toString(36).substring(2, 15);
  return `${prefix ? `${prefix}/` : ""}audio-${timestamp}-${random}.mp3`;
}

Example output: graphql-tts/audio-1707839472156-9k2jd8x4.mp3

Metadata Extraction

The system calculates audio duration using music-metadata:

import { parseBuffer } from "music-metadata";

let duration: number | null = null;
try {
  const metadata = await parseBuffer(audioBuffer, {
    mimeType: "audio/mp3",
  });
  duration = metadata.format.duration || null;
} catch (error) {
  console.warn("Failed to parse audio duration:", error);
}

Metadata stored with upload:

  • voice — TTS voice used
  • model — OpenAI model
  • textLength — original text length
  • chunks — number of chunks (if split)
  • generatedBy — user email
  • instructions — custom TTS instructions (optional)

GraphQL Implementation

Schema Definition

input GenerateOpenAIAudioInput {
  text: String!
  storyId: Int
  voice: OpenAITTSVoice
  model: OpenAITTSModel
  speed: Float
  responseFormat: OpenAIAudioFormat
  uploadToCloud: Boolean
  instructions: String
}

type GenerateOpenAIAudioResult {
  success: Boolean!
  message: String
  audioBuffer: String
  audioUrl: String
  key: String
  sizeBytes: Int
  duration: Float
}

type Mutation {
  generateOpenAIAudio(input: GenerateOpenAIAudioInput!): GenerateOpenAIAudioResult!
}

Resolver Implementation

export const generateOpenAIAudio: MutationResolvers["generateOpenAIAudio"] = async (
  _parent,
  args,
  ctx,
) => {
  const userEmail = ctx.userEmail;
  if (!userEmail) {
    throw new Error("Authentication required");
  }

  const {
    text,
    storyId,
    voice = "ONYX",
    model = "GPT_4O_MINI_TTS",
    speed = 0.9,
    responseFormat = "MP3",
    uploadToCloud,
  } = args.input;

  // Map GraphQL enums to OpenAI API values
  const openAIVoice = voice.toLowerCase();
  const openAIModel = model === "GPT_4O_MINI_TTS" ? "gpt-4o-mini-tts" : "tts-1";
  const format = responseFormat.toLowerCase();

  // Handle chunking if needed
  if (text.length > MAX_CHARS) {
    const chunks = await chunkTextForSpeech(text);
    const audioChunks: Buffer[] = [];

    for (const chunk of chunks) {
      const response = await openai.audio.speech.create({
        model: openAIModel,
        voice: openAIVoice,
        input: chunk,
        response_format: format,
        speed,
      });

      audioChunks.push(Buffer.from(await response.arrayBuffer()));
    }

    const combined = Buffer.concat(audioChunks);
    // audioDuration is parsed from the combined buffer via music-metadata (see Metadata Extraction above)

    // Upload to R2
    if (uploadToCloud) {
      const key = generateAudioKey("graphql-tts");
      const result = await uploadToR2({
        key,
        body: combined,
        contentType: `audio/${format}`,
        metadata: {
          voice: openAIVoice,
          model: openAIModel,
          textLength: text.length.toString(),
          chunks: chunks.length.toString(),
          generatedBy: userEmail,
        },
      });

      // Save to story if provided
      if (storyId) {
        await saveAudioToStory(storyId, result.key, result.publicUrl, userEmail);
      }

      return {
        success: true,
        message: `Audio generated from ${chunks.length} chunks and uploaded to R2`,
        audioBuffer: combined.toString("base64"),
        audioUrl: result.publicUrl,
        key: result.key,
        sizeBytes: result.sizeBytes,
        duration: audioDuration,
      };
    }

    return {
      success: true,
      message: `Audio generated from ${chunks.length} chunks`,
      audioBuffer: combined.toString("base64"),
      audioUrl: null,
      key: null,
      sizeBytes: combined.length,
      duration: audioDuration,
    };
  }

  // ... handle short text similarly
};

Client Usage

import { useGenerateOpenAiAudioMutation } from "@/app/__generated__/hooks";

const [generateAudio, { loading }] = useGenerateOpenAiAudioMutation();

async function handleTextToSpeech() {
  const result = await generateAudio({
    variables: {
      input: {
        text: storyContent,
        storyId,
        voice: OpenAittsVoice.Onyx,
        model: OpenAittsModel.Gpt_4OMiniTts,
        speed: 0.9,
        responseFormat: OpenAiAudioFormat.Mp3,
        uploadToCloud: true,
      },
    },
  });

  const payload = result.data?.generateOpenAIAudio;
  if (payload?.success && payload.audioUrl) {
    // Use public R2 URL
    setAudioSrc(payload.audioUrl);
  } else if (payload?.audioBuffer) {
    // Fallback to base64
    const blob = base64ToBlob(payload.audioBuffer);
    setAudioSrc(blob);
  }
}
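The base64ToBlob helper referenced above is not shown in the article; a minimal browser-side sketch with an assumed signature:

// Hypothetical helper matching the usage above: decode a base64 payload into a Blob.
function base64ToBlob(base64: string, contentType = "audio/mpeg"): Blob {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return new Blob([bytes], { type: contentType });
}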

REST API Implementation

Streaming Options

The /api/tts endpoint supports multiple streaming modes:

export async function POST(request: NextRequest) {
  const {
    text,
    voice = "alloy",
    uploadToCloud,
    streamFormat, // "sse" for Server-Sent Events
    instructions,
  } = await request.json();

  // For short text with SSE streaming
  if (streamFormat === "sse" && !uploadToCloud) {
    const response = await openai.audio.speech.create({
      model: "gpt-4o-mini-tts",
      voice,
      input: text,
      response_format: "mp3",
      speed: 0.9,
      stream_format: "sse",
    });

    // Create SSE stream
    const stream = new ReadableStream({
      async start(controller) {
        const reader = response.body?.getReader();
        if (!reader) {
          controller.close();
          return;
        }

        while (true) {
          const { done, value } = await reader.read();
          if (done) break;
          controller.enqueue(value);
        }
        controller.close();
      },
    });

    return new NextResponse(stream, {
      headers: {
        "Content-Type": "text/event-stream",
        "Cache-Control": "no-cache",
        Connection: "keep-alive",
      },
    });
  }

  // Standard audio stream
  const response = await openai.audio.speech.create({
    model: "gpt-4o-mini-tts",
    voice,
    input: text,
    response_format: "mp3",
    speed: 0.9,
  });

  return new NextResponse(response.body, {
    headers: {
      "Content-Type": "audio/mp3",
      "Cache-Control": "no-cache",
      "Transfer-Encoding": "chunked",
    },
  });
}

Client Usage (Fetch API)

// Upload to R2
const response = await fetch("/api/tts", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    text: storyContent,
    voice: "onyx",
    uploadToCloud: true,
    storyId: 123,
    userEmail: "user@example.com",
  }),
});

const data = await response.json();
// { success: true, audioUrl: "https://tts.yourdomain.com/audio-123456.mp3", key: "...", sizeBytes: 234567 }

// Direct streaming
const response2 = await fetch("/api/tts", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    text: "Hello world",
    voice: "onyx",
  }),
});

const blob = await response2.blob();
const audioUrl = URL.createObjectURL(blob);
audioElement.src = audioUrl;

Database Integration

Audio metadata is saved to the database when a storyId is provided:

async function saveAudioToStory(
  storyId: number,
  audioKey: string,
  audioUrl: string | null,
  userEmail: string,
): Promise<void> {
  const now = new Date().toISOString();
  await turso.execute({
    sql: `UPDATE stories
          SET audio_key = ?, audio_url = ?, audio_generated_at = ?, updated_at = ?
          WHERE id = ? AND user_id = ?`,
    args: [audioKey, audioUrl || "", now, now, storyId, userEmail],
  });
}

Database schema (SQLite):

CREATE TABLE stories (
  id INTEGER PRIMARY KEY,
  goal_id INTEGER NOT NULL,
  user_id TEXT NOT NULL,
  content TEXT NOT NULL,
  audio_key TEXT,
  audio_url TEXT,
  audio_generated_at TEXT,
  created_at TEXT NOT NULL,
  updated_at TEXT NOT NULL
);

Configuration

Environment Variables

# OpenAI
OPENAI_API_KEY=sk-proj-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Cloudflare R2
R2_ACCOUNT_ID=your-account-id-here
R2_ACCESS_KEY_ID=your-access-key-id-here
R2_SECRET_ACCESS_KEY=your-secret-access-key-here
R2_BUCKET_NAME=longform-tts
R2_PUBLIC_DOMAIN=https://tts.yourdomain.com

R2 Bucket Setup

  1. Create R2 Bucket: In Cloudflare dashboard, create bucket named longform-tts

  2. Generate API Token: Create an API token with R2 read/write permissions

  3. Configure Custom Domain (optional):

    • Add custom domain in R2 bucket settings
    • Point DNS to R2 bucket URL
    • Enable public access for the bucket
  4. CORS Configuration (if accessing from browser):

{
  "AllowedOrigins": ["https://your-app.com"],
  "AllowedMethods": ["GET", "HEAD"],
  "AllowedHeaders": ["*"],
  "ExposeHeaders": ["Content-Length", "Content-Type"],
  "MaxAgeSeconds": 3600
}

Performance Considerations

Text Chunking

  • 4000 char limit: safe buffer below OpenAI’s 4096
  • 50 char overlap: prevents awkward sentence breaks
  • Recursive splitting: maintains natural paragraph/sentence flow

Audio Merging

  • In-memory concat: fast for typical story lengths (< 50KB per chunk)
  • Buffer pooling: efficient memory usage with Buffer.concat()
  • No intermediate files: everything happens in memory

R2 Upload

  • Direct buffer upload: no filesystem writes
  • Parallel processing: upload doesn’t block audio generation
  • Public CDN URLs: instant global availability
  • Cost: ~$0.015/GB storage, $0.00/GB egress (first 10 GB/month of storage free)

Metadata Parsing

  • Optional duration: parse only if needed
  • Non-blocking: failures logged but don’t break flow
  • File size tracking: useful for analytics and storage optimization

Error Handling

try {
  const result = await generateOpenAIAudio({ ... });

  if (!result.success) {
    // Handle OpenAI API errors
    console.error(result.message);
    return;
  }

  if (result.audioUrl) {
    // Use R2 URL (preferred)
    setAudioSrc(result.audioUrl);
  } else if (result.audioBuffer) {
    // Fallback to base64
    const blob = base64ToBlob(result.audioBuffer);
    setAudioSrc(URL.createObjectURL(blob));
  }
} catch (error) {
  // Network or authentication errors
  console.error("TTS Error:", error);
}

Common errors:

  • 401 Unauthorized: invalid OpenAI API key
  • 429 Rate Limited: too many requests (use exponential backoff)
  • 413 Payload Too Large: text exceeds chunking limits
  • 500 R2 Upload Failed: check R2 credentials and bucket config
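For the 429 case in particular, a simple exponential-backoff wrapper is usually enough. A sketch (retry counts and delays are arbitrary choices, not part of the implementation above):

// Retry a call with exponential backoff on rate limits (illustrative only).
async function withBackoff<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error: any) {
      const isRateLimit = error?.status === 429;
      if (!isRateLimit || attempt >= maxRetries) throw error;
      const delayMs = 1000 * 2 ** attempt; // 1s, 2s, 4s, ...
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}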

Production Checklist

  • OpenAI API key configured
  • R2 credentials set up
  • R2 bucket created and public domain configured
  • CORS configured for R2 bucket
  • Database schema includes audio columns
  • Rate limiting implemented for TTS endpoints
  • Error tracking (Sentry, LogRocket, etc.)
  • Audio duration parsing tested
  • Chunking tested with various text lengths
  • Fallback to base64 tested when R2 unavailable

Cost Estimation

OpenAI TTS Pricing (as of 2024):

  • gpt-4o-mini-tts: ~$1 per 10,000 characters
  • tts-1: ~$15 per 1M characters
  • tts-1-hd: ~$30 per 1M characters

Cloudflare R2 Pricing:

  • Storage: $0.015/GB/month
  • Class A operations (writes): $4.50 per million
  • Class B operations (reads): $0.36 per million
  • Egress: Free (no bandwidth charges)

Example: 1000 stories × 2000 chars each

  • OpenAI cost: 2M chars × $1 / 10K chars = $200
  • R2 storage: ~200 MB × $0.015/GB = ~$0.003/month
  • R2 writes: 1000 × $4.50 / 1M = ~$0.0045

Conclusion

This implementation provides a robust, production-ready solution for converting long-form text to audio with automatic cloud storage. Key benefits:

  1. Handles long content: automatic chunking + merging
  2. Multiple interfaces: GraphQL and REST for flexibility
  3. Dual fallback: R2 URLs + base64 for reliability
  4. Rich metadata: duration, voice, chunks tracked
  5. Cost-effective: Cloudflare R2 eliminates egress fees
  6. Developer-friendly: type-safe GraphQL, clean API

The pattern can be adapted for other use cases like podcast generation, audiobook creation, or voice-enabled content platforms.

Resources

Live Implementation

This implementation is used in production at:

LangSmith Prompt Management

· 13 min read
Vadim Nicolai
Senior Software Engineer

In the rapidly evolving landscape of Large Language Model (LLM) applications, prompt engineering has emerged as a critical discipline. As teams scale their AI applications, managing prompts across different versions, environments, and use cases becomes increasingly complex. This is where LangSmith's prompt management capabilities shine.

LangSmith, developed by LangChain, provides a comprehensive platform for managing, versioning, and collaborating on prompts—effectively bringing software engineering best practices to the world of prompt engineering.


The Challenge of Prompt Management

Why Prompt Management Matters

Prompts are the primary interface between your application and LLMs. As your AI application grows, you'll face several challenges:

  1. Version Control: Tracking changes to prompts over time and understanding what worked and what didn't
  2. Collaboration: Multiple team members working on prompts simultaneously
  3. Environment Management: Different prompts for development, staging, and production
  4. Performance Tracking: Understanding which prompt variations perform best
  5. Reusability: Sharing successful prompts across different projects and teams
  6. Rollback Safety: Ability to quickly revert to previous versions when something breaks

Without proper prompt management, these challenges can lead to:

  • Lost productivity from recreating prompts that were deleted or modified
  • Difficulty debugging when prompt changes cause unexpected behavior
  • Lack of visibility into what prompts are being used in production
  • Inability to A/B test different prompt variations systematically

LangSmith Prompt Management Features

1. Git-Like Version Control

LangSmith treats prompts as versioned artifacts, similar to how Git manages code. Each change to a prompt creates a new commit with a unique hash.

Key Features:

  • Commit History: Every modification is tracked with metadata
  • Diff Viewing: Compare versions to see exactly what changed
  • Rollback: Revert to any previous version instantly
  • Branching: Work on experimental prompts without affecting production

// Example from the codebase
export async function fetchLangSmithPromptCommit(
  promptIdentifier: string,
  options?: { includeModel?: boolean },
): Promise<LangSmithPromptCommit> {
  const client = getLangSmithClient();

  const commit = await client.pullPromptCommit(promptIdentifier, options);

  return {
    owner: commit.owner,
    promptName: commit.repo,
    commitHash: commit.commit_hash,
    manifest: commit.manifest,
    examples: commit.examples,
  };
}

2. Collaborative Prompt Repositories

LangSmith organizes prompts into repositories, enabling team collaboration:

  • Public Prompts: Share prompts with the broader community
  • Private Prompts: Keep proprietary prompts within your organization
  • Social Features: Like, download, and fork popular prompts
  • Discovery: Search and explore prompts created by others

// Listing prompts with filters
export async function listLangSmithPrompts(options?: {
  isPublic?: boolean;
  isArchived?: boolean;
  query?: string;
}): Promise<LangSmithPrompt[]> {
  const client = getLangSmithClient();
  const prompts: LangSmithPrompt[] = [];

  for await (const prompt of client.listPrompts(options)) {
    prompts.push({
      id: prompt.id,
      fullName: prompt.full_name,
      isPublic: prompt.is_public,
      tags: prompt.tags,
      numLikes: prompt.num_likes,
      numDownloads: prompt.num_downloads,
      // ... other metadata
    });
  }

  return prompts;
}

3. Rich Metadata and Organization

Prompts in LangSmith can be enriched with metadata:

  • Tags: Categorize prompts by use case, domain, or team
  • Descriptions: Document the purpose and usage
  • README: Provide comprehensive documentation
  • Examples: Include sample inputs and outputs

type LangSmithPrompt {
  id: String!
  promptHandle: String!
  fullName: String!
  description: String
  readme: String
  tags: [String!]!
  numCommits: Int!
  lastCommitHash: String
  createdAt: String!
  updatedAt: String!
}

4. User Ownership and Access Control

LangSmith implements robust ownership and permission models:

export function ensureUserPromptIdentifier(
  promptIdentifier: string,
  userEmail: string,
): string {
  const userHandle = toUserHandle(userEmail);

  // Prompts are namespaced by owner
  if (promptIdentifier.includes("/")) {
    const [owner, ...rest] = promptIdentifier.split("/");

    // Validate owner matches user
    if (owner !== userHandle) {
      throw new Error(`Prompt identifier owner does not match your handle`);
    }
    return promptIdentifier;
  }

  // Add user's handle as prefix
  return `${userHandle}/${promptIdentifier}`;
}

5. Integration with LLM Workflows

LangSmith prompts integrate seamlessly into your application:

// Create a prompt
const prompt = await createLangSmithPrompt("user/customer-support-classifier", {
  description: "Classifies customer support tickets",
  tags: ["support", "classification", "production"],
  isPublic: false,
});

// Push a new version
await pushLangSmithPrompt("user/customer-support-classifier", {
  object: {
    template: "Classify this ticket: {ticket_text}",
    model: "gpt-4",
    temperature: 0.2,
  },
  description: "Improved classification accuracy",
});

Best Practices for Prompt Management

1. Version Your Prompts Deliberately

Treat prompt changes with the same rigor as code changes:

  • Semantic Versioning: Use meaningful version identifiers
  • Commit Messages: Write clear descriptions of what changed and why
  • Test Before Merging: Validate prompts in development before promoting to production

2. Use Tags Strategically

Tags are powerful for organization and filtering:

const productionTags = [
  "env:production",
  "use-case:classification",
  "model:gpt-4",
  "team:support",
  "user:alice@example.com",
];

Recommended Tag Categories:

  • Environment: env:dev, env:staging, env:prod
  • Use Case: use-case:summarization, use-case:extraction
  • Model: model:gpt-4, model:claude-3
  • Owner: owner:team-name, user:email@domain.com
  • Status: status:experimental, status:stable

3. Document Your Prompts

Good documentation is essential for team collaboration:

await createLangSmithPrompt("user/email-generator", {
  description: "Generates professional email responses",
  readme: `
# Email Generator Prompt

## Purpose
Generates professional, context-aware email responses for customer inquiries.

## Input Format
- customer_name: String
- inquiry_type: "support" | "sales" | "billing"
- context: String (previous conversation)

## Output Format
Professional email in plain text

## Performance Notes
- Works best with GPT-4
- Temperature: 0.7 for natural variation
- Max tokens: 500

## Examples
See attached examples for common scenarios.
`,
  tags: ["communication", "email", "production"],
});

4. Implement User-Based Filtering

When building multi-tenant applications, filter prompts by user:

// From the GraphQL resolver
const userPrompts = allPrompts.filter((prompt) => {
  // Check user tags
  const hasUserTag = prompt.tags.some(
    (tag) =>
      tag.includes(`user:${context.userEmail}`) ||
      tag.includes(`owner:${context.userEmail}`),
  );

  // Check if owner matches
  const isOwner = prompt.owner === context.userEmail;

  return hasUserTag || isOwner;
});

5. Handle Permissions Gracefully

LangSmith requires specific API permissions. Handle errors clearly:

export async function createLangSmithPrompt(
  promptIdentifier: string,
  input?: CreateLangSmithPromptInput,
): Promise<LangSmithPrompt> {
  try {
    return await client.createPrompt(promptIdentifier, input);
  } catch (error: any) {
    if (
      error?.message?.includes("403") ||
      error?.message?.includes("Forbidden")
    ) {
      throw new Error(
        "LangSmith API key lacks 'Prompt Engineering' permissions. " +
          "Please generate a new API key with Read, Write, AND " +
          "Prompt Engineering scopes.",
      );
    }
    throw error;
  }
}

6. Use the Hub for Reusable Prompts

LangSmith Hub allows sharing prompts across projects:

  • Fork Popular Prompts: Start with community-tested templates
  • Share Successful Patterns: Contribute back to the community
  • Cross-Project Reuse: Reference the same prompt from multiple applications

Architectural Patterns

Pattern 1: GraphQL Wrapper for Type Safety

Wrap the LangSmith SDK with a GraphQL layer for type-safe client access:

extend type Mutation {
  pushLangSmithPrompt(
    promptIdentifier: String!
    input: PushLangSmithPromptInput
  ): String!
}

input PushLangSmithPromptInput {
  object: JSON
  parentCommitHash: String
  description: String
  tags: [String!]
  isPublic: Boolean
}

Benefits:

  • Type safety across frontend and backend
  • Centralized permission checking
  • Consistent error handling
  • Easy to mock for testing

Pattern 2: Singleton Client Pattern

Use a singleton to manage the LangSmith client:

let singleton: Client | null = null;

export function getLangSmithClient(): Client {
  if (!singleton) {
    const apiKey = process.env.LANGSMITH_API_KEY;
    if (!apiKey) {
      throw new Error("LANGSMITH_API_KEY required");
    }
    singleton = new Client({ apiKey });
  }
  return singleton;
}

Benefits:

  • Single point of configuration
  • Connection pooling
  • Consistent client instance across requests

Pattern 3: User Namespace Enforcement

Automatically namespace prompts by user to prevent conflicts:

export function toUserHandle(userEmail: string): string {
  return userEmail
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9@._-]+/g, "-")
    .replace(/@/g, "-at-")
    .replace(/\./g, "-");
}

// alice@company.com -> alice-at-company-com/my-prompt
const identifier = `${toUserHandle(userEmail)}/${promptName}`;

Integration with Your Application

Step 1: Environment Setup

# .env.local
LANGSMITH_API_KEY=lsv2_pt_...
LANGSMITH_API_URL=https://api.smith.langchain.com

Step 2: Client Initialization

import { Client } from "langsmith";

const client = new Client({
  apiKey: process.env.LANGSMITH_API_KEY,
  apiUrl: process.env.LANGSMITH_API_URL,
});

Step 3: Create and Version Prompts

// Initial creation
await client.pushPrompt("my-org/summarizer", {
  object: {
    _type: "prompt",
    input_variables: ["text"],
    template: "Summarize: {text}",
  },
  tags: ["v1", "production"],
  isPublic: false,
});

// Update with new version
await client.pushPrompt("my-org/summarizer", {
  object: {
    _type: "prompt",
    input_variables: ["text", "style"],
    template: "Summarize in {style} style: {text}",
  },
  tags: ["v2", "production"],
  isPublic: false,
});

Step 4: Retrieve and Use Prompts

// Get latest version
const commit = await client.pullPromptCommit("my-org/summarizer");

// Use the prompt
const template = commit.manifest.template;
const rendered = template
  .replace("{text}", article)
  .replace("{style}", "concise");

Monitoring and Analytics

LangSmith provides built-in analytics for prompts:

  • Usage Tracking: See how often each prompt is used
  • Performance Metrics: Track latency and success rates
  • Version Comparison: Compare metrics across versions
  • Cost Analysis: Monitor token usage per prompt

interface LangSmithPrompt {
  numViews: number;
  numDownloads: number;
  numLikes: number;
  numCommits: number;
  lastUsedAt?: string;
}

Common Pitfalls and Solutions

Pitfall 1: Missing Permissions

Problem: API key doesn't have "Prompt Engineering" scope

Solution:

// Always check permissions and provide clear error messages
if (error?.message?.includes("403")) {
  throw new Error(
    "Generate API key with Prompt Engineering permissions at " +
      "https://smith.langchain.com/settings",
  );
}

Pitfall 2: Unbounded Listing

Problem: Listing all prompts can be slow or timeout

Solution:

// Limit results and provide pagination
const MAX_PROMPTS = 100;
for await (const prompt of client.listPrompts(options)) {
  prompts.push(prompt);
  if (prompts.length >= MAX_PROMPTS) break;
}

Pitfall 3: Naming Conflicts

Problem: Multiple users trying to create prompts with same name

Solution:

// Always namespace by user/organization
const namespaced = `${organization}/${promptName}`;
await client.createPrompt(namespaced, options);

Advanced Features

Prompt Playground Integration

LangSmith provides a visual playground for testing prompts:

  1. Edit prompt templates interactively
  2. Test with sample inputs
  3. Compare outputs across models
  4. Iterate quickly without code changes
  5. Save successful variations as new commits

Example Sets

Attach example inputs/outputs to prompts:

await client.pushPrompt("classifier", {
  object: promptTemplate,
  examples: [
    {
      inputs: { text: "I love this product!" },
      outputs: { sentiment: "positive" },
    },
    {
      inputs: { text: "Terrible experience" },
      outputs: { sentiment: "negative" },
    },
  ],
});

Labels for Deployment Stages

Use labels to mark deployment stages:

// Tag specific commits for each environment
await client.updatePrompt("my-prompt", {
  tags: ["production", "v2.1.0"],
});

Migration Strategy

Migrating from Hardcoded Prompts

  1. Audit: Identify all prompts in your codebase
  2. Extract: Move prompts to LangSmith
  3. Refactor: Replace hardcoded strings with LangSmith fetches
  4. Test: Validate behavior matches
  5. Deploy: Roll out gradually with feature flags

Example migration:

// Before
const prompt = "Summarize this text: {text}";

// After
const commit = await client.pullPromptCommit("my-org/summarizer");
const prompt = commit.manifest.template;

Migrating from Custom Storage

  1. Export: Extract prompts from your current system
  2. Bulk Create: Use LangSmith API to create prompts
  3. Preserve History: Import version history as commits
  4. Update References: Point code to LangSmith
  5. Deprecate: Phase out old system

Conclusion

LangSmith's prompt management capabilities bring professional software engineering practices to the world of LLM applications. By treating prompts as versioned, collaborative artifacts, teams can:

  • Move Faster: Test and iterate on prompts without fear
  • Collaborate Better: Work together on prompts with clear ownership
  • Deploy Safely: Roll back problematic changes instantly
  • Scale Confidently: Manage hundreds of prompts across projects
  • Share Knowledge: Learn from community-tested patterns

As AI applications grow in complexity, proper prompt management becomes not just a nice-to-have but a necessity. LangSmith provides the infrastructure to manage this critical aspect of your AI stack.


Additional Resources

Langfuse Features: Prompts, Tracing, Scores, Usage

· 11 min read
Vadim Nicolai
Senior Software Engineer

A comprehensive guide to implementing Langfuse features for production-ready AI applications, covering prompt management, tracing, evaluation, and observability.

Overview

This guide covers:

  • Prompt management with caching and versioning
  • Distributed tracing with OpenTelemetry
  • User feedback and scoring
  • Usage tracking and analytics
  • A/B testing and experimentation

OpenRouter Integration with DeepSeek

· 9 min read
Vadim Nicolai
Senior Software Engineer

This article documents the complete OpenRouter integration implemented in Nomadically.work, using DeepSeek models exclusively through a unified API.

Architecture Overview

Module Structure

Core Features

1. Provider Configuration

The provider layer handles OpenRouter API communication using the OpenAI SDK compatibility layer.

Implementation Details:

  • Uses @ai-sdk/openai package for API compatibility
  • Lazy-loaded provider instance to support testing without API key
  • Configurable reasoning tokens (default: 10,000 max_tokens)
  • Custom headers for analytics and tracking
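The provider code itself is not reproduced in this article; a minimal sketch of what such a factory could look like on top of @ai-sdk/openai (the OpenRouter base URL is the public one, while the function name, option shape, and env var handling are assumptions):

import { createOpenAI } from "@ai-sdk/openai";

// Hypothetical factory: point the OpenAI-compatible provider at OpenRouter.
export function createOpenRouter(options?: { headers?: Record<string, string> }) {
  return createOpenAI({
    baseURL: "https://openrouter.ai/api/v1",
    apiKey: process.env.OPENROUTER_API_KEY,
    headers: {
      "HTTP-Referer": process.env.OPENROUTER_SITE_URL ?? "",
      "X-Title": process.env.OPENROUTER_SITE_NAME ?? "",
      ...options?.headers,
    },
  });
}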

2. DeepSeek Model Access

Five DeepSeek models are available through the integration:

Model Selection Guide:

  • DeepSeek Chat: General-purpose conversations, Q&A, text generation
  • DeepSeek R1: Complex reasoning, multi-step analysis, decision-making
  • DeepSeek Coder: Code generation, debugging, technical documentation
  • R1 Distill Qwen 32B: Faster inference for reasoning tasks
  • R1 Distill Llama 70B: High-quality reasoning with better performance
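The deepseekModels helpers used in the examples below could then be thin wrappers over that provider; a sketch where the OpenRouter model slugs are assumptions and should be checked against the current OpenRouter catalog:

// Hypothetical model map; slugs are assumptions, verify against OpenRouter.
const openrouter = createOpenRouter();

export const DEEPSEEK_MODELS = {
  CHAT: "deepseek/deepseek-chat",
  R1: "deepseek/deepseek-r1",
} as const;

export const deepseekModels = {
  chat: () => openrouter(DEEPSEEK_MODELS.CHAT),
  r1: () => openrouter(DEEPSEEK_MODELS.R1),
};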

3. Agent Creation Patterns

Three patterns for creating agents with different levels of abstraction:

Pattern Comparison:

| Pattern   | Use Case                           | Flexibility | Setup Time |
| --------- | ---------------------------------- | ----------- | ---------- |
| Templates | Quick prototyping, demos           | Low         | Seconds    |
| Helpers   | Standard agents with custom config | Medium      | Minutes    |
| Direct    | Advanced use cases, full control   | High        | Minutes    |

4. Agent Template Flow

5. Configuration System

Usage Examples

Basic Agent Creation

import { agentTemplates } from "@/openrouter";

// Quick start with template
const assistant = agentTemplates.assistant();

const response = await assistant.generate([
  { role: "user", content: "What are remote work benefits?" },
]);

Custom Agent with Specific Model

import { createChatAgent, deepseekModels } from "@/openrouter";

// Using helper function
const jobClassifier = createChatAgent({
  id: "job-classifier",
  name: "Job Classifier",
  instructions: "You are an expert at classifying job postings.",
  model: "chat",
});

// Or using model directly
import { Agent } from "@mastra/core/agent";

const reasoningAgent = new Agent({
  model: deepseekModels.r1(),
  name: "Reasoning Agent",
  instructions: "Think step by step about complex problems.",
});

Advanced Configuration

import { createOpenRouter, DEEPSEEK_MODELS } from "@/openrouter";

const customProvider = createOpenRouter({
  reasoning: {
    max_tokens: 15000,
  },
  headers: {
    "HTTP-Referer": "https://nomadically.work",
    "X-Title": "Job Platform AI",
  },
});

const model = customProvider(DEEPSEEK_MODELS.R1);

Data Flow

Request Flow

Error Handling Flow

Integration Points

Mastra Agent Integration

Environment Configuration

Required Variables

# Core configuration
OPENROUTER_API_KEY=sk-or-v1-your-api-key-here

# Optional configuration
OPENROUTER_SITE_NAME="Nomadically.work"
OPENROUTER_SITE_URL="https://nomadically.work"

Deployment Flow

Performance Characteristics

Model Comparison

Benefits

OpenRouter Advantages

Testing Strategy

Test Coverage

Run tests with:

pnpm test:openrouter

Type Safety

TypeScript Types

Migration Path

From Direct DeepSeek SDK

Resources

Summary

This OpenRouter integration provides:

  • Unified API Access - Single interface for all DeepSeek models
  • Type-Safe - Full TypeScript support with compile-time validation
  • Flexible - Three levels of abstraction for different use cases
  • Production-Ready - Error handling, fallbacks, and monitoring
  • Well-Tested - Comprehensive test suite with live API validation
  • Well-Documented - Complete examples and migration guides

The module is designed for scalability, maintainability, and developer experience while providing reliable access to state-of-the-art AI models through OpenRouter's infrastructure.

AI-Driven Company Enrichment with DeepSeek via Cloudflare Browser Rendering

· 4 min read
Vadim Nicolai
Senior Software Engineer

This page documents an AI-first enrichment pipeline that turns a company website into a clean, structured company profile you can safely persist into your database and expose through GraphQL.

The core idea is simple:

  • Use Cloudflare Browser Rendering /json to load a real rendered page (including JavaScript-heavy sites).
  • Use DeepSeek to convert the rendered page into a strict JSON-only object (no markdown, no prose).

High-level architecture

This pipeline has five clear layers, each with a single responsibility:

  • Entry: GraphQL mutation identifies the target company.
  • Acquisition: Browser Rendering fetches a fully rendered page.
  • Extraction: DeepSeek converts HTML into JSON-only structure.
  • Governance: validation, normalization, and audit snapshot.
  • Persistence: upserts for company + ATS boards, then return.

Classification

A single enum-like category so downstream logic can branch cleanly:

  • company.category is one of:
    • CONSULTANCY | AGENCY | STAFFING | DIRECTORY | PRODUCT | OTHER | UNKNOWN

UNKNOWN is intentionally allowed to prevent “forced certainty”.

Two links that unlock most automation:

  • company.careers_url — best official careers entrypoint (prefer internal)
  • company.linkedin_url — best LinkedIn company page (/company/...)

Hiring infrastructure

Detect ATS/job boards (useful for job syncing, vendor analytics, integrations):

  • ats_boards[] entries containing:
    • url
    • vendor
    • board_type (ats | careers_page | jobs_board)
    • confidence (0..1)
    • is_active

Provenance and uncertainty

To keep AI outputs accountable:

  • evidence — where it came from (URL) + any known fetch metadata
  • notes[] — uncertainty/caveats without polluting structured fields
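Taken together, the classification, key links, ATS, and provenance fields suggest a result shape along these lines (a sketch; the exact field and type names used in the pipeline are not shown here):

// Hypothetical TypeScript shape for the JSON-only extraction result.
type CompanyCategory =
  | "CONSULTANCY" | "AGENCY" | "STAFFING"
  | "DIRECTORY" | "PRODUCT" | "OTHER" | "UNKNOWN";

interface AtsBoard {
  url: string;
  vendor: string | null;
  board_type: "ats" | "careers_page" | "jobs_board";
  confidence: number; // 0..1
  is_active: boolean;
}

interface ExtractionResult {
  company: {
    name: string;
    category: CompanyCategory;
    careers_url: string | null;
    linkedin_url: string | null;
  };
  ats_boards: AtsBoard[];
  evidence: { source_url: string };
  notes: string[];
}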

Top-down architecture


Why Cloudflare Browser Rendering /json is the right AI boundary

Many company websites are JS-heavy (SPAs), and the key links you want (Careers, LinkedIn, ATS) often live in:

  • global navigation/header
  • footer “social” section
  • content that only appears after JS renders

The /json endpoint is designed to extract structured JSON from the rendered page, using:

  • url (or html)
  • a prompt (and optionally response_format for JSON Schema depending on provider support)
  • custom_ai to route extraction through your chosen model

For JS-heavy pages, waiting for rendering to finish matters. This is why the extractor uses:

  • gotoOptions.waitUntil = "networkidle0"

AI contract: JSON-only output

When you route through custom_ai with BYO providers, schema-enforced responses can be provider-dependent. The safest universal strategy is:

  • treat the prompt as a strict contract
  • demand ONLY valid JSON
  • define the expected shape explicitly
  • instruct null/[] for unknown values
  • push uncertainty into notes[]

This turns an LLM into a bounded parser.
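A minimal example of such a contract, expressed as a prompt string (the wording is illustrative, not the production prompt):

// Illustrative JSON-only extraction contract; not the production prompt.
const extractionPrompt = `
Return ONLY valid JSON. No markdown, no prose, no code fences.
Expected shape:
{
  "company": { "name": string, "category": "CONSULTANCY"|"AGENCY"|"STAFFING"|"DIRECTORY"|"PRODUCT"|"OTHER"|"UNKNOWN",
               "careers_url": string|null, "linkedin_url": string|null },
  "ats_boards": [{ "url": string, "vendor": string|null, "board_type": "ats"|"careers_page"|"jobs_board",
                   "confidence": number, "is_active": boolean }],
  "notes": string[]
}
Use null or [] for unknown values; put uncertainty and caveats in "notes".
`.trim();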


Implementation: Cloudflare-first with a direct DeepSeek fallback

Below is the same flow, expressed as architecture instead of code:

  • Inputs: company id/key and target URL.
  • Acquisition: Browser Rendering /json fetches a rendered page.
  • Extraction: DeepSeek produces a JSON-only record.
  • Governance: validate, normalize, and snapshot the output.
  • Persistence: upsert company + ATS boards, then return result.

Persistence guardrails (keep the AI safe)

Even with JSON-only output, the DB write must remain your code’s responsibility.

1) Validate shape before persistence

At minimum, verify:

  • company.name exists and is non-empty
  • any present URLs are absolute (https://...)
  • arrays are arrays
  • category is one of the allowed values

If validation fails, either retry extraction (stricter prompt) or fall back.
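A lightweight guard along these lines can enforce those checks before any write (a sketch reusing the ExtractionResult shape above; the real pipeline's validation code is not shown):

// Minimal pre-persistence validation matching the checks listed above.
const ALLOWED_CATEGORIES = new Set<string>([
  "CONSULTANCY", "AGENCY", "STAFFING", "DIRECTORY", "PRODUCT", "OTHER", "UNKNOWN",
]);

function isAbsoluteHttpsUrl(value: unknown): boolean {
  return typeof value === "string" && value.startsWith("https://");
}

function validateExtraction(result: ExtractionResult): string[] {
  const errors: string[] = [];
  if (!result.company.name.trim()) errors.push("company.name is empty");
  if (!ALLOWED_CATEGORIES.has(result.company.category)) errors.push("invalid category");
  for (const url of [result.company.careers_url, result.company.linkedin_url]) {
    if (url != null && !isAbsoluteHttpsUrl(url)) errors.push(`non-absolute URL: ${url}`);
  }
  if (!Array.isArray(result.ats_boards)) errors.push("ats_boards is not an array");
  return errors; // empty array means the shape passed validation
}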

2) Canonicalize URLs before upserts

To avoid duplicates, normalize:

  • remove #fragment
  • normalize trailing slash
  • lowercase host
  • optionally strip tracking params

3) Treat vendor and board_type as hints

LLMs can emit vendor variants (e.g., Smart Recruiters, smartrecruiters). Normalize before mapping to enums.

4) Always snapshot the raw extraction

Saving the full ExtractionResult into companySnapshots.extracted buys you:

  • debugging (“why did this change?”)
  • regression detection
  • prompt iteration without losing history

References

https://github.com/nicolad/nomadically.work

https://nomadically.work/

Agent Skills spec + Mastra integration

· 9 min read
Vadim Nicolai
Senior Software Engineer

Agent Skills Specification

Source: https://agentskills.io/specification

This document defines the Agent Skills format.

Directory structure

A skill is a directory containing at minimum a SKILL.md file:

skill-name/
└── SKILL.md # Required

Tip: You can optionally include additional directories such as scripts/, references/, and assets/ to support your skill.

SKILL.md format

The SKILL.md file must contain YAML frontmatter followed by Markdown content.

Frontmatter (required)

Minimal example:

---
name: skill-name
description: A description of what this skill does and when to use it.
---

With optional fields:

---
name: pdf-processing
description: Extract text and tables from PDF files, fill forms, merge documents.
license: Apache-2.0
metadata:
  author: example-org
  version: "1.0"
---
| Field | Required | Notes |
| ----- | -------- | ----- |
| name | Yes | Max 64 characters. Lowercase letters, numbers, and hyphens only. Must not start or end with a hyphen. |
| description | Yes | Max 1024 characters. Non-empty. Describes what the skill does and when to use it. |
| license | No | License name or reference to a bundled license file. |
| compatibility | No | Max 500 characters. Indicates environment requirements (intended product, system packages, network access, etc.). |
| metadata | No | Arbitrary key-value mapping for additional metadata. |
| allowed-tools | No | Space-delimited list of pre-approved tools the skill may use. (Experimental) |

name field

The required name field:

  • Must be 1-64 characters
  • May only contain unicode lowercase alphanumeric characters and hyphens (a-z, 0-9, and -)
  • Must not start or end with -
  • Must not contain consecutive hyphens (--)
  • Must match the parent directory name

Valid examples:

name: pdf-processing
name: data-analysis
name: code-review

Invalid examples:

name: PDF-Processing  # uppercase not allowed
name: -pdf  # cannot start with hyphen
name: pdf--processing  # consecutive hyphens not allowed

description field

The required description field:

  • Must be 1-1024 characters
  • Should describe both what the skill does and when to use it
  • Should include specific keywords that help agents identify relevant tasks

Good example:

description: Extracts text and tables from PDF files, fills PDF forms, and merges multiple PDFs. Use when working with PDF documents or when the user mentions PDFs, forms, or document extraction.

Poor example:

description: Helps with PDFs.

license field

The optional license field:

  • Specifies the license applied to the skill
  • We recommend keeping it short (either the name of a license or the name of a bundled license file)

Example:

license: Proprietary. LICENSE.txt has complete terms

compatibility field

The optional compatibility field:

  • Must be 1-500 characters if provided
  • Should only be included if your skill has specific environment requirements
  • Can indicate intended product, required system packages, network access needs, etc.

Examples:

compatibility: Designed for Claude Code (or similar products)
compatibility: Requires git, docker, jq, and access to the internet

Note: Most skills do not need the compatibility field.

metadata field

The optional metadata field:

  • A map from string keys to string values
  • Clients can use this to store additional properties not defined by the Agent Skills spec
  • We recommend making your key names reasonably unique to avoid accidental conflicts

Example:

metadata:
  author: example-org
  version: "1.0"

allowed-tools field

The optional allowed-tools field:

  • A space-delimited list of tools that are pre-approved to run
  • Experimental. Support for this field may vary between agent implementations

Example:

allowed-tools: Bash(git:*) Bash(jq:*) Read

Body content

The Markdown body after the frontmatter contains the skill instructions. There are no format restrictions. Write whatever helps agents perform the task effectively.

Recommended sections:

  • Step-by-step instructions
  • Examples of inputs and outputs
  • Common edge cases

Note: The agent will load this entire file once it's decided to activate a skill. Consider splitting longer SKILL.md content into referenced files.

Optional directories

scripts/

Contains executable code that agents can run. Scripts should:

  • Be self-contained or clearly document dependencies
  • Include helpful error messages
  • Handle edge cases gracefully

Supported languages depend on the agent implementation. Common options include Python, Bash, and JavaScript.

references/

Contains additional documentation that agents can read when needed:

  • REFERENCE.md - Detailed technical reference
  • FORMS.md - Form templates or structured data formats
  • Domain-specific files (finance.md, legal.md, etc.)

Keep individual reference files focused. Agents load these on demand, so smaller files mean less use of context.

assets/

Contains static resources:

  • Templates (document templates, configuration templates)
  • Images (diagrams, examples)
  • Data files (lookup tables, schemas)

Progressive disclosure

Skills should be structured for efficient use of context:

  1. Metadata (~100 tokens): The name and description fields are loaded at startup for all skills
  2. Instructions (< 5000 tokens recommended): The full SKILL.md body is loaded when the skill is activated
  3. Resources (as needed): Files (e.g. those in scripts/, references/, or assets/) are loaded only when required

Keep your main SKILL.md under 500 lines. Move detailed reference material to separate files.

File references

When referencing other files in your skill, use relative paths from the skill root:

See [the reference guide](references/REFERENCE.md) for details.

Run the extraction script:
scripts/extract.py

Keep file references one level deep from SKILL.md. Avoid deeply nested reference chains.

Validation

Use the skills-ref reference library to validate your skills:

skills-ref validate ./my-skill

This checks that your SKILL.md frontmatter is valid and follows all naming conventions.


Documentation index first

The Agent Skills docs are designed to be discovered via a single index file (llms.txt). Use that as the entrypoint whenever you’re exploring the spec surface area.


What are skills?

Agent Skills are a lightweight, file-based format for packaging reusable agent instructions and workflows (plus optional scripts/assets). Agents use progressive disclosure:

  1. Discovery: load only name + description metadata
  2. Activation: load the full SKILL.md body for a matching task
  3. Execution: read references / run scripts as needed

Skill directory structure

Minimum required:

skill-name/
└── SKILL.md

Common optional directories (same convention is used by Mastra workspaces):

skill-name/
├── SKILL.md
├── references/ # extra docs (optional)
├── scripts/ # executable code (optional)
└── assets/ # templates/images/etc. (optional)

SKILL.md specification essentials

Frontmatter requirements

SKILL.md must start with YAML frontmatter with at least:

  • name (strict naming constraints; should match the folder name)
  • description (non-empty; should say what + when; include “trigger keywords”)

Optional fields defined by the spec include license, compatibility, metadata, and experimental allowed-tools.

Body content

After frontmatter: normal Markdown instructions. The spec recommends practical steps, examples, and edge cases (and keeping SKILL.md reasonably small to support progressive disclosure).

A spec-friendly template

---
name: code-review
description: Reviews code for quality, style, and potential issues. Use when asked to review PRs, diffs, TypeScript/Node projects, or linting failures.
license: Apache-2.0
compatibility: Requires node and access to repository files
metadata:
  version: "1.0.0"
  tags: "development review"
---

# Code Review

## When to use this skill
- Trigger phrases: "review this PR", "code review", "lint errors", "style guide"

## Procedure
1. Identify the change scope and risk.
2. Check for correctness, edge cases, and error handling.
3. Verify style rules in references/style-guide.md.
4. If available, run scripts/lint.ts and summarize results.

## Output format
- Summary
- Issues (by severity)
- Suggested diffs
- Follow-ups/tests

Note: Mastra’s docs show version and tags as top-level keys in frontmatter. Depending on your validator/tooling, the safest cross-implementation choice is to store extras under metadata. (mastra.ai)


Mastra integration

Mastra workspaces support skills starting in @mastra/core@1.1.0. (mastra.ai)

1) Place skills under your workspace filesystem basePath

Mastra treats skill paths as relative to the workspace filesystem basePath. (mastra.ai)

In your repo, the main workspace is configured with:

  • basePath: "./src/workspace"
  • skills: ["/skills"]

That means the actual on-disk skills folder should be:

./src/workspace/skills/
└── your-skill-name/
    └── SKILL.md

2) Configure skills on a workspace

Mastra enables discovery by setting skills on the workspace. (mastra.ai)

import { Workspace, LocalFilesystem } from "@mastra/core/workspace";

export const workspace = new Workspace({
  filesystem: new LocalFilesystem({ basePath: "./src/workspace" }),
  skills: ["/skills"],
});

You can provide multiple skill directories (still relative to basePath). (mastra.ai)

skills: [
  "/skills",      // Project skills
  "/team-skills", // Shared team skills
],

3) Dynamic skill directories (context-aware)

Mastra also supports a function form for skills, so you can vary skill sets by user role, tenant, environment, etc. (mastra.ai)

skills: (context) => {
  const paths = ["/skills"];
  if (context.user?.role === "developer") paths.push("/dev-skills");
  return paths;
},

4) What Mastra does “under the hood”

When a skill is activated, its instructions are added to the conversation context and the agent can access references/scripts in that skill folder. Mastra describes the runtime flow as: (mastra.ai)

  1. List available skills in the system message
  2. Allow agents to activate skills during conversation
  3. Provide access to skill references and scripts

This maps cleanly onto the Agent Skills “discovery → activation → execution” model. (agentskills.io)

5) Skill search and indexing in Mastra

Mastra workspaces support BM25, vector, and hybrid search. (mastra.ai)

If BM25 or vector search is enabled, Mastra will automatically index skills so agents can search within skill content to find relevant instructions. (mastra.ai)

Example (BM25-only):

const workspace = new Workspace({
  filesystem: new LocalFilesystem({ basePath: "./src/workspace" }),
  skills: ["/skills"],
  bm25: true,
});

If you enable vector or hybrid search, indexing uses your embedder and vector store (and BM25 uses tokenization + term statistics). (mastra.ai)


Repo conventions that work well

  • One skill per folder, folder name matches frontmatter.name.

  • Keep SKILL.md focused on the “operator manual”; push deep theory to references/.

  • Put runnable helpers in scripts/ and make them deterministic (clear inputs/outputs).

  • Treat destructive actions as opt-in:

    • Use workspace tool gating (approval required, delete disabled) for enforcement.
    • Optionally declare allowed-tools in SKILL.md for portability across other skill runtimes. (agentskills.io)

AI-Powered Skill Extraction with Cloudflare Embeddings and a Vector Taxonomy

· 4 min read
Vadim Nicolai
Senior Software Engineer

This bulk processor extracts structured skill tags for job postings using an AI pipeline that combines:

  • Embedding generation via Cloudflare Workers AI (@cf/baai/bge-small-en-v1.5, 384-dim)
  • Vector retrieval over a skills taxonomy (Turso/libSQL index skills_taxonomy) for candidate narrowing
  • Mastra workflow orchestration for LLM-based structured extraction + validation + persistence
  • Production-grade run controls: robust logging, progress metrics, graceful shutdown, and per-item failure isolation

It’s designed for real-world runs where you expect rate limits, transient failures, and safe restarts.


Core constraint: embedding dimension ↔ vector index schema

The taxonomy retrieval layer is backed by a Turso/libSQL vector index:

  • Index name: skills_taxonomy
  • Embedding dimension (required): 384
  • Embedding model: @cf/baai/bge-small-en-v1.5 (384-dim)

If the index dimension isn’t 384, vector search can fail or degrade into meaningless similarity scores.
The script prevents this by validating stats.dimension === 384 before processing.


Architecture overview (pipeline flow)


Retrieval + extraction: what happens per job

1) Retrieval: candidate narrowing via vector search

  • Convert the relevant job text to embeddings using Cloudflare Workers AI.
  • Use vector similarity search in skills_taxonomy to retrieve the top-N candidate skills.
  • Candidates constrain the downstream LLM step (better precision, lower cost).
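A minimal sketch of this retrieval step using @libsql/client follows. The index name and dimension come from the schema above; the env var names, the base table (skills), and its columns are illustrative assumptions.

import { createClient } from "@libsql/client";

const db = createClient({
  url: process.env.TURSO_DATABASE_URL!,        // assumed env var names
  authToken: process.env.TURSO_AUTH_TOKEN,
});

async function topCandidateSkills(queryEmbedding: number[], k = 10) {
  // vector_top_k() is libSQL's native ANN lookup over the named vector index.
  const result = await db.execute({
    sql: `SELECT s.id, s.name
          FROM vector_top_k('skills_taxonomy', vector32(?), ?) AS v
          JOIN skills s ON s.rowid = v.id`,
    args: [JSON.stringify(queryEmbedding), k],
  });
  return result.rows;
}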

2) Extraction: structured inference via Mastra workflow

A cached Mastra workflow (extractJobSkillsWorkflow) performs:

  • prompt + schema-driven extraction
  • normalization (matching to taxonomy terms/ids)
  • validation (reject malformed outputs)
  • persistence into job_skill_tags

On failure, the script logs workflow status and step details for debugging.


Cloudflare Workers AI embeddings

Model contract and hardening

  • Model: @cf/baai/bge-small-en-v1.5
  • Vectors: 384 dimensions
  • Input contract: strict array of strings
  • Timeout: 45s (AbortController)
  • Output contract: explicit response shape checks (fail early on unexpected payloads)

This is important because embedding pipelines can silently drift if the response shape changes or inputs are malformed.
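A hedged sketch of such a hardened embedding call is shown below, assuming Cloudflare's REST endpoint for Workers AI. The env var names are assumptions, and the actual wrapper in the script may differ.

async function embed(texts: string[]): Promise<number[][]> {
  // Input contract: strict array of strings.
  if (!Array.isArray(texts) || texts.some((t) => typeof t !== "string")) {
    throw new Error("Embedding input must be an array of strings");
  }

  // Timeout: abort the call if it exceeds 45s.
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), 45_000);

  try {
    const res = await fetch(
      `https://api.cloudflare.com/client/v4/accounts/${process.env.CLOUDFLARE_ACCOUNT_ID}/ai/run/@cf/baai/bge-small-en-v1.5`,
      {
        method: "POST",
        headers: {
          Authorization: `Bearer ${process.env.CLOUDFLARE_API_TOKEN}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({ text: texts }),
        signal: controller.signal,
      },
    );
    if (!res.ok) throw new Error(`Workers AI returned ${res.status}`);

    // Output contract: fail early on unexpected payloads or wrong dimensionality.
    const payload = (await res.json()) as any;
    const vectors = payload?.result?.data;
    if (!Array.isArray(vectors) || vectors.some((v: unknown) => !Array.isArray(v) || v.length !== 384)) {
      throw new Error("Unexpected embedding response shape");
    }
    return vectors as number[][];
  } finally {
    clearTimeout(timer);
  }
}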

Dimension enforcement (non-negotiable)

If skills_taxonomy was created/seeded with a different dimension:

  • similarity search becomes invalid (best case: errors; worst case: plausible-but-wrong matches)

The script enforces stats.dimension === 384 to keep retrieval semantically meaningful.


Turso/libSQL vector taxonomy index

  • Storage: Turso (libSQL)
  • Index: skills_taxonomy
  • Schema dimension: 384
  • Role: retrieval layer for skills ontology/taxonomy

The script also ensures the index is populated (count > 0), otherwise it fails fast and directs you to seed.
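The checks themselves are simple. In the sketch below, getTaxonomyIndexStats() is a hypothetical helper standing in for however the script reads index metadata; the fail-fast behavior is the point.

// { dimension: number; count: number } from the index metadata
const stats = await getTaxonomyIndexStats();

if (stats.dimension !== 384) {
  throw new Error(`skills_taxonomy dimension is ${stats.dimension}, expected 384: recreate or re-seed the index`);
}
if (stats.count === 0) {
  throw new Error("skills_taxonomy is empty: seed the taxonomy before running extraction");
}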


Reliability and operational controls

Observability: console + file tee logs

  • tees console.log/warn/error to a timestamped file and the terminal
  • log naming: extract-job-skills-<ISO timestamp>-<pid>.log
  • degrades to console-only logging if file IO fails
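A minimal sketch of this tee, assuming Node's fs streams; the actual script may structure it differently.

import { createWriteStream } from "node:fs";

function setupTeeLogging() {
  try {
    const file = createWriteStream(
      `extract-job-skills-${new Date().toISOString().replace(/[:.]/g, "-")}-${process.pid}.log`,
      { flags: "a" },
    );
    for (const level of ["log", "warn", "error"] as const) {
      const original = console[level].bind(console);
      console[level] = (...args: unknown[]) => {
        original(...args);                     // keep terminal output
        file.write(args.map(String).join(" ") + "\n");
      };
    }
  } catch {
    // File IO failed: degrade gracefully to console-only logging.
  }
}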

Graceful termination

  • SIGINT / SIGTERM sets a shouldStop flag
  • the loop exits after the current job completes
  • avoids interrupting in-flight workflow steps (embedding/LLM/DB writes)

Idempotency / restart safety

Even after selecting jobs without tags, the script re-checks:

  • jobAlreadyHasSkills(jobId)

This avoids duplicate inference when:

  • you restart mid-run
  • multiple workers run concurrently
  • the initial query snapshot becomes stale

Throughput shaping

  • sequential processing
  • a fixed 1s backoff between jobs (simple, reliable rate-limit mitigation)
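Putting the three controls together (graceful stop, idempotency re-check, fixed backoff), the run loop looks roughly like the sketch below. jobsWithoutTags and extractSkillsForJob are illustrative names; jobAlreadyHasSkills is the check described above.

let shouldStop = false;
process.on("SIGINT", () => { shouldStop = true; });
process.on("SIGTERM", () => { shouldStop = true; });

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

for (const job of jobsWithoutTags) {           // initial query snapshot
  if (shouldStop) {
    console.log("Stop requested, exiting after the current job");
    break;
  }

  // Restart/concurrency safety: re-check even though the query already filtered.
  if (await jobAlreadyHasSkills(job.id)) continue;

  try {
    await extractSkillsForJob(job);            // embedding → retrieval → workflow → persist
  } catch (err) {
    console.error(`Job ${job.id} failed, continuing with the next one`, err);
  }

  await sleep(1_000);                          // fixed backoff between jobs
}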

Failure modes

Retrieval layer failures (index health)

Triggers:

  • index missing
  • dimension mismatch (not 384)
  • empty index (count === 0)

Behavior: fail fast with actionable logs (recreate index / re-seed / verify DB target).

Embedding timeouts

Symptom: embedding call exceeds 45s and aborts. Behavior: job fails; run continues.

Mitigations:

  • chunk long descriptions upstream
  • add retry/backoff on transient 429/5xx
  • monitor Workers AI service health
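None of these mitigations are in the script yet; a simple retry/backoff wrapper might look like this sketch.

async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;                         // in practice, only retry on 429/5xx or network errors
      await new Promise((resolve) => setTimeout(resolve, 2 ** i * 1_000)); // 1s, 2s, 4s
    }
  }
  throw lastError;
}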

Workflow failures

Behavior: job is marked failed; run continues. Logs include step trace and error payload to accelerate debugging.


Quick reference

  • Embeddings: Cloudflare Workers AI @cf/baai/bge-small-en-v1.5 (384-dim)
  • Retrieval: Turso/libSQL vector index skills_taxonomy (384-dim)
  • Orchestration: Mastra workflow extractJobSkillsWorkflow
  • Persistence: job_skill_tags
  • Embedding timeout: 45s
  • Stop behavior: graceful after current job (SIGINT / SIGTERM)

AI Observability for LLM Evals with Langfuse

· 10 min read
Vadim Nicolai
Senior Software Engineer

This article documents an evaluation harness for a Remote EU job classifier—but the real focus is AI observability: how to design traces, spans, metadata, scoring, and run-level grouping so you can debug, compare, and govern LLM behavior over time.

The script runs a batch of curated test cases, loads the latest production prompt from Langfuse (with a safe fallback), executes a structured LLM call, scores results, and publishes metrics back into Langfuse. That gives you:

  • Reproducibility (prompt versions + test set + session IDs)
  • Debuggability (one trace per test case; inspect inputs/outputs)
  • Comparability (run-level aggregation; trend metrics across changes)
  • Operational safety (flush guarantees, CI thresholds, rate-limit control)

Why "observability-first" evals matter

A typical eval script prints expected vs actual and calls it a day. That's not enough once you:

  • iterate prompts weekly,
  • swap models,
  • add guardrails,
  • change schemas,
  • tune scoring rubrics,
  • and need to explain regressions to humans.

Observability-first evals answer questions like:

  • Which prompt version produced the regression?
  • Is accuracy stable but confidence becoming overconfident?
  • Are failures clustered by location phrasing ("EMEA", "EU timezone", "Worldwide")?
  • Did we increase tokens/latency without improving correctness?
  • Can we click from CI logs straight into the trace of the failing example?

Langfuse becomes your "flight recorder": the trace is the unit of truth for what happened.


End-to-end architecture


Observability design: what gets traced and why

Trace strategy: one trace per test case

Principle: if you can't click into an individual example, you can't debug.

Each test case produces a Langfuse trace (think "request-level unit"), tagged with:

  • sessionId: groups a full run (critical for comparisons)
  • testCaseId, description: anchors the trace to your dataset
  • prompt metadata: name/label/version/hash (ideal)
  • model metadata: provider, model name, parameters (ideal)

This makes failures navigable and filterable.
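A minimal sketch using the Langfuse JS SDK; the metadata fields beyond sessionId and testCaseId are the "ideal" extras from the list above and may not all exist in the current harness.

import { Langfuse } from "langfuse";

const langfuse = new Langfuse();               // reads the LANGFUSE_* credentials from the environment

const sessionId = `eval-${Date.now()}`;        // run-level grouping (see below)

// Inside the per-test-case loop:
const trace = langfuse.trace({
  name: "remote-eu-classification",
  sessionId,
  metadata: {
    testCaseId: testCase.id,
    description: testCase.description,
    promptName: "job-classifier",
    promptLabel: "production",
  },
  tags: ["eval", "remote-eu"],
});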

Span strategy: one generation per model call

Inside each trace, you create a generation span for the model call:

  • captures input (prompt + job posting)
  • captures output (structured object + reason)
  • captures usage (token counts)
  • optionally captures latency (recommended)
  • optionally captures model params (temperature, top_p, etc.)

Even if the script is "just evals," treat each example like production traffic. That's how you build a reliable debugging workflow.
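A sketch of wrapping the model call in a generation span. trace, promptText, and testCase come from the per-case setup above; classifyJob is a hypothetical structured-output helper, and the usage field names vary slightly across Langfuse SDK versions.

const generation = trace.generation({
  name: "classify",
  model: "deepseek-chat",                      // assumed model id
  modelParameters: { temperature: 0 },
  input: { prompt: promptText, job: testCase.posting },
});

const result = await classifyJob(promptText, testCase.posting);

generation.end({
  output: result.object,                       // the validated structured object
  usage: {
    promptTokens: result.usage?.promptTokens,  // exact usage keys depend on the SDK version
    completionTokens: result.usage?.completionTokens,
  },
});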


Prompt governance: Langfuse prompts + fallback behavior

Your harness fetches a prompt by name and label:

  • name: job-classifier
  • label: production

If prompt retrieval fails or is disabled (e.g., SKIP_LANGFUSE_PROMPTS=true), it uses a local fallback prompt.
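A sketch of that retrieval-with-fallback logic; FALLBACK_PROMPT is the harness's local constant, and the getPrompt options shown here may differ across SDK versions.

async function loadClassifierPrompt(): Promise<string> {
  if (process.env.SKIP_LANGFUSE_PROMPTS === "true") return FALLBACK_PROMPT;
  try {
    const prompt = await langfuse.getPrompt("job-classifier", undefined, { label: "production" });
    return prompt.prompt;                      // raw text; use prompt.compile(vars) for templated prompts
  } catch (err) {
    console.warn("Prompt fetch failed, using local fallback", err);
    return FALLBACK_PROMPT;
  }
}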

Observability tip: always record the effective prompt identity

To compare runs, you want "which exact prompt did this use?" in trace metadata. If your prompt fetch returns versioning info, store:

  • promptName
  • promptLabel
  • promptVersion or promptId or promptHash

If it does not return version info, you can compute a stable hash of the prompt text and store that (lightweight, extremely useful).
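For example, a short SHA-256 prefix is enough:

import { createHash } from "node:crypto";

// A short, stable identifier for the effective prompt text; store it as promptHash in trace metadata.
const promptHash = createHash("sha256").update(promptText).digest("hex").slice(0, 12);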


Structured output: Zod as an observability contract

The classifier returns:

  • isRemoteEU: boolean
  • confidence: "high" | "medium" | "low"
  • reason: string
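Expressed as a Zod schema (a sketch; the field names come from the list above):

import { z } from "zod";

export const remoteEuSchema = z.object({
  isRemoteEU: z.boolean(),
  confidence: z.enum(["high", "medium", "low"]),
  reason: z.string(),
});

export type RemoteEuClassification = z.infer<typeof remoteEuSchema>;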

Why structured output is observability, not just "parsing"

A strict schema:

  • removes ambiguity ("was that JSON-ish text or valid data?")
  • enables stable scoring and aggregation
  • prevents downstream drift as prompts change
  • improves triage because the same fields are always present

If you ever add fields like region, countryHints, remotePolicy, do it via schema extension and keep historical compatibility in your scorer.


The full eval lifecycle as a trace model

This is what you want stored per test case:

When a case fails, you should be able to answer in one click:

  • Which prompt version?
  • What input text exactly?
  • What output object exactly?
  • What scoring decision and why?
  • Was the model "confidently wrong"?

Scoring and metrics: accuracy is necessary but not sufficient

Your harness logs two scores:

  1. remote-eu-accuracy: A numeric score from your scorer. This can be binary (0/1) or continuous (0..1). Continuous is often better because it supports partial credit and more informative trend analysis.

  2. confidence-match: A binary score (1/0) tracking whether the model's confidence matches expected confidence.
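A sketch of publishing both scores on the test case's trace, with the scorer context carried in the comment (see the tip below). trace, expected, actual, and accuracy are the per-case values computed by the scorer.

trace.score({
  name: "remote-eu-accuracy",
  value: accuracy,                             // 0/1 or 0..1
  comment: accuracy === 1
    ? undefined
    : `expected isRemoteEU=${expected.isRemoteEU}, got ${actual.isRemoteEU}: ${actual.reason}`,
});

trace.score({
  name: "confidence-match",
  value: actual.confidence === expected.confidence ? 1 : 0,
});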

Observability tip: store scorer metadata as the comment (or trace metadata)

A score without context is hard to debug. For incorrect cases, write comments like:

  • expected vs actual isRemoteEU
  • expected vs actual confidence
  • a short reason ("Predicted EU-only due to 'EMEA' but posting says US time zones")

Also consider storing structured metadata (if your Langfuse SDK supports it) so you can filter/group later.


Run-level grouping: session IDs as your "eval run" primitive

A sessionId = eval-${Date.now()} groups the whole batch. This enables:

  • "show me all traces from the last run"
  • comparisons across runs
  • slicing by prompt version across sessions
  • CI links that land you on the run dashboard

Recommendation: include additional stable tags:

  • gitSha, branch, ciBuildId (if running in CI)
  • model and promptVersion (for quick comparisons)

Even if you don't have them now, design the metadata schema so adding them later doesn't break anything.


Mermaid: evaluation flow, sequence, and data model (together)

1) Flow: control plane of the batch run

2) Sequence: what actually happens per case

3) Data model: eval artifacts


How to run (and make it debuggable in one click)

Environment variables

Required:

  • LANGFUSE_SECRET_KEY
  • LANGFUSE_PUBLIC_KEY
  • LANGFUSE_BASE_URL
  • DEEPSEEK_API_KEY

Optional:

  • SKIP_LANGFUSE_PROMPTS=true (use local prompt fallback)

Run:

pnpm tsx scripts/eval-remote-eu-langfuse.ts

Local prompt fallback:

SKIP_LANGFUSE_PROMPTS=true pnpm tsx scripts/eval-remote-eu-langfuse.ts

Observability tip: print a stable "run header"

In console output (and CI logs), it helps to print:

  • sessionId
  • model name
  • prompt version/hash
  • total test cases

That turns logs into an index into Langfuse.


Debugging workflow: from CI failure to root cause

When accuracy drops below threshold and CI fails, you want a deterministic workflow:

  1. Open the Langfuse session for the run (grouped by sessionId)

  2. Filter traces where remote-eu-accuracy = 0 (or below some threshold)

  3. For each failing trace:

    • check prompt version/hash
    • check job posting input text (location phrasing is often the culprit)
    • inspect structured output (especially confidence)
    • read the reason for the scorer's decision

Practical tips & gotchas (observability edition)

1) Always flush telemetry

If you exit early, you can lose the most important traces. Ensure flushAsync() happens even on errors (e.g., in a finally block) and only exit after flush completes.
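The minimal pattern, assuming a hypothetical runEvalBatch() entrypoint around the per-case loop:

try {
  await runEvalBatch();
} finally {
  await langfuse.flushAsync();                 // guarantee delivery even when a case throws
  // only exit (or call process.exit) after the flush has resolved
}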

2) Don't parallelize blindly

Parallel execution improves speed but can:

  • amplify rate limits
  • introduce noisy latency
  • create non-deterministic output ordering in logs

If you do parallelize, use bounded concurrency and capture per-case timing.
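A sketch of bounded concurrency using p-limit (an assumption; any semaphore works), with per-case timing captured alongside; runTestCase is a hypothetical per-case runner.

import pLimit from "p-limit";

const limit = pLimit(3);                       // at most 3 cases in flight

await Promise.all(
  testCases.map((tc) =>
    limit(async () => {
      const started = Date.now();
      await runTestCase(tc);
      console.log(`${tc.id} took ${Date.now() - started}ms`);
    }),
  ),
);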

3) Track prompt identity, not just prompt text

Prompt text alone is hard to compare across runs. Record version/hash so you can correlate changes with performance.

4) Separate "correctness" from "calibration"

A model can get higher accuracy while becoming confidently wrong on edge cases. Keeping confidence-match (or richer calibration metrics later) prevents hidden regressions.

5) Add slice metrics before you add more test cases

Instead of only "overall accuracy," compute accuracy by category:

  • "EU-only"
  • "Worldwide remote"
  • "EMEA" phrasing
  • "Hybrid" / "On-site"
  • "Contractor / employer-of-record constraints"

This reveals what's actually breaking when a prompt changes.


Suggested next upgrades (high leverage)

A) Add latency and cost proxies

Record:

  • duration per generation span (ms)
  • token totals per case

Then you can chart:

  • cost/latency vs accuracy
  • regressions where prompt got longer but not better

B) Add a "reason quality" score (optional, small rubric)

Create a third score like reason-quality to detect when explanations degrade (too vague, irrelevant, or missing key constraints). Keep it light—don't overfit to phrasing.

C) Prompt A/B within the same run

Evaluate production vs candidate prompts on the same test set:

  • two sessions (or two labels within one session)
  • compare metrics side-by-side in Langfuse

Docusaurus note: Mermaid support

If Mermaid isn't rendering, enable it in Docusaurus:

// docusaurus.config.js
const config = {
  markdown: { mermaid: true },
  themes: ["@docusaurus/theme-mermaid"],
};
module.exports = config;

The takeaway: observability is the eval superpower

A well-instrumented eval harness makes improvements measurable and regressions explainable:

  • traces turn examples into clickable evidence
  • structured outputs stabilize scoring
  • session IDs make runs comparable
  • multiple metrics prevent hidden failure modes

If you treat evals like production requests—with traces, spans, and scores—you'll iterate faster and break less.

Schema-First RAG with Eval-Gated Grounding and Claim-Card Provenance

· 7 min read
Vadim Nicolai
Senior Software Engineer

This article documents a production-grade architecture for generating research-grounded therapeutic content. The system prioritizes verifiable artifacts (papers → structured extracts → scored outputs → claim cards) over unstructured text.

You can treat this as a “trust pipeline”: retrieve → normalize → extract → score → repair → persist → generate.