
AI-Powered Skill Extraction with Cloudflare Embeddings and a Vector Taxonomy

4 min read
Vadim Nicolai
Senior Software Engineer

This bulk processor extracts structured skill tags for job postings using an AI pipeline that combines:

  • Embedding generation via Cloudflare Workers AI (@cf/baai/bge-small-en-v1.5, 384-dim)
  • Vector retrieval over a skills taxonomy (Turso/libSQL index skills_taxonomy) for candidate narrowing
  • Mastra workflow orchestration for LLM-based structured extraction + validation + persistence
  • Production-grade run controls: robust logging, progress metrics, graceful shutdown, and per-item failure isolation

It’s designed for real-world runs where you expect rate limits and transient failures, and need safe restarts.


Core constraint: embedding dimension ↔ vector index schema

The taxonomy retrieval layer is backed by a Turso/libSQL vector index:

  • Index name: skills_taxonomy
  • Embedding dimension (required): 384
  • Embedding model: @cf/baai/bge-small-en-v1.5 (384-dim)

If the index dimension isn’t 384, vector search can fail or degrade into meaningless similarity scores.
The script prevents this by validating stats.dimension === 384 before processing.
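
A minimal sketch of that guard, assuming a hypothetical getIndexStats helper that reads the index metadata (the real script may obtain these stats differently):

```typescript
// Hypothetical stats shape and helper; the real script may read index
// metadata through its vector-store layer instead.
type IndexStats = { dimension: number; count: number };
declare function getIndexStats(indexName: string): Promise<IndexStats>;

const EXPECTED_DIMENSION = 384; // must match @cf/baai/bge-small-en-v1.5

async function assertIndexDimension(indexName: string): Promise<void> {
  const stats = await getIndexStats(indexName);
  if (stats.dimension !== EXPECTED_DIMENSION) {
    throw new Error(
      `Vector index "${indexName}" has dimension ${stats.dimension}, expected ${EXPECTED_DIMENSION}. ` +
        "Recreate and re-seed the index before running extraction."
    );
  }
}

await assertIndexDimension("skills_taxonomy");
```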


Architecture overview (pipeline flow)

Job postings → embedding generation (Cloudflare Workers AI) → vector retrieval over skills_taxonomy (Turso/libSQL) → Mastra workflow extraction + validation → persistence into job_skill_tags.

Retrieval + extraction: what happens per job

1) Retrieval: candidate narrowing via vector search

  • Convert relevant job text to embeddings using Cloudflare Workers AI.
  • Use vector similarity search in skills_taxonomy to retrieve top-N candidate skills.
  • Candidates constrain the downstream LLM step (better precision, lower cost).
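
A sketch of this retrieval step; embedJobText and queryTaxonomy are hypothetical helpers (the embedding call itself is sketched in the Workers AI section below), and TOP_N is illustrative:

```typescript
// Hypothetical helpers: embedJobText wraps the Workers AI call (sketched in
// the embeddings section below); queryTaxonomy runs a top-N similarity search
// against the skills_taxonomy index.
declare function embedJobText(text: string): Promise<number[]>; // 384-dim vector
declare function queryTaxonomy(
  vector: number[],
  topN: number
): Promise<Array<{ skillId: string; name: string; score: number }>>;

const TOP_N = 25; // illustrative; tune for precision vs. recall

async function retrieveCandidateSkills(job: { title: string; description: string }) {
  // Embed the job text most likely to mention skills.
  const vector = await embedJobText(`${job.title}\n\n${job.description}`);

  // Candidate skills from the taxonomy; these constrain the downstream LLM step.
  return queryTaxonomy(vector, TOP_N);
}
```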

2) Extraction: structured inference via Mastra workflow

A cached Mastra workflow (extractJobSkillsWorkflow) performs:

  • prompt + schema-driven extraction
  • normalization (matching to taxonomy terms/ids)
  • validation (reject malformed outputs)
  • persistence into job_skill_tags

On failure, the script logs workflow status and step details for debugging.
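
A sketch of how the orchestration and failure logging fit together; runExtractJobSkillsWorkflow is a hypothetical wrapper around the cached Mastra workflow, and the result shape (status plus per-step details) is an assumption based on the description above:

```typescript
// Hypothetical wrapper around the cached Mastra workflow; the exact invocation
// and result shape depend on the Mastra version, so treat this as an assumption.
type WorkflowResult = {
  status: "success" | "failed";
  steps: Record<string, { status: string; error?: unknown }>;
};
declare function runExtractJobSkillsWorkflow(input: {
  jobId: string;
  candidateSkills: Array<{ skillId: string; name: string }>;
}): Promise<WorkflowResult>;

async function extractSkillsForJob(
  jobId: string,
  candidateSkills: Array<{ skillId: string; name: string }>
): Promise<boolean> {
  const result = await runExtractJobSkillsWorkflow({ jobId, candidateSkills });

  if (result.status !== "success") {
    // Log workflow status plus per-step details so failures are debuggable.
    console.error(`Workflow failed for job ${jobId}: status=${result.status}`);
    for (const [stepId, step] of Object.entries(result.steps)) {
      console.error(`  step=${stepId} status=${step.status}`, step.error ?? "");
    }
    return false;
  }
  return true; // extracted skills were validated and persisted into job_skill_tags
}
```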


Cloudflare Workers AI embeddings

Model contract and hardening

  • Model: @cf/baai/bge-small-en-v1.5
  • Vectors: 384 dimensions
  • Input contract: strict array of strings
  • Timeout: 45s (AbortController)
  • Output contract: explicit response shape checks (fail early on unexpected payloads)

This is important because embedding pipelines can silently drift if the response shape changes or inputs are malformed.
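
A sketch of a hardened embedding call over the Workers AI REST API. The CF_ACCOUNT_ID and CF_API_TOKEN env var names are assumptions, and the response-shape check is intentionally strict so unexpected payloads fail early rather than drifting silently:

```typescript
// Hardened embedding call over the Workers AI REST API. CF_ACCOUNT_ID and
// CF_API_TOKEN are assumed env var names; the strict checks fail early on
// malformed inputs or unexpected payloads instead of letting them drift through.
const MODEL = "@cf/baai/bge-small-en-v1.5";
const EMBEDDING_DIM = 384;
const EMBEDDING_TIMEOUT_MS = 45_000;

async function embedTexts(texts: string[]): Promise<number[][]> {
  // Input contract: a strict, non-empty array of non-empty strings.
  if (!Array.isArray(texts) || texts.length === 0 || texts.some((t) => typeof t !== "string" || t.length === 0)) {
    throw new Error("embedTexts expects a non-empty array of non-empty strings");
  }

  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), EMBEDDING_TIMEOUT_MS); // 45s cap
  try {
    const res = await fetch(
      `https://api.cloudflare.com/client/v4/accounts/${process.env.CF_ACCOUNT_ID}/ai/run/${MODEL}`,
      {
        method: "POST",
        headers: {
          Authorization: `Bearer ${process.env.CF_API_TOKEN}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({ text: texts }),
        signal: controller.signal,
      }
    );
    if (!res.ok) throw new Error(`Workers AI returned HTTP ${res.status}`);

    // Output contract: expect { result: { data: number[][] } } with 384-dim rows.
    const payload = (await res.json()) as { result?: { data?: number[][] } };
    const vectors = payload.result?.data;
    if (!Array.isArray(vectors) || vectors.some((v) => !Array.isArray(v) || v.length !== EMBEDDING_DIM)) {
      throw new Error("Unexpected embedding response shape from Workers AI");
    }
    return vectors;
  } finally {
    clearTimeout(timeout);
  }
}
```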

Dimension enforcement (non-negotiable)

If skills_taxonomy was created or seeded with a different dimension, similarity search becomes invalid (best case: errors; worst case: plausible-but-wrong matches).

The script enforces stats.dimension === 384 to keep retrieval semantically meaningful.


Turso/libSQL vector taxonomy index

  • Storage: Turso (libSQL)
  • Index: skills_taxonomy
  • Schema dimension: 384
  • Role: retrieval layer for skills ontology/taxonomy

The script also ensures the index is populated (count > 0), otherwise it fails fast and directs you to seed.
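
Reusing the hypothetical getIndexStats helper from the dimension check, the populated check might look like this:

```typescript
// Fail fast when the taxonomy index exists but was never seeded.
async function assertIndexPopulated(indexName: string): Promise<void> {
  const stats = await getIndexStats(indexName); // hypothetical helper from the dimension check
  if (stats.count === 0) {
    throw new Error(
      `Vector index "${indexName}" is empty. Seed the skills taxonomy before running extraction.`
    );
  }
}

await assertIndexPopulated("skills_taxonomy");
```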


Reliability and operational controls

Observability: console + file tee logs

  • tees console.log/warn/error to a timestamped file and the terminal
  • log naming: extract-job-skills-<ISO timestamp>-<pid>.log
  • degrades to console-only logging if file I/O fails
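
A sketch of the tee pattern (the log directory and the timestamp sanitization are details of this sketch):

```typescript
import fs from "node:fs";
import path from "node:path";

// Tee console.log/warn/error to a timestamped file and the terminal,
// degrading to console-only output if file I/O fails.
function setupTeeLogging(logDir = "logs"): void {
  // extract-job-skills-<ISO timestamp>-<pid>.log, with ":" and "." sanitized for the filesystem.
  const fileName = `extract-job-skills-${new Date().toISOString().replace(/[:.]/g, "-")}-${process.pid}.log`;

  let stream: fs.WriteStream;
  try {
    fs.mkdirSync(logDir, { recursive: true });
    stream = fs.createWriteStream(path.join(logDir, fileName), { flags: "a" });
  } catch {
    console.warn("File logging unavailable; continuing with console-only output");
    return;
  }

  for (const level of ["log", "warn", "error"] as const) {
    const original = console[level].bind(console);
    console[level] = (...args: unknown[]) => {
      original(...args); // keep normal terminal output
      stream.write(`[${level}] ${args.map(String).join(" ")}\n`); // simple stringification for the file copy
    };
  }
}
```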

Graceful termination

  • SIGINT / SIGTERM sets a shouldStop flag
  • the loop exits after the current job completes
  • avoids interrupting in-flight workflow steps (embedding/LLM/DB writes)
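
A minimal sketch of the signal handling:

```typescript
// Graceful termination: set a flag on SIGINT/SIGTERM and let the main loop
// exit after the current job, so in-flight workflow steps are not interrupted.
let shouldStop = false;

for (const signal of ["SIGINT", "SIGTERM"] as const) {
  process.on(signal, () => {
    console.warn(`${signal} received; stopping after the current job`);
    shouldStop = true;
  });
}
```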

Idempotency / restart safety

Even though the initial query selects only jobs without existing tags, the script re-checks each job immediately before processing:

  • jobAlreadyHasSkills(jobId)

This avoids duplicate inference when:

  • you restart mid-run
  • multiple workers run concurrently
  • the initial query snapshot becomes stale
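
A sketch of that re-check using the @libsql/client driver; the job_id column name and the env var names are assumptions:

```typescript
import { createClient } from "@libsql/client";

// Turso/libSQL client; the env var names are assumptions for this sketch.
const db = createClient({
  url: process.env.TURSO_DATABASE_URL!,
  authToken: process.env.TURSO_AUTH_TOKEN,
});

// Re-check right before processing. The job_id column name is an assumption;
// job_skill_tags is the persistence table described above.
async function jobAlreadyHasSkills(jobId: string): Promise<boolean> {
  const result = await db.execute({
    sql: "SELECT COUNT(*) AS n FROM job_skill_tags WHERE job_id = ?",
    args: [jobId],
  });
  return Number(result.rows[0]?.n ?? 0) > 0;
}
```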

Throughput shaping

  • sequential processing
  • a fixed 1s backoff between jobs (simple, reliable rate-limit mitigation)
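
Putting the earlier sketches together, the main loop could look roughly like this (it reuses shouldStop, jobAlreadyHasSkills, retrieveCandidateSkills, and extractSkillsForJob from the sketches above):

```typescript
// Sequential main loop: one job at a time, an idempotency re-check, per-job
// failure isolation, a graceful-stop check, and a fixed 1s pause between jobs.
const BACKOFF_MS = 1_000;
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function processJobs(jobs: Array<{ id: string; title: string; description: string }>) {
  let processed = 0;
  let failed = 0;
  let skipped = 0;

  for (const job of jobs) {
    if (shouldStop) break; // set by the SIGINT/SIGTERM handlers above

    try {
      if (await jobAlreadyHasSkills(job.id)) {
        skipped++; // another worker or a previous run already tagged this job
        continue;
      }
      const candidates = await retrieveCandidateSkills(job);
      const ok = await extractSkillsForJob(job.id, candidates);
      if (ok) processed++; else failed++;
    } catch (err) {
      failed++; // per-item failure isolation: log and move on
      console.error(`Job ${job.id} failed:`, err);
    }

    await sleep(BACKOFF_MS); // simple, reliable rate-limit mitigation
  }

  console.log(`Run finished: processed=${processed} skipped=${skipped} failed=${failed}`);
}
```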

Failure modes

Retrieval layer failures (index health)

Triggers:

  • index missing
  • dimension mismatch (not 384)
  • empty index (count === 0)

Behavior: fail fast with actionable logs (recreate index / re-seed / verify DB target).

Embedding timeouts

Symptom: embedding call exceeds 45s and aborts. Behavior: job fails; run continues.

Mitigations:

  • chunk long descriptions upstream
  • add retry/backoff on transient 429/5xx
  • monitor Workers AI service health
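
If you add the retry/backoff mitigation, one illustrative shape is a small wrapper around the embedding call that retries only on 429/5xx responses (the policy values are illustrative):

```typescript
// Illustrative retry wrapper: retries only transient Workers AI failures
// (HTTP 429 / 5xx, as surfaced by the embedTexts error messages above).
async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 3, baseDelayMs = 500): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      const message = err instanceof Error ? err.message : String(err);
      const transient = /HTTP (429|5\d\d)/.test(message);
      if (!transient || attempt === maxAttempts) throw err;
      const delayMs = baseDelayMs * 2 ** (attempt - 1); // exponential backoff: 500ms, 1s, 2s, ...
      console.warn(`Transient embedding failure (attempt ${attempt}/${maxAttempts}); retrying in ${delayMs}ms`);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError; // unreachable, but keeps the compiler satisfied
}

// Usage: const vectors = await withRetry(() => embedTexts([jobText]));
```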

Workflow failures

Behavior: job is marked failed; run continues. Logs include step trace and error payload to accelerate debugging.


Quick reference

  • Embeddings: Cloudflare Workers AI @cf/baai/bge-small-en-v1.5 (384-dim)
  • Retrieval: Turso/libSQL vector index skills_taxonomy (384-dim)
  • Orchestration: Mastra workflow extractJobSkillsWorkflow
  • Persistence: job_skill_tags
  • Embedding timeout: 45s
  • Stop behavior: graceful after current job (SIGINT / SIGTERM)