# Press: A Multi-Agent LangGraph Pipeline That Researches, Writes, and Publishes Technical Articles
Most AI writing tools produce first drafts. The problem is not the draft — it is everything around the draft. Research that hallucinates sources. SEO keywords stuffed into headers. Links that 404. Claims that cannot be traced back to any paper. An editor who cannot explain why the article fails. A publish step that requires manual copy-paste into a CMS. Each gap is a manual step, and manual steps do not scale.
Press is a LangGraph pipeline that eliminates every manual step between "I have a topic" and "the article is live on the blog." It coordinates 10 specialized agents across four sub-pipelines — article, blog, counter-article, and review — each backed by DeepSeek Reasoner for heavy reasoning and DeepSeek Chat for fast tasks. A single press article --topic "..." --publish command triggers research across 4 academic APIs and 16 editorial publications, writes a journalism-grade article, validates every hyperlink for reachability and anchor quality, runs 7 automated quality evaluations, and publishes directly to a Docusaurus blog with a git commit and Vercel deploy.
This article documents the architecture, the agent design, and the quality enforcement mechanisms that make the pipeline produce articles that pass editorial review without human intervention.
## Architecture: One Orchestrator, Four Sub-Graphs
The main graph is a StateGraph with a single conditional edge at the entry point. A route_pipeline function reads the pipeline field from the input state and dispatches to one of four compiled sub-graphs:
```python
def route_pipeline(state: PressState) -> str:
    pipeline = state.get("pipeline", "")
    if pipeline not in _PIPELINE_NAMES:
        available = ", ".join(_PIPELINE_NAMES)
        raise ValueError(
            f"Unknown pipeline {pipeline!r}. "
            f"Set 'pipeline' to one of: {available}"
        )
    return pipeline
```
Each sub-graph is a self-contained StateGraph with its own input schema, internal state, and node implementations. The orchestrator delegates via ainvoke and collects the result. This means each pipeline can be developed, tested, and invoked independently — press article bypasses the orchestrator entirely and calls build_article_graph directly.
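The dispatch pattern can be sketched without LangGraph itself: a registry of sub-graph entry points keyed by pipeline name, with the router validating before delegation. This is a minimal stand-in; the `_run_stub` callable and the state keys are illustrative, not the pipeline's real sub-graphs:

```python
import asyncio

_PIPELINE_NAMES = ("article", "blog", "counter", "review")

async def _run_stub(name: str, state: dict) -> dict:
    # Stand-in for a compiled sub-graph's ainvoke.
    return {"pipeline": name, "result": f"{name} ran on {state['topic']}"}

SUBGRAPHS = {name: _run_stub for name in _PIPELINE_NAMES}

def route_pipeline(state: dict) -> str:
    pipeline = state.get("pipeline", "")
    if pipeline not in _PIPELINE_NAMES:
        available = ", ".join(_PIPELINE_NAMES)
        raise ValueError(
            f"Unknown pipeline {pipeline!r}. Set 'pipeline' to one of: {available}"
        )
    return pipeline

async def orchestrate(state: dict) -> dict:
    # Validate first, then delegate to the matching sub-graph.
    name = route_pipeline(state)
    return await SUBGRAPHS[name](name, state)

result = asyncio.run(orchestrate({"pipeline": "blog", "topic": "AI infra"}))
```

Because each entry in the registry is independently callable, any sub-graph can be invoked directly in tests without going through the orchestrator.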
## The Article Pipeline: Eight Nodes, Two Modes
The article pipeline is the core of Press. It operates in two modes from the same graph structure — the nodes adapt their behavior based on whether source material is present in state:
- Journalism mode (`press article --topic "..."`): editorial search + SEO, then a 1200-1800 word article
- Deep-dive mode (`press article --topic "..." --input notes.md`): reads source material, searches 4 academic databases in parallel, then a 2500-3500 word long-form article
The graph flows through eight nodes: read_source (no-op in journalism mode), research_and_seo, write, xyflow (diagram generation), check_references, edit, and then a conditional edge that routes to publish, revise, or save_final.
The conditional routing after the editor is the key quality gate:
```python
def should_revise_simple(state: dict) -> str:
    if state.get("approved"):
        return "publish"
    if state.get("revision_rounds", 0) >= MAX_REVISIONS:
        return "save_final"
    return "revise"
```
If the editor approves, the article goes to publish. If not, the writer gets one revision round — the editor's full feedback is injected as context alongside the original research brief and the previous draft. If revision fails to satisfy the editor, the article is saved to drafts/ for manual intervention. The cap at one revision is intentional: in practice, a second LLM revision rarely fixes what the first one missed, and it risks introducing new problems.
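The revision context described above can be sketched as plain prompt assembly; the section labels here are illustrative, not the pipeline's actual prompt format:

```python
def build_revision_prompt(research_brief: str, previous_draft: str,
                          editor_feedback: str) -> str:
    """Bundle everything the writer needs for its single revision pass."""
    return "\n\n".join([
        "## Research brief\n" + research_brief,
        "## Previous draft\n" + previous_draft,
        "## Editor feedback (address every point)\n" + editor_feedback,
    ])

prompt = build_revision_prompt("brief...", "draft...", "fix the lead")
```

Keeping the brief and the prior draft in the same prompt matters: feedback alone invites the writer to rewrite from scratch, losing claims that were already well sourced.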
## Model Routing: Two Models, Three Roles
The entire pipeline runs on two DeepSeek models. A ModelPool maps three logical roles to two physical models:
```python
class TeamRole(Enum):
    REASONER = "reasoner"  # deepseek-reasoner
    FAST = "fast"          # deepseek-chat
    REVIEWER = "reviewer"  # deepseek-reasoner

class ModelPool:
    def for_role(self, role: TeamRole) -> ChatOpenAI:
        if role in (TeamRole.REASONER, TeamRole.REVIEWER):
            return self.reasoner
        return self.fast
```
Reasoner handles research synthesis, writing, editing, and revision — tasks where chain-of-thought reasoning directly improves output quality. Fast handles SEO discovery, SEO blueprints, topic scouting, topic picking, diagram generation, and report synthesis — tasks where speed matters more than depth. The reviewer role maps to the same reasoner model because editorial review requires the same reasoning depth as writing.
Each agent is a thin wrapper that sends a system prompt and user message, with exponential-backoff retry:
```python
class Agent:
    def __init__(self, name: str, system_prompt: str, model: ChatOpenAI):
        self.name = name
        self.system_prompt = system_prompt
        self.model = model

    async def run(self, input_text: str) -> str:
        messages = [
            SystemMessage(content=self.system_prompt),
            HumanMessage(content=input_text),
        ]
        for attempt in range(MAX_RETRIES):
            try:
                response = await self.model.ainvoke(messages)
                if response.content:
                    return response.content
            except Exception:
                if attempt < MAX_RETRIES - 1:
                    await asyncio.sleep(2**attempt)
                    continue
                raise
        # Persistently empty responses exhaust retries too.
        raise RuntimeError(f"{self.name}: empty response after {MAX_RETRIES} attempts")
```
No tool use, no function calling, no structured output parsing. The agents communicate through markdown text passed via graph state. This keeps the agent implementation trivial and the debugging surface small — every agent input and output is a readable string you can inspect in the articles/research/ and articles/drafts/ directories.
## Research Phase: Four Academic APIs in Parallel
Deep-dive mode triggers a parallel search across four academic databases: Semantic Scholar, OpenAlex, Crossref, and CORE. Each client is a thin async wrapper around the respective REST API.
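Assuming each client exposes an async `search(query)` returning a list of papers, the fan-out reduces to a single `asyncio.gather` with per-client error isolation. The stub clients below are illustrative, not the real API wrappers:

```python
import asyncio

async def _stub_client(name: str, query: str) -> list[dict]:
    # Stand-in for a real API client; "core" simulates a flaky database.
    if name == "core":
        raise TimeoutError("simulated outage")
    return [{"source": name, "title": f"{query} ({name})"}]

async def search_all(query: str) -> list[dict]:
    names = ["semantic_scholar", "openalex", "crossref", "core"]
    results = await asyncio.gather(
        *[_stub_client(n, query) for n in names],
        return_exceptions=True,  # one database failing must not kill the run
    )
    papers: list[dict] = []
    for r in results:
        if isinstance(r, Exception):
            continue  # log-and-skip in a real pipeline
        papers.extend(r)
    return papers

papers = asyncio.run(search_all("mixture of experts"))
```

`return_exceptions=True` is the load-bearing detail: a timeout from one database degrades recall instead of aborting the whole research phase.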
Query expansion generates two variants — the original topic and a shorter 4-word keyword extraction — to maximize recall across databases with different search semantics:
```python
def _expand_query(query: str) -> list[str]:
    queries = [query]
    words = query.split()
    if len(words) > 4:
        short = " ".join(words[:4])
        queries.append(short)
    return queries
```
The results from all databases and all query variants are pooled, then deduplicated by normalized title and DOI. Ranking combines citation count with a recency boost — papers up to a year old get +50 points, two-year-old papers +30, three-year-old papers +15, and papers up to five years old +5:
```python
def _score_paper(paper: ResearchPaper) -> float:
    citations = paper.citation_count or 0
    recency_boost = 0.0
    if paper.year:
        age = CURRENT_YEAR - paper.year
        if age <= 1:
            recency_boost = 50.0
        elif age <= 2:
            recency_boost = 30.0
        elif age <= 3:
            recency_boost = 15.0
        elif age <= 5:
            recency_boost = 5.0
    return citations + recency_boost
```
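The title-and-DOI deduplication can be sketched as a two-key pass: a lowercased DOI when present, plus a normalized title (lowercased, punctuation stripped). This is a sketch of the idea, not the pipeline's exact normalization:

```python
import re

def _norm_title(title: str) -> str:
    # Lowercase and strip punctuation so trivial variants collide.
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()

def dedupe(papers: list[dict]) -> list[dict]:
    seen_doi: set[str] = set()
    seen_title: set[str] = set()
    unique = []
    for p in papers:
        doi = (p.get("doi") or "").lower()
        title = _norm_title(p.get("title", ""))
        if (doi and doi in seen_doi) or (title and title in seen_title):
            continue
        if doi:
            seen_doi.add(doi)
        if title:
            seen_title.add(title)
        unique.append(p)
    return unique

pool = [
    {"title": "Scaling Laws", "doi": "10.1/abc"},
    {"title": "Scaling Laws.", "doi": ""},           # duplicate title
    {"title": "Different Paper", "doi": "10.1/ABC"}, # duplicate DOI
]
unique = dedupe(pool)
```

Checking both keys matters because the same paper often arrives from one database with a DOI and from another without one.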
The top 10 papers are formatted into a markdown digest that the researcher agent synthesizes into a structured research brief. In journalism mode, the paper search is skipped — only editorial sources (16 publications including Towards Data Science, InfoQ, The New Stack, KDnuggets, and others) are searched and fed to the researcher.
## Reference Quality: Not Just Broken Links
Most link checkers answer a binary question: is the URL reachable? Press answers four questions about every [anchor](url) citation in the draft:
- Is it reachable? HEAD request with GET fallback, lenient on bot-blocking domains (arxiv, LinkedIn, Twitter), retry on 429
- Is the anchor text descriptive? "here", "this", "click", "source" are flagged as weak — the anchor should describe what the reader will find
- What tier is the domain? Three tiers: authoritative (arxiv, nature.com, github.com, openai.com), credible (stackoverflow, techcrunch, medium), and generic
- Is citation density sufficient? Target is 1 citation per 300 words; articles with fewer inline references are flagged
```python
@dataclass
class ReferenceReport:
    refs: list[ReferenceResult]
    bare_urls: list[str]
    word_count: int

    # total, broken, weak_anchors, and authoritative are derived
    # properties, elided here.

    @property
    def score(self) -> float:
        if self.total == 0:
            return 0.0
        broken_pen = len(self.broken) / self.total
        weak_pen = len(self.weak_anchors) / self.total * 0.4
        auth_bonus = len(self.authoritative) / self.total * 0.3
        count_score = min(1.0, self.total / 5) * 0.4
        return max(0.0, min(1.0,
            (1.0 - broken_pen - weak_pen + auth_bonus) * 0.6 + count_score
        ))
```
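The per-reference checks that feed this report are plain string work; a minimal sketch of the anchor-quality and density checks (the weak-word list mirrors the one described above, the helper name is illustrative):

```python
import re

WEAK_ANCHORS = {"here", "this", "click", "source"}
WORDS_PER_CITATION = 300  # target density: 1 citation per 300 words

def check_citations(markdown: str) -> dict:
    refs = re.findall(r"\[([^\]]+)\]\((https?://[^)]+)\)", markdown)
    weak = [anchor for anchor, _url in refs
            if anchor.strip().lower() in WEAK_ANCHORS]
    word_count = len(markdown.split())
    needed = max(1, word_count // WORDS_PER_CITATION)
    return {
        "total": len(refs),
        "weak_anchors": weak,
        "low_density": len(refs) < needed,
    }

report = check_citations(
    "See [here](https://example.com) and "
    "[the GPT-4 report](https://arxiv.org/abs/2303.08774)."
)
```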
The reference report is injected into the editor's context as a mandatory section. If there are broken links, weak anchors, or low citation density, the editor sees them as issues that must be addressed before approving. This creates a hard quality gate: articles with zero inline citations are blocked from deployment even if the editor says "APPROVE."
```python
has_no_refs = any("no_inline_refs" in i for i in report.issues)
if has_no_refs:
    logger.error(
        "BLOCKED deploy: article has zero inline [anchor](url) citations."
    )
```
## Seven Automated Quality Evaluations
The press eval command and the review pipeline run seven GEval metrics powered by deepeval, each with a minimum threshold:
| Metric | What it checks | Threshold |
|---|---|---|
| source_citation | Every factual claim links to a primary source | 0.70 |
| anti_hallucination | No facts invented outside the research brief | 0.70 |
| writing_quality | Active voice, no filler, sentences under 25 words | 0.60 |
| journalistic_standards | Inverted pyramid, named attribution, balance, no hype | 0.70 |
| seo_alignment | Primary keyword in H1 and first 100 words, meta description | 0.60 |
| structural_completeness | Frontmatter, 4+ H2 sections, counterargument, 1200-1800w | 0.70 |
| lead_quality | Opening hook is specific, immediate, no cliches | 0.60 |
The anti-hallucination metric is the most valuable. It compares every factual claim in the article against the research brief:
A claim is HALLUCINATED if it is absent from the research brief entirely, inflates or distorts a figure from the brief (e.g., brief says 34%, article says 41%), presents a "Needs Verification" item as an established fact, or attributes a finding to a study not mentioned in the brief.
Each metric runs as an independent async task. The eval model defaults to deepseek-chat but falls back to gpt-4o-mini if no DeepSeek key is available:
```python
def _configure_eval_llm() -> str:
    if dk := os.getenv("DEEPSEEK_API_KEY"):
        if not os.getenv("OPENAI_API_KEY"):
            os.environ["OPENAI_API_KEY"] = dk
            os.environ["OPENAI_BASE_URL"] = "https://api.deepseek.com/v1"
        return os.getenv("EVAL_MODEL", "deepseek-chat")
    return os.getenv("EVAL_MODEL", "gpt-4o-mini")
```
## The Review Pipeline: Publication Fit Scoring
The review pipeline is a standalone graph for evaluating existing drafts without writing new content. It runs five nodes with a fan-out/fan-in pattern:
After reading files and checking references, two independent tasks execute in parallel: publication fit scoring (against 20 niche publications) and the 7-metric automated eval. Both results feed into an editorial review node, which produces a final assessment informed by all available signals. A reporter agent synthesizes everything into a single review report saved to disk.
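Assuming the fit scorer and the eval runner are plain async functions, the fan-out/fan-in step reduces to one `gather` whose results are merged into the editor's context. The function names and stand-in results below are illustrative:

```python
import asyncio

async def score_publication_fit(draft: str) -> dict:
    return {"best_fit": "InfoQ", "score": 0.82}  # stand-in result

async def run_quality_evals(draft: str) -> dict:
    return {"anti_hallucination": 0.91}          # stand-in result

async def review_fan_out(draft: str) -> dict:
    # Fan-out: both tasks run concurrently.
    fit, evals = await asyncio.gather(
        score_publication_fit(draft),
        run_quality_evals(draft),
    )
    # Fan-in: both signals become context for the editorial review node.
    return {"publication_fit": fit, "eval_scores": evals}

ctx = asyncio.run(review_fan_out("# Draft..."))
```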
The publication fit scorer evaluates the draft against 20 target publications — from AI-focused outlets (Weights & Biases, KDnuggets, ML Mastery) to developer platforms (InfoQ, The New Stack, DZone) to Medium publications (Towards Data Science, Better Programming). For each publication, it scores fit based on topic alignment, technical depth, writing style, and audience match.
## The Blog Pipeline: Scout, Pick, Fan-Out
The blog pipeline discovers topics rather than writing about a given one. Three nodes execute sequentially:
- Scout: finds 5 trending topics in a niche using DeepSeek Chat
- Picker: scores each topic on misconception potential, source availability, audience pain, and originality — selects the top N
- Process topics: fans out with `asyncio.gather`, running research + write + optional publish for each selected topic in parallel
```python
async def process_topics(state: BlogState) -> dict:
    # niche, pool, and count come from surrounding context (elided).
    selections = json.loads(strip_fences(state["picker_output"]))

    async def process_one(i: int, sel: dict) -> dict:
        topic = sel["topic"]
        researcher = Agent(
            f"researcher[{i}]",
            prompts.researcher(niche),
            pool.for_role(TeamRole.REASONER),
        )
        notes = await researcher.run(
            f"Topic: {topic}\nAngle: {sel.get('angle', '')}\n"
        )
        writer = Agent(
            f"writer[{i}]",
            prompts.writer(),
            pool.for_role(TeamRole.REASONER),
        )
        blog = await writer.run(notes)
        blog = await add_xyflow(pool, blog)
        return {"topic": topic, "slug": slugify(topic), "blog": blog}

    topics = await asyncio.gather(
        *[process_one(i, sel) for i, sel in enumerate(selections[:count])]
    )
    return {"topics": list(topics)}
```
## Publishing: Frontmatter, Git, Vercel
The publisher handles the last mile — transforming a markdown draft into a deployed blog post. It parses existing YAML frontmatter, builds a slug from the SEO strategy (falling back to the title), forces the date to today (LLM-generated dates are unreliable), writes the file to blog/{year}/{MM-DD-slug}/index.md, and optionally runs git commit && git push && vercel deploy --prod.
Pre-publish validation catches five common LLM publishing mistakes:
- Double frontmatter — the LLM wraps content in `---` blocks that nest inside the real frontmatter
- Empty description — meaningless or missing meta description
- Slug-word tags — tags that are just the slug split on hyphens instead of real topic labels
- Stale dates — dates copied from a source article instead of the publish date
- Missing inline links — articles with fewer than 3 hyperlinks are blocked
```python
def validate_before_publish(md: str) -> list[str]:
    issues: list[str] = []
    meta, body = parse_frontmatter(md)
    if body.lstrip().startswith("---"):
        issues.append("double_frontmatter")
    inline_links = re.findall(r"\[.+?\]\(https?://[^)]+\)", body)
    if len(inline_links) < 3:
        issues.append(f"few_inline_links: only {len(inline_links)}")
    return issues
## CLI: Six Commands
The full CLI surface:
```bash
# Write a journalism article (1200-1800w)
press article --topic "The real cost of AI inference at scale"

# Write a deep-dive with paper search (2500-3500w)
press article --topic "..." --input notes.md --publish --git-push

# Discover topics and write multiple posts
press blog --niche "AI infrastructure" --count 3 --publish

# Write a rebuttal against a source URL
press counter --url "https://..." --topic "Why the data disagrees"

# Review an existing draft
press review --input draft.md --research brief.md --seo seo.md

# Run quality evals standalone
press eval --input article.md --metrics source_citation,anti_hallucination
```
Every command writes intermediate artifacts to disk — research briefs, SEO strategies, drafts, revision notes, reference reports — so you can inspect exactly what each agent produced and debug failures without re-running the full pipeline.
## What Makes This Work
Three design decisions keep the pipeline reliable:
Text-only agent communication. No tool use, no JSON schemas, no structured output. Every agent reads markdown and writes markdown. The graph state carries strings. This eliminates an entire class of parsing failures and makes every intermediate artifact human-readable.
Hard quality gates. The reference checker blocks deployment of articles with zero citations regardless of what the editor says. The pre-publish validator catches double frontmatter and stale dates. These are not suggestions — they are runtime checks that prevent broken content from reaching production.
Capped revision loops. One revision round, then save to disk. The alternative — letting the writer and editor iterate indefinitely — converges on mediocrity as each round smooths out the specific claims and vivid language that made the draft worth reading. One targeted revision preserves voice; three rounds produce committee prose.
