Streaming OpenAI TTS to Cloudflare R2
This article documents a production implementation of OpenAI's Text-to-Speech (TTS) API with automatic chunking for long-form content and seamless upload to Cloudflare R2 storage.
Architecture Overview
The system provides two API entrypoints for audio generation:
- GraphQL Mutation (generateOpenAIAudio) — used by the main app for story audio
- REST API (/api/tts) — provides flexible streaming options and direct upload
Both endpoints support:
- Automatic text chunking for content exceeding 4000 characters
- Audio merging for seamless playback of long content
- Cloudflare R2 upload with public CDN URLs
- Base64 fallback for immediate playback while uploading
- Metadata tracking (duration, voice, model, etc.)
OpenAI TTS Integration
Voice Selection
Defaults to onyx but supports all OpenAI TTS voices:
- alloy, ash, ballad, coral, echo, fable
- onyx (default), nova, sage, shimmer
- verse, marin, cedar
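When the voice arrives as a plain string (for example from a REST request body), it helps to narrow it against the supported set before calling the API. A minimal sketch, not part of the original code; SUPPORTED_VOICES and resolveVoice are illustrative names:

const SUPPORTED_VOICES = [
  "alloy", "ash", "ballad", "coral", "echo", "fable",
  "onyx", "nova", "sage", "shimmer", "verse", "marin", "cedar",
] as const;

type TTSVoice = (typeof SUPPORTED_VOICES)[number];

// Narrow an untrusted string to a supported voice, defaulting to onyx
function resolveVoice(input?: string): TTSVoice {
  return SUPPORTED_VOICES.includes(input as TTSVoice)
    ? (input as TTSVoice)
    : "onyx";
}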
Model Selection
Supports three models:
- gpt-4o-mini-tts (default) — fast, efficient, high quality
- tts-1 — standard quality
- tts-1-hd — high definition audio
Audio Formats
Supports multiple output formats:
- mp3 (default) — best compatibility
- opus, aac, flac, wav, pcm
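Format names do not always match MIME types (mp3 audio is conventionally served as audio/mpeg), so a small lookup is handy when setting upload and response headers. A sketch; contentTypeFor is an illustrative helper, not part of the original code:

// Map OpenAI response_format values to MIME types for HTTP headers and uploads
const FORMAT_MIME_TYPES: Record<string, string> = {
  mp3: "audio/mpeg",
  opus: "audio/opus",
  aac: "audio/aac",
  flac: "audio/flac",
  wav: "audio/wav",
  pcm: "audio/pcm", // raw PCM has no standard registered MIME type
};

function contentTypeFor(format: string): string {
  return FORMAT_MIME_TYPES[format] ?? "application/octet-stream";
}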
Text Chunking for Long Content
OpenAI TTS has a 4096 character limit. This implementation uses Mastra RAG’s recursive chunking strategy to intelligently split long text.
import { MDocument } from "@mastra/rag";
const MAX_CHARS = 4000; // Buffer below OpenAI's 4096 limit
async function chunkTextForSpeech(text: string): Promise<string[]> {
const doc = MDocument.fromText(text);
const chunks = await doc.chunk({
strategy: "recursive",
maxSize: MAX_CHARS,
overlap: 50,
separators: ["\n\n", "\n", ". ", "! ", "? "],
});
return chunks.map((chunk) => chunk.text);
}
Key features:
- Respects paragraph breaks (\n\n)
- Falls back to sentence boundaries (., !, ?)
- 50-character overlap prevents awkward breaks
- Maintains narrative flow across chunks
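Usage is a single call; the sketch below also checks the invariant that no chunk exceeds the limit (longStoryText is a placeholder):

const chunks = await chunkTextForSpeech(longStoryText);
console.log(`Split ${longStoryText.length} chars into ${chunks.length} chunks`);
for (const chunk of chunks) {
  if (chunk.length > MAX_CHARS) {
    throw new Error(`Chunk of ${chunk.length} chars exceeds the TTS limit`);
  }
}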
Audio Merging
When text is chunked, each piece is converted to audio separately, then merged into a single file:
// Generate audio for each chunk
const audioChunks: Buffer[] = [];
for (const chunk of chunks) {
const response = await openai.audio.speech.create({
model: "gpt-4o-mini-tts",
voice: "onyx",
input: chunk,
response_format: "mp3",
speed: 0.9,
});
const buffer = Buffer.from(await response.arrayBuffer());
audioChunks.push(buffer);
}
// Combine into single audio file
const combined = Buffer.concat(audioChunks);
Why merge?
- Single file = simpler playback
- No gaps between chunks
- Easier to upload and share
- Better browser compatibility
Simple Buffer.concat works here because MP3 is a frame-based format: decoders read frames sequentially, so back-to-back MP3 streams play as one continuous file in virtually all players.
Cloudflare R2 Upload
R2 Client Setup
Uses AWS SDK v3 with Cloudflare R2 endpoints:
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
const r2Client = new S3Client({
region: "auto",
endpoint: `https://${R2_ACCOUNT_ID}.r2.cloudflarestorage.com`,
credentials: {
accessKeyId: R2_ACCESS_KEY_ID,
secretAccessKey: R2_SECRET_ACCESS_KEY,
},
});
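The R2_* constants above come from environment variables (listed under Configuration below); a minimal sketch of how they might be read and checked before constructing the client:

// Read R2 credentials from the environment; fail fast if any are missing
const {
  R2_ACCOUNT_ID,
  R2_ACCESS_KEY_ID,
  R2_SECRET_ACCESS_KEY,
  R2_BUCKET_NAME = "longform-tts",
  R2_PUBLIC_DOMAIN,
} = process.env;

if (!R2_ACCOUNT_ID || !R2_ACCESS_KEY_ID || !R2_SECRET_ACCESS_KEY) {
  throw new Error("Missing R2 credentials; check R2_* environment variables");
}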
Upload Function
export async function uploadToR2(options: {
key: string;
body: Buffer;
contentType?: string;
metadata?: Record<string, string>;
}): Promise<{
key: string;
publicUrl: string | null;
bucket: string;
sizeBytes: number;
}> {
const { key, body, contentType = "audio/mpeg", metadata = {} } = options;
await r2Client.send(
new PutObjectCommand({
Bucket: R2_BUCKET_NAME,
Key: key,
Body: body,
ContentType: contentType,
Metadata: metadata,
}),
);
const publicUrl = R2_PUBLIC_DOMAIN ? `${R2_PUBLIC_DOMAIN}/${key}` : null;
return {
key,
publicUrl,
bucket: R2_BUCKET_NAME,
sizeBytes: body.length,
};
}
Key Generation
Unique keys with timestamps prevent collisions:
export function generateAudioKey(prefix?: string): string {
const timestamp = Date.now();
const random = Math.random().toString(36).substring(2, 15);
return `${prefix ? `${prefix}/` : ""}audio-${timestamp}-${random}.mp3`;
}
Example output: graphql-tts/audio-1707839472156-9k2jd8x4.mp3
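Putting key generation and upload together (audioBuffer here stands in for the merged Buffer from earlier):

const key = generateAudioKey("graphql-tts");
const { publicUrl, sizeBytes } = await uploadToR2({
  key,
  body: audioBuffer,
  contentType: "audio/mpeg",
  metadata: { voice: "onyx", model: "gpt-4o-mini-tts" },
});
console.log(`Uploaded ${sizeBytes} bytes to ${publicUrl ?? key}`);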
Metadata Extraction
The system calculates audio duration using music-metadata:
import { parseBuffer } from "music-metadata";
let duration: number | null = null;
try {
const metadata = await parseBuffer(audioBuffer, {
mimeType: "audio/mp3",
});
duration = metadata.format.duration || null;
} catch (error) {
console.warn("Failed to parse audio duration:", error);
}
Metadata stored with upload:
- voice — TTS voice used
- model — OpenAI model
- textLength — original text length
- chunks — number of chunks (if split)
- generatedBy — user email
- instructions — custom TTS instructions (optional)
GraphQL Implementation
Schema Definition
input GenerateOpenAIAudioInput {
text: String!
storyId: Int
voice: OpenAITTSVoice
model: OpenAITTSModel
speed: Float
responseFormat: OpenAIAudioFormat
uploadToCloud: Boolean
instructions: String
}
type GenerateOpenAIAudioResult {
success: Boolean!
message: String
audioBuffer: String
audioUrl: String
key: String
sizeBytes: Int
duration: Float
}
type Mutation {
generateOpenAIAudio(input: GenerateOpenAIAudioInput!): GenerateOpenAIAudioResult!
}
Resolver Implementation
export const generateOpenAIAudio: MutationResolvers["generateOpenAIAudio"] = async (
_parent,
args,
ctx,
) => {
const userEmail = ctx.userEmail;
if (!userEmail) {
throw new Error("Authentication required");
}
const {
text,
storyId,
voice = "ONYX",
model = "GPT_4O_MINI_TTS",
speed = 0.9,
responseFormat = "MP3",
uploadToCloud,
} = args.input;
// Map GraphQL enums to OpenAI API values
const openAIVoice = voice.toLowerCase();
const openAIModel =
  model === "GPT_4O_MINI_TTS" ? "gpt-4o-mini-tts"
  : model === "TTS_1_HD" ? "tts-1-hd" // assumes a TTS_1_HD enum value; the original ternary dropped tts-1-hd
  : "tts-1";
const format = responseFormat.toLowerCase();
// Handle chunking if needed
if (text.length > MAX_CHARS) {
const chunks = await chunkTextForSpeech(text);
const audioChunks: Buffer[] = [];
for (const chunk of chunks) {
const response = await openai.audio.speech.create({
model: openAIModel,
voice: openAIVoice,
input: chunk,
response_format: format,
speed,
});
audioChunks.push(Buffer.from(await response.arrayBuffer()));
}
const combined = Buffer.concat(audioChunks);
// Compute playback duration for the response (parseBuffer from music-metadata, as above)
let audioDuration: number | null = null;
try {
  const metadata = await parseBuffer(combined, { mimeType: "audio/mp3" });
  audioDuration = metadata.format.duration || null;
} catch {
  audioDuration = null;
}
// Upload to R2
if (uploadToCloud) {
const key = generateAudioKey("graphql-tts");
const result = await uploadToR2({
key,
body: combined,
contentType: `audio/${format}`,
metadata: {
voice: openAIVoice,
model: openAIModel,
textLength: text.length.toString(),
chunks: chunks.length.toString(),
generatedBy: userEmail,
},
});
// Save to story if provided
if (storyId) {
await saveAudioToStory(storyId, result.key, result.publicUrl, userEmail);
}
return {
success: true,
message: `Audio generated from ${chunks.length} chunks and uploaded to R2`,
audioBuffer: combined.toString("base64"),
audioUrl: result.publicUrl,
key: result.key,
sizeBytes: result.sizeBytes,
duration: audioDuration,
};
}
return {
success: true,
message: `Audio generated from ${chunks.length} chunks`,
audioBuffer: combined.toString("base64"),
audioUrl: null,
key: null,
sizeBytes: combined.length,
duration: audioDuration,
};
}
// ... handle short text similarly
};
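The elided short-text path mirrors the chunked branch without the loop. A sketch of what it might look like:

// Short text: a single TTS call, no chunking or merging
const response = await openai.audio.speech.create({
  model: openAIModel,
  voice: openAIVoice,
  input: text,
  response_format: format,
  speed,
});
const buffer = Buffer.from(await response.arrayBuffer());

if (uploadToCloud) {
  const key = generateAudioKey("graphql-tts");
  const result = await uploadToR2({
    key,
    body: buffer,
    contentType: `audio/${format}`,
    metadata: { voice: openAIVoice, model: openAIModel, generatedBy: userEmail },
  });
  if (storyId) {
    await saveAudioToStory(storyId, result.key, result.publicUrl, userEmail);
  }
  return {
    success: true,
    message: "Audio generated and uploaded to R2",
    audioBuffer: buffer.toString("base64"),
    audioUrl: result.publicUrl,
    key: result.key,
    sizeBytes: result.sizeBytes,
    duration: null,
  };
}
return {
  success: true,
  message: "Audio generated",
  audioBuffer: buffer.toString("base64"),
  audioUrl: null,
  key: null,
  sizeBytes: buffer.length,
  duration: null,
};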
Client Usage
import { useGenerateOpenAiAudioMutation } from "@/app/__generated__/hooks";
const [generateAudio, { loading }] = useGenerateOpenAiAudioMutation();
async function handleTextToSpeech() {
const result = await generateAudio({
variables: {
input: {
text: storyContent,
storyId,
voice: OpenAittsVoice.Onyx,
model: OpenAittsModel.Gpt_4OMiniTts,
speed: 0.9,
responseFormat: OpenAiAudioFormat.Mp3,
uploadToCloud: true,
},
},
});
const payload = result.data?.generateOpenAIAudio;
if (payload?.success && payload.audioUrl) {
// Use public R2 URL
setAudioSrc(payload.audioUrl);
} else if (payload?.audioBuffer) {
// Fallback to base64
const blob = base64ToBlob(payload.audioBuffer);
setAudioSrc(URL.createObjectURL(blob));
}
}
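The base64ToBlob helper used above is not shown in the original; a minimal browser-side version:

// Decode a base64 string into a Blob suitable for URL.createObjectURL
function base64ToBlob(base64: string, mimeType = "audio/mpeg"): Blob {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return new Blob([bytes], { type: mimeType });
}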
REST API Implementation
Streaming Options
The /api/tts endpoint supports multiple streaming modes:
import { NextRequest, NextResponse } from "next/server";
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export async function POST(request: NextRequest) {
const {
text,
voice = "alloy",
uploadToCloud,
streamFormat, // "sse" for Server-Sent Events
instructions,
} = await request.json();
// For short text with SSE streaming
if (streamFormat === "sse" && !uploadToCloud) {
const response = await openai.audio.speech.create({
model: "gpt-4o-mini-tts",
voice,
input: text,
response_format: "mp3",
speed: 0.9,
stream_format: "sse",
});
// Create SSE stream
const stream = new ReadableStream({
async start(controller) {
const reader = response.body?.getReader();
if (!reader) {
controller.close();
return;
}
while (true) {
const { done, value } = await reader.read();
if (done) break;
controller.enqueue(value);
}
controller.close();
},
});
return new NextResponse(stream, {
headers: {
"Content-Type": "text/event-stream",
"Cache-Control": "no-cache",
Connection: "keep-alive",
},
});
}
// Standard audio stream
const response = await openai.audio.speech.create({
model: "gpt-4o-mini-tts",
voice,
input: text,
response_format: "mp3",
speed: 0.9,
});
return new NextResponse(response.body, {
headers: {
"Content-Type": "audio/mp3",
"Cache-Control": "no-cache",
"Transfer-Encoding": "chunked",
},
});
}
Client Usage (Fetch API)
// Upload to R2
const response = await fetch("/api/tts", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
text: storyContent,
voice: "onyx",
uploadToCloud: true,
storyId: 123,
userEmail: "user@example.com",
}),
});
const data = await response.json();
// { success: true, audioUrl: "https://tts.yourdomain.com/audio-123456.mp3", key: "...", sizeBytes: 234567 }
// Direct streaming
const response2 = await fetch("/api/tts", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
text: "Hello world",
voice: "onyx",
}),
});
const blob = await response2.blob();
const audioUrl = URL.createObjectURL(blob);
audioElement.src = audioUrl;
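For the SSE mode, EventSource cannot be used because it only supports GET, so the stream is read manually. A sketch that assumes each SSE data line carries a JSON payload with a base64 audio field; verify the exact event shape your OpenAI SDK version emits. handleAudioChunk is a placeholder for your playback logic:

const sseResponse = await fetch("/api/tts", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ text: "Hello world", voice: "onyx", streamFormat: "sse" }),
});

const reader = sseResponse.body!.getReader();
const decoder = new TextDecoder();
let buffered = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffered += decoder.decode(value, { stream: true });
  // SSE data lines start with "data: "; incomplete lines stay in the buffer
  const lines = buffered.split("\n");
  buffered = lines.pop() ?? "";
  for (const line of lines) {
    if (!line.startsWith("data: ")) continue;
    try {
      const event = JSON.parse(line.slice(6));
      if (event.audio) {
        // event.audio is assumed to be a base64-encoded audio chunk
        handleAudioChunk(event.audio);
      }
    } catch {
      // ignore keep-alives and non-JSON lines
    }
  }
}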
Database Integration
Audio metadata is saved to the database when a storyId is provided:
async function saveAudioToStory(
storyId: number,
audioKey: string,
audioUrl: string | null,
userEmail: string,
): Promise<void> {
const now = new Date().toISOString();
await turso.execute({
sql: `UPDATE stories
SET audio_key = ?, audio_url = ?, audio_generated_at = ?, updated_at = ?
WHERE id = ? AND user_id = ?`,
args: [audioKey, audioUrl || "", now, now, storyId, userEmail],
});
}
Database schema (SQLite):
CREATE TABLE stories (
id INTEGER PRIMARY KEY,
goal_id INTEGER NOT NULL,
user_id TEXT NOT NULL,
content TEXT NOT NULL,
audio_key TEXT,
audio_url TEXT,
audio_generated_at TEXT,
created_at TEXT NOT NULL,
updated_at TEXT NOT NULL
);
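Reading the stored audio back for playback uses the same turso client (a sketch):

// Fetch the stored audio reference for a user's story
const result = await turso.execute({
  sql: `SELECT audio_url, audio_key, audio_generated_at
        FROM stories WHERE id = ? AND user_id = ?`,
  args: [storyId, userEmail],
});
const audio = result.rows[0] ?? null;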
Configuration
Environment Variables
# OpenAI
OPENAI_API_KEY=sk-proj-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
# Cloudflare R2
R2_ACCOUNT_ID=your-account-id-here
R2_ACCESS_KEY_ID=your-access-key-id-here
R2_SECRET_ACCESS_KEY=your-secret-access-key-here
R2_BUCKET_NAME=longform-tts
R2_PUBLIC_DOMAIN=https://tts.yourdomain.com
R2 Bucket Setup
1. Create R2 Bucket: In the Cloudflare dashboard, create a bucket named longform-tts
2. Generate API Token: Create an API token with R2 read/write permissions
3. Configure Custom Domain (optional):
   - Add custom domain in R2 bucket settings
   - Point DNS to R2 bucket URL
   - Enable public access for the bucket
4. CORS Configuration (if accessing from browser):
{
"AllowedOrigins": ["https://your-app.com"],
"AllowedMethods": ["GET", "HEAD"],
"AllowedHeaders": ["*"],
"ExposeHeaders": ["Content-Length", "Content-Type"],
"MaxAgeSeconds": 3600
}
Performance Considerations
Text Chunking
- 4000 char limit: safe buffer below OpenAI’s 4096
- 50 char overlap: prevents awkward sentence breaks
- Recursive splitting: maintains natural paragraph/sentence flow
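One consideration not covered above: the chunk loop shown earlier runs sequentially, so latency grows linearly with chunk count. If your OpenAI rate limits allow, the per-chunk requests can run concurrently, and Promise.all preserves chunk order; a sketch:

// Generate all chunk audio concurrently; Promise.all keeps the original order
const audioChunks = await Promise.all(
  chunks.map(async (chunk) => {
    const response = await openai.audio.speech.create({
      model: "gpt-4o-mini-tts",
      voice: "onyx",
      input: chunk,
      response_format: "mp3",
      speed: 0.9,
    });
    return Buffer.from(await response.arrayBuffer());
  }),
);
const combined = Buffer.concat(audioChunks);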
