Modern neural TTS
One claim about neural TTS deserves more scrutiny than it gets: that you need massive datasets to get good results. The 2023 VALL-E paper cemented this assumption in many minds, using 50,000 hours of English speech for zero-shot voice cloning. Then, less than a year later, Sample-Efficient Diffusion (2024) came along with less than 2% of that data and beat VALL-E on intelligibility. That’s not incremental progress. That’s a direct challenge to an entire research direction. Let’s look at what actually works, what doesn’t, and why the conventional wisdom about neural TTS is due for a rewrite.
The Zero-Shot Mirage and the Data Efficiency Reckoning
VALL-E (2023) treats TTS as a language modeling problem over discrete audio codes from a neural codec. The claim is seductive: give it a 3-second clip of an unseen speaker, and it synthesizes coherent speech in that voice. The model achieves state-of-the-art zero-shot performance—but only after digesting 50,000 hours of audio. That’s an industrial-scale data pipeline most teams cannot replicate.
Sample-Efficient Diffusion (2024) flips the script. Using a latent diffusion architecture built on a U-Audio Transformer and trained on less than 1,000 hours of speech, it produces more intelligible output than VALL-E. Let that sink in: less than 2% of the data, better results. The secret is not magic—it’s a latent-space representation that decouples content and timbre more effectively than VALL-E’s discrete codec embeddings. The U-Audio Transformer scales efficiently to long sequences, avoiding the quadratic memory costs of full self-attention.
For any engineer evaluating a TTS system today: look past the “zero-shot” headline and ask how much data was used to train the acoustic backbone. If the answer is more than 5,000 hours, you’re paying in data for a problem that better architecture design could have solved.
Parallel Generation: Speed at What Cost?
When FastSpeech (2019) proposed a non-autoregressive feed-forward Transformer for mel-spectrogram generation, it was a watershed moment for latency. The model extracted attention alignments from a teacher encoder-decoder to train a phoneme duration predictor, and so sidestepped the sequential decoding that plagued Tacotron 2. Claimed speed-up for mel-spectrogram generation: 270×. No skipped words, no repetition errors, and you could adjust speaking rate with a single scalar.
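The "single scalar" control is easiest to see in the length-regulation step. The following is a minimal sketch, not FastSpeech's actual code: the function name `length_regulate`, the plain-list representation of hidden states, and the round-half-up rule are all assumptions made for illustration. Each phoneme state is repeated for its predicted number of mel frames, and `alpha` stretches or shrinks every duration at once.

```python
import math
from typing import List, Sequence


def length_regulate(
    phoneme_states: Sequence[Sequence[float]],
    durations: Sequence[int],
    alpha: float = 1.0,
) -> List[List[float]]:
    """Expand phoneme-level states to frame level.

    alpha > 1 lengthens every phoneme (slower speech),
    alpha < 1 shortens it (faster speech).
    """
    if len(phoneme_states) != len(durations):
        raise ValueError("need exactly one duration per phoneme state")
    frames: List[List[float]] = []
    for state, duration in zip(phoneme_states, durations):
        # Round half up so the result does not depend on banker's rounding.
        n_frames = max(0, math.floor(duration * alpha + 0.5))
        frames.extend(list(state) for _ in range(n_frames))
    return frames


states = [[1.0], [2.0], [3.0]]   # toy 1-dimensional phoneme states
predicted = [2, 3, 1]            # toy predicted durations, in frames
normal = length_regulate(states, predicted)
slow = length_regulate(states, predicted, alpha=2.0)
print(len(normal), len(slow))
```

Because all frames are produced in one pass from the durations, nothing here depends on previously generated output, which is what makes the downstream decoder parallelizable.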
Yet there’s a reason VALL-E and DiTAR (2025) still lean on autoregressive language models. FastSpeech’s parallel generation imposes a hard constraint: prosody is determined by the duration predictor (and, in the FastSpeech 2 follow-up, the variance adaptor), and the duration predictor is trained to match alignments extracted from the teacher. If the teacher’s prosody is flat, the student inherits that flatness. You cannot easily inject expressive variations that weren’t present in the teacher’s distribution.
DiTAR (2025) offers a middle path: patch-based autoregressive modeling. It aggregates speech into patches, processes them with an AR language model to capture long-range dependencies, then uses a diffusion transformer to fill in each patch. The computational load drops because patches are coarser than raw waveform frames, yet the model retains the flexibility to generate continuous speech with fine-grained prosody. It’s not as fast as FastSpeech, but it doesn’t need to be—it solves a different problem: controllable, long-form synthesis without the data budget of VALL-E.
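To make the "patches are coarser than frames" point concrete, here is a rough sketch of fixed-size patching. It is an illustration only: the helper `to_patches`, the patch size of 4, and the zero padding of the last patch are assumptions, not details taken from the DiTAR paper. The point is simply that the autoregressive model takes one step per patch rather than one per frame.

```python
from typing import List, Sequence


def to_patches(
    frames: Sequence[Sequence[float]], patch_size: int
) -> List[List[List[float]]]:
    """Group consecutive frames into fixed-size patches.

    The final patch is zero-padded so every patch has the same shape.
    """
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    if not frames:
        return []
    dim = len(frames[0])
    patches: List[List[List[float]]] = []
    for start in range(0, len(frames), patch_size):
        patch = [list(f) for f in frames[start:start + patch_size]]
        while len(patch) < patch_size:
            patch.append([0.0] * dim)  # padding frame (assumed convention)
        patches.append(patch)
    return patches


frames = [[float(i)] for i in range(10)]     # 10 toy frames
patches = to_patches(frames, patch_size=4)   # hypothetical patch size
print(f"AR steps: {len(patches)} instead of {len(frames)}")
```

With a patch size of P, the number of autoregressive steps falls by roughly a factor of P, and the per-patch detail is left to the diffusion component.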
Prosody: The Battle Between Interpretability and Power
Controlling prosody has always been the most human part of TTS. Do you want to manipulate F0 and duration explicitly, or let the model infer latent style factors?
Robust and Fine-grained Prosody Control (2018) introduced temporal structures in prosody embeddings at frame and phoneme levels. You can adjust pitch and amplitude with surgical precision, but you need to know what to adjust. The system learns these embeddings without extra supervision—an impressive trick—but the output remains bound to the training distribution.
Ctrl-P (2021) takes the explicit route: condition on F0, energy, and duration, either from ground truth or a predictor. This gives multiple renditions of the same text, but the features themselves are noisy and expensive to extract at inference.
At the other extreme, word-level text markup (2024) encodes prosodic knowledge into a latent quantized space using self-supervised pretrained models. No hand-labeling required. The markup becomes an interpretable additional input to the TTS model. This is far more practical than ToBI-based approaches (Fine-Grained Prosody Modeling Using ToBI Representation, 2021), which require expert linguists to label stress and intonation. ToBI works well for pitch-stressed languages like English, but it’s brittle across domains and languages.
The tension is clear: explicit control gives you knobs, but you might not know which knobs to turn. Latent discovery gives you power, but you cannot easily answer “why did the model make that pause there?” For production systems, I prefer the hybrid approach: start with a latent markup system for baseline naturalness, then layer explicit controls (pitch shift, speaking rate) for post-hoc adjustments.
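The explicit post-hoc layer can be very small. Below is a sketch of the two knobs mentioned above, assuming the system exposes a frame-level F0 contour in Hz (with 0.0 marking unvoiced frames) and per-phoneme durations in frames. The function names and those conventions are hypothetical; only the semitone relation (a factor of 2 per 12 semitones) is standard.

```python
from typing import List, Sequence


def shift_f0(f0_hz: Sequence[float], semitones: float) -> List[float]:
    """Shift a pitch contour by a number of semitones.

    Unvoiced frames (marked as 0.0 by assumption) are left untouched.
    """
    factor = 2.0 ** (semitones / 12.0)
    return [f * factor if f > 0 else 0.0 for f in f0_hz]


def scale_durations(durations: Sequence[int], rate: float) -> List[int]:
    """Rescale phoneme durations; rate > 1 means faster speech.

    Every phoneme keeps at least one frame so none is dropped.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    return [max(1, int(d / rate + 0.5)) for d in durations]


contour = [100.0, 0.0, 200.0]
print(shift_f0(contour, 12))             # one octave up
print(scale_durations([4, 6, 1], 2.0))   # twice as fast
```

These adjustments are deliberately dumb: they sit on top of whatever the latent model produced and change nothing about why it produced it.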
Diffusion TTS: Sampling Cost Is the Real Tax
Guided-TTS (2021) combined an unconditional diffusion probabilistic model with a separately trained phoneme classifier for classifier guidance. The trick of normalizing the classifier gradient by its norm reduced pronunciation errors significantly. The quality was competitive with Grad-TTS without needing target speaker transcripts during training. But the sampling process required hundreds of iterative denoising steps—far too slow for real-time applications.
Fast Grad-TTS (2022) tackled this head-on, comparing progressive distillation, GAN-based diffusion, and latent score-based samplers to accelerate reverse diffusion on CPUs. The distilled models run in a fraction of the steps, but every acceleration technique introduces trade-offs: distillation loses high-frequency detail, GANs introduce artifacts, latent score models need careful tuning.
DiTAR’s hybrid approach sidesteps the sampling bottleneck: the diffusion transformer only operates on patches, not full-length sequences. Fewer steps, lower memory. The autoregressive language model handles the global structure, and the diffusion model refines local details. This division of labor makes long-form generation tractable. For anyone deploying a TTS system today, I’d bet on hybrid models over pure diffusion or pure AR—they give you more levers for quality-latency optimization.
Evaluation: MOS Is Not Good Enough
The default evaluation metric for neural TTS remains the Mean Opinion Score (MOS), often collected via crowdsourcing. The 2021 survey on neural speech synthesis catalogs this practice across hundreds of papers. But MOS is a blunt instrument. It conflates naturalness, intelligibility, and speaker similarity into a single number. Worse, raters’ opinions are heavily biased by the first few seconds of audio.
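One low-cost mitigation is to collect and report ratings per dimension, each with an interval, rather than a single pooled number. The sketch below assumes ratings are already grouped by dimension; the dimension names, the `summarize` helper, and the normal-approximation 95% interval (z = 1.96) are illustrative choices, not a prescribed protocol.

```python
import math
from statistics import mean, stdev
from typing import Dict, Sequence, Tuple


def mos_with_ci(ratings: Sequence[float], z: float = 1.96) -> Tuple[float, float]:
    """Return (mean opinion score, half-width of an approximate 95% CI)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate spread")
    half_width = z * stdev(ratings) / math.sqrt(len(ratings))
    return mean(ratings), half_width


def summarize(
    ratings_by_dimension: Dict[str, Sequence[float]]
) -> Dict[str, Tuple[float, float]]:
    """Score each rated dimension separately instead of pooling them."""
    return {name: mos_with_ci(r) for name, r in ratings_by_dimension.items()}


report = summarize({
    "naturalness": [4.0, 4.0, 5.0, 3.0],       # toy ratings
    "speaker_similarity": [3.0, 5.0, 4.0, 2.0],
})
for name, (score, half) in report.items():
    print(f"{name}: {score:.2f} +/- {half:.2f}")
```

Two systems whose intervals overlap on every dimension should not be ranked on the strength of a pooled MOS alone.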
The literature points to two improvements: Rapid Prosody Transcription (RPT) and the use of multiple speech representations in quality estimators. RPT shifts the evaluation from overall quality to pinpointing exact prosodic errors: did the stress fall on the wrong syllable? Was the pause too long? This granularity is essential for diagnosing failures, not just ranking systems.
Automatic quality prediction is reported to need at least eight different representations (spectrograms, x‑vectors, pitch contours, etc.). One specific recommendation is to include x-vectors that model T60 reverberation time, because room acoustics heavily influence perceived quality, even when the synthetic speech itself is anechoic. If you’re building a voice cloning pipeline, measure both MOS and speaker similarity, but also run an RPT analysis on the worst-performing sentences. That’s where the real bugs live.
Practical Decision Framework
Based on the evidence from these papers, here’s how I choose a TTS approach for a project:
- If you have under 1k hours of target-domain data and need intelligibility over raw expressiveness: Use Sample-Efficient Diffusion (latent U-Audio Transformer). You’ll get better word comprehension than any discrete-codec model trained on 50k hours.
- If you need zero-shot voice cloning from short samples and can budget >5k hours of pretraining data: VALL-E or similar codec language models are viable, but expect higher computational costs and potential artifacts on unseen accents.
- If latency is critical (sub-100 ms per utterance): FastSpeech-based parallel generation with a lightweight variance adaptor. Accept that prosody may be bland.
- If you need expressive long-form speech with controllable prosody: Use a hybrid autoregressive-diffusion system like DiTAR or a latent markup model (word-level text markup + a diffusion vocoder).
- For evaluation: Always supplement MOS with RPT and an automatic quality estimator that includes at least eight feature types. Measure intelligibility under noise (e.g., vocal effort modeling from a 2022 paper) if your deployment environment includes masking noise.
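The model-selection bullets above can be written down as a small rule function. This is only a sketch of the heuristics in this section: the function name, the parameter names, and especially the order in which the rules are checked are assumptions, since the list does not say which requirement wins when several apply.

```python
from typing import Optional


def choose_tts_approach(
    data_hours: float,
    *,
    zero_shot: bool = False,
    latency_budget_ms: Optional[float] = None,
    expressive_long_form: bool = False,
) -> str:
    """Map project constraints to one of the approaches discussed above."""
    # Assumed priority: hard latency limits first, then zero-shot cloning,
    # then expressiveness, then the low-data default.
    if latency_budget_ms is not None and latency_budget_ms < 100:
        return "FastSpeech-style parallel generation"
    if zero_shot and data_hours > 5000:
        return "codec language model (VALL-E-style)"
    if expressive_long_form:
        return "hybrid AR-diffusion (DiTAR-style) or latent markup model"
    if data_hours < 1000:
        return "latent diffusion (Sample-Efficient Diffusion-style)"
    return "no single recommendation; compare candidates empirically"


print(choose_tts_approach(500))
print(choose_tts_approach(10_000, zero_shot=True))
print(choose_tts_approach(800, latency_budget_ms=50))
```

Writing the rules this way mostly exposes the gaps: a project with 3,000 hours and no special requirement falls through to the last branch, which is an honest reflection of what the evidence covers.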
One more trick from the literature: When applying classifier guidance in a diffusion TTS, normalize the classifier gradient by its norm. Guided-TTS (2021) showed that this simple operation drastically reduces pronunciation errors. Implement it before you tune the sampling schedule.
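As a rough illustration of that normalization, the sketch below combines an unconditional score with a classifier gradient whose magnitude has been rescaled. The exact form is an assumption: here the gradient is divided by its own norm and multiplied by the norm of the score, so that `guidance_scale` is the only free strength parameter. The name `guided_score` and the flat-list vectors are hypothetical, and a real implementation would operate on tensors inside the sampler.

```python
import math
from typing import List, Sequence


def l2_norm(vector: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in vector))


def guided_score(
    score: Sequence[float],
    classifier_grad: Sequence[float],
    guidance_scale: float = 1.0,
    eps: float = 1e-8,
) -> List[float]:
    """Add a norm-normalized classifier gradient to the unconditional score."""
    if len(score) != len(classifier_grad):
        raise ValueError("score and classifier gradient must match in size")
    # Rescale so the guidance term has the same magnitude as the score,
    # regardless of how large the raw classifier gradient is.
    alpha = l2_norm(score) / (l2_norm(classifier_grad) + eps)
    return [
        s + guidance_scale * alpha * g
        for s, g in zip(score, classifier_grad)
    ]


small = guided_score([3.0, 4.0], [0.0, 10.0])
large = guided_score([3.0, 4.0], [0.0, 10000.0])
print(small, large)  # nearly identical despite a 1000x larger raw gradient
```

The practical benefit is that the guidance strength no longer drifts with the classifier's gradient scale across noise levels, which is why it makes sense to fix this before tuning the sampling schedule.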
The Bigger Picture
The field is fragmenting into two camps. The first says “more data, bigger models, discrete codes.” The second says “smarter latent representations, less data, diffusion.” The Sample-Efficient Diffusion result suggests the second camp has a strong advantage for cost-sensitive production environments. But the VALL-E camp has latched onto zero-shot generalization, which remains elusive for diffusion models—at least with current architectures.
DiTAR (2025) hints at a convergence: patch-based AR models handle the long-range dependencies that diffusion alone struggles with, while diffusion refines local quality that pure AR models muddle. I expect this hybrid pattern to dominate the next generation of production TTS systems.
If you’re building a speech product in 2025, ignore the hype around data scale. Start with a diffusion backbone trained on under 1k hours. Add AR patches for prosody. Use latent markup for controllability. Evaluate with RPT. And never trust an MOS score without reading the listening test design. The rest is engineering.
