Short answer: there isn’t one single 2025–2026 SOTA model, but there is a clear cluster of families that define the frontier right now:
- Neural codec language models (VALL-E 2, Koel-TTS, etc.)
- Diffusion / flow-matching TTS (NaturalSpeech 3, StyleTTS-ZS, etc.)
- Speech language models & real-time “voice LLMs” (GPT-4o-mini-tts, Kyutai Moshi, etc.)
- Industrial APIs (Azure Neural TTS, ElevenLabs, Play.ht, etc.)
- Open-source leaders (CosyVoice2, FishSpeech 1.5, IndexTTS-2, etc.)
I’ll break it down by what “SOTA” usually means in practice.
Modern TTS is evaluated with:
- MOS / CMOS / SMOS – subjective naturalness & similarity scores (1–5) from listeners.
- Human-fooling rate (HFR) – how often people mistake synthetic speech for real speech. A 2025 study tested 10 state-of-the-art systems (5 open, 5 commercial) and reported surprisingly high fooling rates for the best models.
- Objective metrics that correlate with human judgments (prosody, speaker similarity, intelligibility, etc.), used in one study to benchmark ~35 systems from 2008–2024.
Several current models hit “no statistically significant difference from natural speech” on standard read-speech datasets (LJSpeech, LibriTTS, etc.) in controlled tests.
So today “SOTA” usually means:
On curated benchmarks, listeners either can’t reliably tell synthetic from real speech, or prefer it only very slightly less often.
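To make the two headline metrics concrete, here is a minimal sketch of how a MOS comparison and a human-fooling rate are computed. All listener scores below are made-up toy data; real studies use many more raters and formal significance tests, but the "human parity" logic (overlapping confidence intervals) is the same in spirit:

```python
import statistics

def mos_ci(scores, z=1.96):
    """Mean MOS with a normal-approximation 95% confidence interval."""
    mean = statistics.mean(scores)
    half = z * statistics.stdev(scores) / len(scores) ** 0.5
    return mean, (mean - half, mean + half)

def human_fooling_rate(judgments):
    """Fraction of synthetic clips that listeners labeled 'real' (1 = fooled)."""
    return sum(judgments) / len(judgments)

# Toy 1-5 MOS ratings for one synthetic system vs. matched real recordings.
synthetic = [4.5, 4.2, 4.6, 4.4, 4.3, 4.7, 4.1, 4.5]
real      = [4.6, 4.4, 4.5, 4.7, 4.3, 4.6, 4.4, 4.5]

m_syn, ci_syn = mos_ci(synthetic)
m_real, ci_real = mos_ci(real)
# "Human parity" claims roughly mean: the intervals overlap, i.e. no
# statistically significant gap between synthetic and natural speech.
parity = ci_syn[1] >= ci_real[0] and ci_real[1] >= ci_syn[0]

hfr = human_fooling_rate([1, 1, 0, 1, 1, 0, 1, 1])  # 6 of 8 clips fooled listeners
```

With these toy numbers the intervals overlap (parity holds) and the HFR is 0.75; the 2025 HFR study cited above reports comparably high fooling rates for the best real systems.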
Neural codec language models work by:
- Encoding audio into discrete tokens with a neural audio codec.
- Treating speech generation as a language-modeling problem over those tokens.
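The first step, turning audio into discrete tokens, is just vector quantization against a learned codebook. A minimal sketch (random codebook and fake frames standing in for a trained codec like EnCodec; the sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained codec codebook: 1024 entries of 128 dims (made-up sizes).
CODEBOOK_SIZE, DIM = 1024, 128
codebook = rng.normal(size=(CODEBOOK_SIZE, DIM))

def encode(frames: np.ndarray) -> np.ndarray:
    """Vector-quantize each frame to its nearest codebook entry (a token id)."""
    # (T, K) squared distances from every frame to every codebook entry
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)          # one discrete token per frame

def decode(tokens: np.ndarray) -> np.ndarray:
    """Map token ids back to codebook vectors (a real codec adds a decoder net)."""
    return codebook[tokens]

frames = rng.normal(size=(50, DIM))   # 50 fake "audio frames"
tokens = encode(frames)               # sequence of ints in [0, 1024)
```

A VALL-E-style system then trains an autoregressive LM over such token sequences, conditioned on text and a few seconds of reference-speaker tokens; generation is just next-token prediction followed by the codec decoder.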
Key SOTA-ish systems:
- VALL-E 2 (Microsoft) – neural codec LM that explicitly claims human parity in zero-shot TTS on several benchmarks, with techniques like repetition-aware sampling and improved sequence modeling.
- Koel-TTS (2025) – LLM-based codec TTS with preference optimization (RL-style) + classifier-free guidance to boost robustness and speaker similarity.
- Recent “robust neural codec LM” variants – LLaMA-style backbones plus phoneme-position prediction to fix robustness issues in long utterances & spontaneous speech.
These systems are currently the top tier for zero-shot, prompt-based cloning: 2–5 seconds of reference audio is enough to get very high-fidelity speaker similarity and naturalness, including emotion and recording conditions.
Diffusion and related generative models dominate “pure quality”:
- NaturalSpeech 3 (Microsoft) – factorized diffusion over neural codec tokens; it decomposes content, prosody, timbre, and acoustic detail with factorized VQ and generates them with separate diffusion processes. It reports human-level quality and very strong zero-shot results.
- End-to-End TTS with human-level quality – work by Tan et al. (2024) formalizes “human-level” via MOS statistics and shows their system matches natural speech under that criterion.
- StyleTTS-ZS (2025) – efficient zero-shot TTS that aims to match large codec-based models in naturalness and speaker similarity while being lighter and faster to run.
If your metric is “how close to a high-quality studio recording does this sound?”, diffusion-style models and high-end neural codec LMs are the current SOTA.
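The factorized-VQ idea behind NaturalSpeech 3 is easy to sketch: instead of one codebook for everything, each speech attribute gets its own small codebook and its own token stream. The codebook sizes and feature dimensions below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 64  # per-attribute feature dimension (illustrative)

# One codebook per speech attribute, NaturalSpeech 3-style (sizes are assumptions).
factors = {"content": 256, "prosody": 128, "timbre": 64, "detail": 128}
codebooks = {name: rng.normal(size=(k, DIM)) for name, k in factors.items()}

def factorized_quantize(features: dict) -> dict:
    """Quantize each attribute's feature stream against its own codebook."""
    tokens = {}
    for name, x in features.items():
        cb = codebooks[name]
        d2 = ((x[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        tokens[name] = d2.argmin(axis=1)
    return tokens

# Fake per-frame attribute features for a 40-frame utterance.
feats = {name: rng.normal(size=(40, DIM)) for name in factors}
toks = factorized_quantize(feats)
# Each attribute now has its own token stream; NaturalSpeech 3 generates the
# streams with separate diffusion processes, then a decoder recombines them.
```

The payoff of this separation is controllability: you can swap the timbre stream (voice identity) while keeping the content and prosody streams fixed.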
Beyond pure TTS, there’s a shift to speech LMs that directly model speech tokens end-to-end:
- SpeechLMs survey (2025) – summarizes the trend of models that encode & generate speech tokens natively (not just text → speech), enabling speech-in / speech-out reasoning.
- GPT-4o-mini-tts and related OpenAI models – part of new speech-to-speech stacks released in 2025, enabling real-time voice assistants with integrated ASR + LLM + TTS.
- Kyutai Moshi – a “speech-text foundation model” for real-time, full-duplex (talk and listen simultaneously) dialogue, explicitly positioned as state-of-the-art for conversational voice agents.
These often sacrifice a tiny bit of studio-grade perfection for low latency and conversational dynamics (turn-taking, backchannels, interruptions), which is where 2025–2026 SOTA is moving.
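To see why latency dominates here, it helps to write out the turn-latency budget. The millisecond figures below are illustrative assumptions, not measured numbers for any specific product:

```python
# Rough end-to-end latency budget for a voice agent: time from the end of the
# user's speech to the first synthesized audio sample.
def turn_latency_ms(asr_final_ms: int, llm_first_token_ms: int,
                    tts_first_audio_ms: int) -> int:
    return asr_final_ms + llm_first_token_ms + tts_first_audio_ms

# Cascaded stack: separate ASR, LLM, and TTS stages add up (assumed budgets).
cascaded = turn_latency_ms(asr_final_ms=300, llm_first_token_ms=400,
                           tts_first_audio_ms=250)

# A full-duplex speech LM (Moshi-style) collapses the stages into one model,
# so the whole budget is a single time-to-first-audio (assumed here as 200 ms).
speech_lm = turn_latency_ms(0, 0, tts_first_audio_ms=200)
```

Under these assumed budgets the cascaded stack needs ~950 ms per turn versus ~200 ms for the integrated model, which is roughly the gap that makes interruptions and backchannels feel natural rather than laggy.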
Academic SOTA ≠ production SOTA, but several commercial APIs are consistently at the front, both in benchmarks and community tests:
- A 2025 benchmarking study of open-source / commercial systems found Microsoft Speech (Azure Neural TTS) to be the clearest overall leader when trading off latency and audio quality across many input lengths.
- A 2025 large-scale HFR study looked at 10 top systems, including open-source models and commercial offerings like ElevenLabs and Play.ht; the best systems fooled listeners at high rates in realistic conditions.
- Recent “best TTS APIs in 2025” industry overviews consistently list:
- ElevenLabs
- Microsoft Azure Neural TTS
- OpenAI GPT-4o-mini-tts
- Play.ht
- top cloud offerings from Google & others
If you just want the most natural-sounding voice in an API today, these are the usual contenders.
Because many research models aren’t fully released, people often look at open-source leaders instead of strict academic SOTA.
A 2025 roundup of open-source speech models highlights:
- CosyVoice2-0.5B
- Fish-Speech v1.5
- IndexTTS-2
as top choices balancing quality, latency, and ease of use.
On top of those:
- Several implementations inspired by NaturalSpeech 3, StyleTTS-ZS, and neural codec LMs (VALL-E-style) give you near-SOTA prosody and naturalness, especially for English and Chinese, with good zero-shot cloning.
These open models won’t always match the very best proprietary systems on every metric, but they’re close enough that casual listeners often can’t tell.
Across all these systems, the frontier is defined by:
- Neural audio codecs + token LMs
  - Audio → discrete tokens → big LM → tokens → waveform.
  - Gives generalization (zero-shot speakers, accents, styles) and strong in-context learning.
- Factorization of speech attributes
  - Separate modeling of content, prosody, timbre, and noise / room (NaturalSpeech 3 style).
- RL / preference optimization for audio
  - DiffRO and preference-optimized TTS (e.g., Koel-TTS) aim to directly optimize what listeners prefer, not just likelihood.
- New evaluation metrics
  - Responsible evaluation frameworks formalizing subjective tests (CMOS, SMOS) and abuse-resistance.
  - Human-fooling rate as a practical measure of “indistinguishability from human”.
- Real-time, full-duplex interaction
  - Models like Moshi and GPT-4o-based stacks optimize for latency, interruptions, barge-in, and natural conversational flow rather than just static samples.
- On-device / edge deployment
  - Work like Samsung’s Mamba-based streaming TTS shows SOTA-ish speech on constrained devices, with efficient neural codecs.
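On the preference-optimization point: one common recipe (DPO, direct preference optimization) can be written in a few lines. This is a generic sketch of the DPO loss on a pair of synthesized samples, not the exact objective used by Koel-TTS or DiffRO; the log-probabilities below are made-up numbers:

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """DPO loss on one (listener-preferred, dispreferred) pair of samples.
    logp_* are sequence log-probs under the policy being trained,
    ref_* under a frozen reference model."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1 / (1 + math.exp(-margin)))  # -log sigmoid(margin)

# If the policy favors the preferred sample more than the reference does,
# the margin is positive and the loss drops below log 2 (the chance level).
better = dpo_loss(logp_chosen=-10.0, logp_rejected=-12.0,
                  ref_chosen=-11.0, ref_rejected=-11.0)
worse  = dpo_loss(logp_chosen=-12.0, logp_rejected=-10.0,
                  ref_chosen=-11.0, ref_rejected=-11.0)
```

Minimizing this loss over many listener-labeled pairs pushes probability mass toward the samples people actually prefer, which is the "optimize what listeners prefer, not just likelihood" idea above.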
Because “SOTA” is task-dependent, you’d choose different families for:
- Studio-grade narration / audiobooks → diffusion and high-end codec models (NaturalSpeech 3-style, top commercial APIs).
- Zero-shot, prompt-based voice cloning → VALL-E-style codec LMs, Koel-TTS, strong proprietary APIs (ElevenLabs, etc.).
- Real-time assistants → GPT-4o-mini-tts, Moshi, and similar speech-LM stacks.
- Fully self-hosted → CosyVoice2, FishSpeech, IndexTTS-2, plus recent open implementations of StyleTTS-ZS / NaturalSpeech-3-like architectures.
If you tell me what you want to build (e.g., ultra-low-latency agent, cross-lingual cloning, long-form reading, on-device), I can give a very concrete short list of models and papers to start from.