Mistral Voxtral TTS: Open-Weight Model Beats ElevenLabs at 90ms Latency
Quick summary
Mistral released Voxtral-4B-TTS on March 26, 2026. 4B parameters, open weights, 90ms time-to-first-audio, 68.4% win rate vs ElevenLabs. At $0.016 per 1,000 chars it changes the TTS pricing floor.
Mistral released Voxtral-4B-TTS-2603 on March 26, 2026. It is a 4-billion-parameter text-to-speech model with open weights (CC BY NC 4.0 license), 90ms time-to-first-audio, support for 9 languages, zero-shot voice cloning from 3 seconds of reference audio, and API pricing of $0.016 per 1,000 characters. In head-to-head evaluation, it achieves a 68.4% win rate against ElevenLabs Flash v2.5 in multilingual voice cloning. For developers building voice applications, this changes the cost and deployment calculus significantly.
What Voxtral Actually Does
Voxtral is a streaming TTS model — it generates and outputs audio tokens in real time rather than generating the full audio before playback begins. The 90ms time-to-first-audio means a user hears the first audio chunk within 90 milliseconds of submitting text. For reference: ElevenLabs Flash v2.5 targets under 75ms, and OpenAI TTS-1 runs at approximately 150-300ms depending on text length.
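Time-to-first-audio is easy to measure for any streaming source, which makes figures like the ones above checkable on your own setup. The following is a minimal sketch, not Mistral's client: `fake_tts_stream` is a hypothetical stand-in that simulates a streaming TTS response with sleeps, and the chunk size and delays are made-up values. Only `measure_stream` carries the idea, and it works on any iterable of audio byte chunks.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_tts_stream(text: str, first_chunk_delay_s: float = 0.09,
                    chunk_delay_s: float = 0.02, n_chunks: int = 5) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS client (not a real API)."""
    time.sleep(first_chunk_delay_s)   # simulated delay before the first audio chunk
    yield b"\x00" * 960               # placeholder audio bytes
    for _ in range(n_chunks - 1):
        time.sleep(chunk_delay_s)     # simulated delay between later chunks
        yield b"\x00" * 960


def measure_stream(chunks: Iterable[bytes]) -> Tuple[float, float, int]:
    """Return (time_to_first_audio_s, total_time_s, total_bytes) for a chunk stream."""
    start = time.perf_counter()
    first_audio_s = None
    total_bytes = 0
    for chunk in chunks:
        if first_audio_s is None:
            # The clock stops for TTFA when the first chunk arrives.
            first_audio_s = time.perf_counter() - start
        total_bytes += len(chunk)
    total_s = time.perf_counter() - start
    if first_audio_s is None:
        raise ValueError("stream produced no audio chunks")
    return first_audio_s, total_s, total_bytes


if __name__ == "__main__":
    ttfa, total, size = measure_stream(fake_tts_stream("Hello from a simulated stream"))
    print(f"time to first audio: {ttfa * 1000:.0f} ms, total: {total * 1000:.0f} ms, bytes: {size}")
```

With a real client, pass its chunk iterator to `measure_stream` in place of the simulated one. Measured from the caller's side, the number includes the network round-trip, so it will usually be higher than a vendor's model-only figure.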
The voice cloning capability is notable. Zero-shot cloning requires no fine-tuning — you provide 3 seconds of reference audio and Voxtral matches the speaker's characteristics for new text. Few-shot cloning with slightly more reference audio improves quality further. This matches or exceeds what was possible only with proprietary APIs six months ago.
Language support: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. Multilingual voice cloning is where the 68.4% win rate was measured. ElevenLabs has historically been stronger in English than in other languages, and Voxtral specifically targets that gap.
The Pricing Breakdown
Voxtral costs $0.016 per 1,000 characters via the Mistral API. To put this in context:
| Provider | Price per 1K chars | Latency |
|---|---|---|
| Mistral Voxtral | $0.016 | 90ms |
| ElevenLabs Flash v2.5 | $0.030 | <75ms |
| OpenAI TTS-1 | $0.015 | 150-300ms |
| OpenAI TTS-1-HD | $0.030 | 150-300ms |
| ElevenLabs Turbo v2.5 | $0.011 | ~300ms |
Voxtral sits in the middle of the pricing distribution, and Mistral's own evaluation puts its multilingual voice cloning quality above ElevenLabs Flash v2.5, one of the two most expensive options in the table. For a voice AI application sending 10 million characters per month, the difference between Voxtral and ElevenLabs Flash is approximately $140/month ($160 vs $300). The gap scales linearly with volume: at 100 million characters per month it is $1,400/month ($1,600 vs $3,000).
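The arithmetic behind a comparison like this is simple enough to rerun for your own volume. The sketch below uses only the per-1,000-character prices from the table. The function name and the monthly volume are illustrative choices, and it ignores free tiers, volume discounts, and minimum charges.

```python
# Prices per 1,000 characters, copied from the table above (USD).
PRICE_PER_1K_CHARS_USD = {
    "Mistral Voxtral": 0.016,
    "ElevenLabs Flash v2.5": 0.030,
    "OpenAI TTS-1": 0.015,
    "OpenAI TTS-1-HD": 0.030,
    "ElevenLabs Turbo v2.5": 0.011,
}


def monthly_cost_usd(chars_per_month: int, price_per_1k: float) -> float:
    """Cost of synthesizing `chars_per_month` characters at a flat per-1K price."""
    return chars_per_month / 1000 * price_per_1k


if __name__ == "__main__":
    chars = 10_000_000  # illustrative monthly volume
    # Print providers from cheapest to most expensive at this volume.
    for provider, price in sorted(PRICE_PER_1K_CHARS_USD.items(), key=lambda kv: kv[1]):
        print(f"{provider:<24} ${monthly_cost_usd(chars, price):>10,.2f}")
```

Changing `chars` is the only edit needed to model a different workload, since every price in the table is a flat per-character rate.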
The self-hosting option matters even more for cost. Voxtral's 4B parameter model runs on consumer hardware — Mistral says modern laptops can handle it. In practice, a single A100 GPU can serve real-time TTS at reasonable concurrency for an internal application. For developers who already operate GPU infrastructure, the marginal cost of running Voxtral approaches zero beyond electricity.
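For teams that would have to rent or buy a GPU rather than reuse one, the question becomes where self-hosting breaks even against the API. Here is a rough sketch of that calculation. The $1,000/month fixed cost is a hypothetical placeholder, not a quoted GPU price, and the model leaves out engineering time and the commercial license that commercial self-hosting requires, since the article gives no figure for either.

```python
def break_even_chars_per_month(fixed_monthly_cost_usd: float,
                               api_price_per_1k: float = 0.016) -> float:
    """Monthly character volume above which a fixed self-hosting cost beats the API.

    `api_price_per_1k` defaults to the Mistral API price quoted in the article.
    """
    if api_price_per_1k <= 0:
        raise ValueError("api_price_per_1k must be positive")
    if fixed_monthly_cost_usd < 0:
        raise ValueError("fixed_monthly_cost_usd must not be negative")
    return fixed_monthly_cost_usd / api_price_per_1k * 1000


if __name__ == "__main__":
    hypothetical_gpu_cost = 1_000.0  # assumed USD per month, for illustration only
    chars = break_even_chars_per_month(hypothetical_gpu_cost)
    print(f"break-even volume: {chars:,.0f} characters per month")
```

Below the break-even volume the API is cheaper on cost alone. Privacy or latency requirements can still justify self-hosting at lower volumes.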
Open Weights Changes the Deployment Calculus
ElevenLabs, Deepgram, and the OpenAI TTS API are all closed — you send your text to their servers and receive audio back. This creates three problems for certain applications: latency (network round-trip added to model inference time), data privacy (your text leaves your infrastructure), and vendor lock-in (pricing and availability controlled by the provider).
Voxtral's open weights change all three. Self-hosted deployment eliminates the network round-trip, keeps text on your own infrastructure, and removes vendor dependency. The CC BY NC 4.0 license allows free use for non-commercial purposes and requires a separate commercial license for commercial self-hosting — but the Mistral API option handles commercial use at the published pricing.
For applications where audio content is sensitive — medical transcription voice interfaces, legal document readers, internal enterprise tools — self-hosting is often not optional regardless of cost. Voxtral is the first frontier-quality open-weight TTS model that makes this practical on consumer-grade hardware.
How It Compares to Kokoro and Other Open TTS Models
The open-source TTS landscape before Voxtral consisted of Kokoro-82M (82 million parameters, good quality in English, limited multilingual), StyleTTS2 (English-focused, high quality but slower), and XTTS v2 (Coqui, multilingual but lower quality than ElevenLabs). At 4B parameters, Voxtral is roughly 50 times the size of Kokoro-82M, a jump in scale for open-weight TTS models with comparable accessibility.
The benchmark methodology matters here. The 68.4% win rate is from Mistral's own evaluation in multilingual voice cloning — not an independent benchmark. For English-only applications, the gap between Voxtral and ElevenLabs Flash v2.5 may be narrower than 68.4% implies. For multilingual applications, particularly those involving Hindi or Arabic, Voxtral's advantage is likely real given ElevenLabs' historical English focus.
What Developers Should Build With This
Three application categories where Voxtral changes what is economically viable:
Real-time voice assistants at consumer scale. The 90ms latency is at the boundary of what feels instantaneous in a conversational interface. Combined with a fast LLM response (Claude Haiku or Gemini Flash), a full voice AI pipeline — user speech in, AI text generated, Voxtral audio out — can run end-to-end in under 500ms at reasonable cost.
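The under-500ms claim for voice assistants is a latency budget: the stages run in sequence, so their time-to-first-output figures add up. A small sketch of that bookkeeping follows. Only the 90ms TTS figure and the 500ms target come from the article. The speech-to-text and LLM numbers are hypothetical placeholders to replace with your own measurements.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: float


def remaining_budget_ms(target_ms: float, stages: Sequence[Stage]) -> float:
    """Latency left over after all sequential stages; negative means over budget."""
    return target_ms - sum(stage.latency_ms for stage in stages)


PIPELINE = [
    Stage("speech-to-text (hypothetical)", 150.0),
    Stage("LLM first token (hypothetical)", 200.0),
    Stage("Voxtral time-to-first-audio", 90.0),  # figure from the article
]

if __name__ == "__main__":
    left = remaining_budget_ms(500.0, PIPELINE)
    status = "within" if left >= 0 else "over"
    print(f"{status} budget, {abs(left):.0f} ms {'spare' if left >= 0 else 'too slow'}")
```

With TTS fixed at 90ms, the speech-to-text and LLM stages share the remaining 410ms, so the choice of LLM dominates whether the pipeline stays under target.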
Multilingual podcast and content generation. Voxtral's 9-language support with voice cloning enables content localization workflows that previously required separate model calls to different specialized providers per language. One model, one API call, 9 languages.
Privacy-sensitive enterprise TTS. Legal, medical, and financial applications that cannot send text to third-party APIs now have a production-quality self-hostable option, subject to a separate commercial license from Mistral for commercial deployments. Prior to Voxtral, the quality gap between open-source TTS and ElevenLabs was too large for customer-facing applications.
Key Takeaways
- Voxtral-4B-TTS-2603 released March 26, 2026 — 4B parameters, open weights (CC BY NC 4.0), available on Hugging Face
- 90ms time-to-first-audio — competitive with ElevenLabs Flash v2.5, faster than OpenAI TTS
- 68.4% win rate vs ElevenLabs Flash v2.5 in multilingual voice cloning benchmarks
- $0.016 per 1,000 characters via Mistral API — roughly half ElevenLabs Flash pricing
- Zero-shot voice cloning from 3 seconds of reference audio, no fine-tuning required
- 9 languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, Arabic
- Runs on consumer hardware — self-hosting is practical, eliminating third-party data exposure
- First frontier-quality open-weight TTS model to compete with closed commercial APIs
Frequently Asked Questions
What is Mistral Voxtral and when was it released?
Mistral Voxtral (Voxtral-4B-TTS-2603) is a 4-billion-parameter open-weight text-to-speech model released by Mistral AI on March 26, 2026. It supports streaming with 90ms time-to-first-audio, covers 9 languages, offers zero-shot voice cloning from 3 seconds of reference audio, and is available on Hugging Face under CC BY NC 4.0 license. The Mistral API charges $0.016 per 1,000 characters.
How does Mistral Voxtral compare to ElevenLabs?
Voxtral achieves a 68.4% win rate against ElevenLabs Flash v2.5 in multilingual voice cloning benchmarks (Mistral's own evaluation). ElevenLabs Flash v2.5 is slightly faster at under 75ms vs Voxtral's 90ms, but costs $0.030 per 1,000 characters compared to Voxtral's $0.016 — roughly twice as expensive. Voxtral's main advantage is multilingual quality (particularly Hindi and Arabic) and the option to self-host for data privacy.
Can Voxtral be self-hosted?
Yes. Voxtral's open weights are available on Hugging Face under CC BY NC 4.0 (free for non-commercial, requires commercial license for commercial self-hosting). Mistral says the model runs on modern laptops — making it the first frontier-quality TTS model practical on consumer hardware. Self-hosting eliminates the network round-trip latency, keeps text on your own infrastructure, and removes per-character API charges. For organizations with existing GPU infrastructure, marginal cost is near zero.
What languages does Mistral Voxtral support?
Voxtral supports 9 languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. The multilingual capability is one of its strongest differentiators vs ElevenLabs, which has historically been optimized for English. The 68.4% benchmark win rate against ElevenLabs is specifically in multilingual voice cloning evaluation.
What is Voxtral's voice cloning capability?
Voxtral supports zero-shot voice cloning — matching a speaker's voice characteristics for new text using just 3 seconds of reference audio, with no fine-tuning required. Few-shot cloning with additional reference audio improves quality further. This makes it viable for personalized voice applications, content localization, and audio dubbing workflows without the per-voice training cost that commercial providers typically charge.
Written by
Software Engineer based in Delhi, India. Writes about AI models, semiconductor supply chains, and tech geopolitics — covering the intersection of infrastructure and global events. 952+ posts cited by ChatGPT, Perplexity, and Gemini. Read in 167 countries.
