
Picking a text-to-speech model used to be simple. You had a handful of robotic-sounding options, and the biggest decision was which one sounded least like a GPS from 2009. That has changed completely. In 2026, TTS models generate speech that is nearly indistinguishable from human recordings, and the number of available options has exploded.
So how do you actually choose the right one?
Whether you are building a voice agent, dubbing a film, narrating an audiobook, or adding accessibility features to a website, the model you pick shapes the final product. Not every model is built for every job. Some are optimized for speed. Others prioritize expressiveness. A few are small enough to run on a phone.
At the most basic level, a TTS model converts written text into spoken audio. But modern systems do far more than just "read aloud."
Current TTS models use deep learning trained on thousands of hours of recorded speech. Rather than stitching together pre-recorded syllables (the older approach), neural TTS learns the patterns of human speech, including pitch, rhythm, pauses, and emphasis. The result is audio that carries natural inflection and emotional texture.
The process typically follows two stages. First, the model analyzes the text for pronunciation, phrasing, and prosody (the musical quality of speech). Second, a decoder generates the actual audio waveform. Advanced models like CAMB.AI's MARS8 family handle both stages in a single architecture, reducing latency and improving consistency.
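The two-stage split can be sketched with stand-in functions. Everything here is illustrative (toy "phonetic" analysis, a sine-tone decoder); real systems use neural networks for both stages, and none of these names come from an actual TTS API:

```python
import numpy as np

def analyze_text(text: str) -> list[dict]:
    """Stage 1 (stand-in): map each word to a crude unit with a prosody hint.
    A real frontend predicts phonemes, phrasing, and pitch contours."""
    return [{"unit": w.lower(), "duration_s": 0.05 * len(w)} for w in text.split()]

def decode_waveform(units: list[dict], sample_rate: int = 16_000) -> np.ndarray:
    """Stage 2 (stand-in): render each unit as a tone.
    A real decoder is a neural vocoder generating the waveform."""
    pieces = []
    for i, unit in enumerate(units):
        t = np.arange(int(unit["duration_s"] * sample_rate)) / sample_rate
        pieces.append(np.sin(2 * np.pi * (200 + 20 * i) * t))
    return np.concatenate(pieces) if pieces else np.zeros(0)

audio = decode_waveform(analyze_text("hello world"))
```

Single-architecture models collapse these two calls into one forward pass, which is where the latency and consistency gains come from.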
The jump in quality comes down to larger training datasets, better model architectures, and more sophisticated evaluation methods. Production-grade models now train on tens of thousands of hours of speech data, handling edge cases (unusual names, technical jargon, emotional shifts) much more reliably than earlier generations.
Not all TTS models solve the same problem. The four main categories serve very different use cases.
Real-time models are built for conversations. Voice agents, contact center systems, and live translation tools all need speech generated in milliseconds, not seconds. MARS8-Flash, for example, is a 600M-parameter model built specifically for real-time applications, delivering time-to-first-byte (TTFB) as low as 100ms. Models in this category prioritize speed and responsiveness above all else.
When the output needs to carry emotion, nuance, or character, expressive models are the right fit. Film dubbing, audiobook narration, and media localization demand TTS that can convey excitement, sadness, or urgency convincingly. MARS8-Pro balances speed and fidelity, making it well-suited for expressive dubbing content and scenarios where short or challenging reference audio is provided.
Some applications cannot rely on cloud connectivity. Automotive systems, embedded devices, and mobile apps need TTS that runs locally. MARS8-Nano, at just 50M parameters, is designed for these situations, delivering TTFB as low as 50ms depending on device hardware, and is currently deployed across partners like Broadcom.
For high-end production work (think film dubbing, where a director needs precise control over tone and delivery), instruction-based models allow users to guide output through text descriptions. MARS8-Instruct, a 1.2B-parameter model, lets users independently tune speaker identity and prosody using both a reference clip and a written description of the desired delivery style.
Numbers matter when comparing TTS models. Here are the metrics that actually indicate production readiness.
Production Quality (PQ) scores measure how professionally produced the audio sounds. Content Enjoyment (CE) scores capture how natural and engaging the speech feels. Both are scored on a 1-to-10 scale using automated evaluation tools like Meta's Audiobox-Aesthetics model.
When cloning a voice, the key question is how closely the generated speech matches the original speaker. Speaker similarity is measured using cosine similarity between speaker embeddings. MARS8-Pro scores 0.87 on WavLM speaker verification and 0.71 on CAM embeddings across the open-source MAMBA Benchmark, which uses an average reference length of just 2.3 seconds.
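The metric itself is simple: embed the reference and generated audio as vectors, then compute their cosine similarity, where 1.0 means identical direction. A minimal sketch (the toy vectors below stand in for real WavLM speaker embeddings, which have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings standing in for real speaker vectors.
reference = np.array([0.2, 0.7, 0.1, 0.6])
generated = np.array([0.25, 0.65, 0.15, 0.55])

score = cosine_similarity(reference, generated)
```

A score of 0.87 therefore means the cloned voice's embedding points almost the same way as the original speaker's.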
For real-time applications, TTFB (the time between sending a text request and receiving the first audio chunk) is the most important performance metric. Sub-200ms TTFB is the threshold for conversational applications.
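Measuring TTFB is just timing the gap before the first streamed chunk arrives. A sketch using a simulated stream (`fake_tts_stream` is a stand-in for a real streaming TTS API, not an actual endpoint):

```python
import time
from typing import Iterator, Tuple

def measure_ttfb(stream: Iterator[bytes]) -> Tuple[float, bytes]:
    """Return (TTFB in milliseconds, first audio chunk) for a streaming response."""
    start = time.perf_counter()
    first_chunk = next(stream)  # blocks until the first chunk arrives
    return (time.perf_counter() - start) * 1000, first_chunk

def fake_tts_stream() -> Iterator[bytes]:
    """Stand-in for a streaming TTS API; sleeps to simulate model latency."""
    time.sleep(0.05)
    yield b"\x00" * 1024  # first audio chunk
    yield b"\x00" * 1024

ttfb_ms, chunk = measure_ttfb(fake_tts_stream())
```

Note that this measures end-to-end TTFB as the client sees it, so network round-trip time is included — which is exactly what matters for conversational feel.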
Character Error Rate (CER) measures how accurately the model pronounces words, verified by automatic speech recognition. Lower is better, and a CER under 6% is considered strong for multilingual output.
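CER is the Levenshtein edit distance between the reference text and the ASR transcript of the generated audio, divided by the reference length. A self-contained sketch:

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def char_error_rate(reference: str, hypothesis: str) -> float:
    """CER = character-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)
```

So transcribing "hello world" as "hello word" (one deleted character out of eleven) gives a CER of about 9% — well above the 6% bar, even though only one letter is wrong.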
Global deployment requires models that handle multiple languages without sacrificing quality.
Leading TTS models now support dozens of languages. The MARS8 family covers languages representing 99% of the world's speaking population, with Premium support for languages trained on over 10,000 hours of data.
Cloning a voice in one language and generating speech in another is one of the hardest problems in TTS. The MAMBA Benchmark specifically tests this, with 70% of samples requiring cross-language cloning.
A model that supports "Spanish" may default to a single regional variant. Production use often requires distinguishing between Spanish (Spain) and Spanish (Mexico). Strong multilingual models offer language-region pairs rather than broad language codes.
The right model depends entirely on what you are building.
For voice agents, prioritize latency and consistency. An agent that takes 800ms to respond will feel sluggish. For contact center deployments, MARS8-Flash is purpose-built for real-time voice agents, with 600M parameters and sub-100ms TTFB on optimized hardware.
For dubbing and narration, prioritize expressiveness and speaker similarity. The model needs to capture emotional range while maintaining the original speaker's identity across languages. MARS8-Pro handles expressive dubbing content, audiobooks, and digital media production.
For website accessibility (EU compliance, WCAG standards), the priority is clarity and reliability. CAMB.AI's Text-to-Speech tool is designed for this use case, converting website text into natural-sounding audio for users with visual impairments or reading challenges.
A model that performs well in demos is not necessarily ready for production. Here is what separates demo-grade from production-grade.
Production TTS must handle concurrent requests without degrading quality or increasing latency. Dedicated infrastructure prevents the queueing delays that turn a 100ms TTFB into an 800ms wait.
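A simple way to check this before committing is a concurrency test: fire many requests at once and compare per-request TTFB against the single-request baseline. A sketch with `asyncio` (`fake_tts_request` simulates a streaming call; swap in a real client to test an actual endpoint):

```python
import asyncio
import time
from typing import List

async def fake_tts_request(req_id: int) -> float:
    """Stand-in for one streaming TTS call; returns its TTFB in milliseconds."""
    start = time.perf_counter()
    await asyncio.sleep(0.05)  # simulated time-to-first-byte
    return (time.perf_counter() - start) * 1000

async def load_test(concurrency: int) -> List[float]:
    """Fire `concurrency` requests at once and collect per-request TTFB."""
    return await asyncio.gather(*(fake_tts_request(i) for i in range(concurrency)))

ttfbs = asyncio.run(load_test(20))
worst_case = max(ttfbs)  # compare against your single-request baseline
```

If the worst-case TTFB under load is several times the baseline, requests are queueing — the degradation described above.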
Enterprise deployments often require VPC deployment for data sovereignty, API access for developers, and availability across major cloud providers. CAMB.AI's MARS8 models launch natively on top compute platforms, giving teams flexibility in deployment.
Trustworthy models publish their evaluation methodology and make benchmarks reproducible. CAMB.AI open-sourced the MAMBA Benchmark specifically so the broader community can independently validate results rather than relying on cherry-picked demo samples.
For industries like healthcare, finance, and media, data sovereignty matters. SOC 2 Type II certification, VPC deployment options, and dedicated GPU resources keep sensitive data within a controlled environment. CAMB.AI holds SOC 2 Type II certification and supports enterprise-grade security.
The TTS landscape in 2026 is rich with options. Start with your use case, check the benchmarks that matter, and test under real-world conditions before committing.
Whether you are a media professional or a voice AI product developer, this newsletter is your go-to guide for everything related to voice and localization technology.


