Higgs TTS 3: Boson AI's 4B Multilingual Speech Model with 100+ Language Support
Higgs TTS 3 is a 4B-parameter text-to-speech model for voice agent applications. It supports 100+ languages, zero-shot voice cloning, expressive emotional control, and inline prosody controls and sound effects.
Overview
Released by Boson AI on June 4, 2026, Higgs TTS 3 (model ID: bosonai/higgs-tts-3-4b) is a 4-billion-parameter text-to-speech model designed for voice agent and conversational AI applications. Unlike traditional TTS systems that merely "read" text, Higgs TTS 3 is built to "speak": it generates expressive, natural-sounding conversational speech with emotional nuance.
The model uses the Higgs multimodal architecture, which is based on Qwen3 and has an autoregressive decoder that consumes interleaved text and audio tokens. The Higgs Tokenizer encodes audio into 8 codebooks at 25 frames per second (8 × 25 = 200 codes per second of audio), arranged in a staggered delay pattern. The generated codes are then decoded back into a waveform.
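To make the staggered delay pattern concrete, here is a minimal sketch of one common form of it. The article does not specify the exact offsets Higgs TTS 3 uses, so the one-step-per-codebook shift below is an assumption (it follows the delay pattern known from other multi-codebook audio models), and `PAD`, `apply_delay_pattern`, and `undo_delay_pattern` are hypothetical names introduced only for illustration.

```python
from typing import List

NUM_CODEBOOKS = 8    # from the article
FRAME_RATE_HZ = 25   # from the article
PAD = -1             # hypothetical placeholder for "no code at this step"


def apply_delay_pattern(codes: List[List[int]], pad: int = PAD) -> List[List[int]]:
    """Shift codebook k right by k steps (assumed stagger of one step per codebook).

    `codes` is a list of K rows, each holding T frame codes.
    The result has K rows of length T + K - 1.
    """
    num_codebooks = len(codes)
    num_frames = len(codes[0])
    delayed = []
    for k, row in enumerate(codes):
        left = [pad] * k                          # delay before codebook k starts
        right = [pad] * (num_codebooks - 1 - k)   # fill so all rows have equal length
        delayed.append(left + list(row) + right)
    return delayed


def undo_delay_pattern(delayed: List[List[int]]) -> List[List[int]]:
    """Invert apply_delay_pattern, recovering the aligned K x T grid."""
    num_codebooks = len(delayed)
    num_frames = len(delayed[0]) - num_codebooks + 1
    return [row[k:k + num_frames] for k, row in enumerate(delayed)]


def tokens_per_second(num_codebooks: int = NUM_CODEBOOKS,
                      frame_rate_hz: int = FRAME_RATE_HZ) -> int:
    """Number of audio codes needed to represent one second of audio."""
    return num_codebooks * frame_rate_hz


if __name__ == "__main__":
    # Three frames of toy codes: codebook k, frame t has value 10*k + t.
    grid = [[10 * k + t for t in range(3)] for k in range(NUM_CODEBOOKS)]
    for row in apply_delay_pattern(grid):
        print(row)
    print("codes per second:", tokens_per_second())
```

Under this assumption, T frames take T + 7 decoding steps, and at each step the decoder emits one code per codebook, each belonging to a different frame.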
Key Features
| Feature | Description |
|---|---|
| Parameters | 4 billion |
| Languages | 100+ (extensive multilingual coverage) |
| Architecture | Higgs multimodal Qwen3-based autoregressive decoder |
| Voice Cloning | Zero-shot voice cloning from reference audio |
| Control | 21 emotions, 10 prosody controls, inline sound effects |
| License | Research and non-commercial |
| Library | Transformers (Hugging Face) |
Multilingual Support
Higgs TTS 3 supports over 100 languages across major regions and language families:
- European: English, Spanish, French, German, Italian, Portuguese, Russian, Polish, Dutch, Swedish, Norwegian, Danish, Finnish, Greek, Czech, Romanian, Hungarian, Ukrainian, and many more
- Asian: Chinese (Mandarin/Cantonese), Japanese, Korean, Hindi, Bengali, Tamil, Telugu, Urdu, Vietnamese, Thai, Indonesian, Malay, Burmese, Khmer, Lao, and more
- Middle Eastern / African: Arabic, Hebrew, Turkish, Persian (Farsi), Swahili, Amharic, Hausa, Yoruba, Igbo, Zulu, Xhosa, and more
- Other: Tagalog, Nepali, Sinhala, Georgian, Armenian, Azerbaijani, Kazakh, Uzbek, and many more
Expressive Control
Higgs TTS 3 provides fine-grained control over speech output through inline control tags embedded in the input text:
21 Emotions (sentence-level)
affection, amusement, anger, arousal, awe, bitterness, confusion, contemplation, contentment, determination, disgust, elation, enthusiasm, fear, helplessness, longing, pride, relief, sadness, shame, surprise
10 Prosody Controls
Speed: speed_very_slow, speed_slow, speed_fast, speed_very_fast
Pitch: pitch_low, pitch_high
Expressiveness: expressive_more, expressive_less
Pauses: pause, long_pause
Inline Sound Effects
Sound effects can be triggered inline: cough, laughter, sigh, applause, bell, knock, and many more.
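The usage examples in this article write control tags as `<|kind:value|>`. As a rough sketch of how an application might check its input before sending it to the model, the snippet below extracts such tags and flags emotion names that are not among the 21 listed above. The helper names (`extract_tags`, `strip_tags`, `unknown_emotions`) are hypothetical and not part of any Boson AI API. Only emotion values are validated: the article does not show the exact tag syntax for prosody controls, and the sound effect list is open-ended.

```python
import re
from typing import List, Tuple

# The 21 emotion names listed in the article.
EMOTIONS = {
    "affection", "amusement", "anger", "arousal", "awe", "bitterness",
    "confusion", "contemplation", "contentment", "determination", "disgust",
    "elation", "enthusiasm", "fear", "helplessness", "longing", "pride",
    "relief", "sadness", "shame", "surprise",
}

# Assumed tag shape, inferred from the article's examples: <|kind:value|>
TAG_RE = re.compile(r"<\|([a-z_]+):([a-z_]+)\|>")


def extract_tags(text: str) -> List[Tuple[str, str]]:
    """Return (kind, value) pairs for every control tag, in order of appearance."""
    return TAG_RE.findall(text)


def strip_tags(text: str) -> str:
    """Return the plain text to be spoken, with all control tags removed."""
    return TAG_RE.sub("", text)


def unknown_emotions(text: str) -> List[str]:
    """Return emotion values in the text that are not in the documented list."""
    return [value for kind, value in extract_tags(text)
            if kind == "emotion" and value not in EMOTIONS]


if __name__ == "__main__":
    line = "<|emotion:elation|>Welcome aboard, we are thrilled to have you here!"
    print(extract_tags(line))
    print(strip_tags(line))
    print(unknown_emotions("<|emotion:joy|>Hello there."))
```

Rejecting unknown emotion names on the application side is a design choice, not a documented requirement; how the model itself treats an unrecognized tag is not stated in the article.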
Example Usage
<|emotion:elation|>Welcome aboard, we are absolutely thrilled to have you here!
<|sfx:cough|>Ahem, let me begin today's presentation.
<|style:whispering|>Come closer, I have a little secret to share.
Zero-Shot Voice Cloning
The model supports zero-shot voice cloning from a short reference audio sample, allowing it to synthesize speech in a target voice without any fine-tuning. This makes it suitable for:
- Voice agent applications with consistent character voices
- Multilingual content creation in a single voice
- Personalized speech synthesis
Availability
- Hugging Face: bosonai/higgs-tts-3-4b
- Blog Post: Boson AI Blog: Higgs Audio v3
Summary
Higgs TTS 3 is an open-weight multilingual speech synthesis model that combines a 4B-parameter backbone with coverage of 100+ languages, inline control over emotion and prosody, and zero-shot voice cloning. For developers building voice agents or multilingual speech applications, it is a research-grade option: the license restricts it to research and non-commercial use.