Alibaba Qwen Releases Qwen3-TTS - 97ms Ultra-Low Latency Voice Synthesis Model
On January 22, 2026, the Alibaba Qwen team open-sourced the Qwen3-TTS series of voice generation models: a speech synthesis system that supports voice cloning, voice creation, high-quality human-like speech generation, and natural-language voice control. The release is regarded as a major step forward for the speech synthesis field.
Core Innovation
Dual-Track Modeling
The core innovation of Qwen3-TTS is its Dual-Track hybrid streaming generation mechanism. Combined with a discrete multi-codebook language model, it models speech directly end to end, avoiding the information bottlenecks of traditional cascaded architectures (such as LM+DiT).
This innovative architecture achieves:
- Ultra-Low Latency: End-to-end synthesis latency as low as 97ms
- Instant Response: emits the first audio packet after as little as one character of input
- Dual Mode Support: Single model supports both streaming and non-streaming generation
This response speed approaches that of human conversation, making the model well suited to latency-sensitive scenarios such as live interaction, real-time translation, and AI customer service.
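To make the latency figure concrete, here is a minimal sketch of how time-to-first-packet is measured against a streaming synthesizer. The generator below is a simulated stand-in, not the actual Qwen3-TTS inference API:

```python
import time
from typing import Iterator

def fake_stream_tts(text: str) -> Iterator[bytes]:
    """Simulated stand-in for a streaming synthesizer: yields audio
    packets as they are generated. Swap in the real Qwen3-TTS
    streaming call once the model is loaded."""
    for _ in text:
        time.sleep(0.01)     # simulated per-step generation cost
        yield b"\x00" * 640  # simulated 20 ms of 16 kHz PCM16 audio

def time_to_first_packet_ms(text: str) -> float:
    """First-packet latency: the metric behind the 97 ms figure."""
    start = time.perf_counter()
    next(fake_stream_tts(text))  # block until the first packet arrives
    return (time.perf_counter() - start) * 1000.0

print(f"first packet after {time_to_first_packet_ms('你好，世界'):.1f} ms")
```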
Qwen3-TTS-Tokenizer-12Hz
The model relies on the Qwen3-TTS-Tokenizer-12Hz multi-codebook speech encoder, which compresses speech signals efficiently while retaining a strong representation:
- Fully preserves paralinguistic information (such as intonation, rhythm, and emotion)
- Preserves acoustic environment characteristics
- Reconstructs speech quickly and with high fidelity through a lightweight non-DiT architecture
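The "12Hz" in the name indicates 12 codec frames per second of audio. The article does not state how many codebooks each frame carries, so the back-of-the-envelope below assumes a codebook count purely for illustration:

```python
# Token budget for a 12 Hz multi-codebook tokenizer.
FRAME_RATE_HZ = 12   # codec frames per second, from the tokenizer's name
NUM_CODEBOOKS = 4    # hypothetical: not specified in this article

def tokens_for(seconds: float, codebooks: int = NUM_CODEBOOKS) -> int:
    """Total discrete tokens the LM must emit for `seconds` of audio."""
    return int(seconds * FRAME_RATE_HZ) * codebooks

for dur in (1, 10, 60):
    print(f"{dur:>3} s of audio -> {tokens_for(dur)} tokens")
```

At 12 frames per second, even a minute of speech stays within a modest token budget, which helps keep autoregressive generation fast.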
Discrete Multi-Codebook LM Architecture
The series adopts a discrete multi-codebook language model (LM) architecture that models the full speech signal end to end:
- Completely avoids information bottlenecks of traditional LM+DiT solutions
- Avoids cascading errors
- Significantly improves model versatility, generation efficiency, and performance ceiling
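Conceptually, the LM's output can be pictured as a grid with one row per codebook and one column per 12 Hz frame; each autoregressive step emits a full column rather than handing a single coarse token stream to a separate diffusion stage. A toy illustration, with all sizes assumed:

```python
import numpy as np

FRAME_RATE_HZ = 12   # from the tokenizer's name
NUM_CODEBOOKS = 4    # assumed for illustration
VOCAB_SIZE = 1024    # assumed codebook size

seconds = 2.0
frames = int(seconds * FRAME_RATE_HZ)

# One row per codebook, one column per frame. A multi-codebook LM predicts
# all rows of a column at each step, so the tokens already carry the full
# speech information and no second-stage model has to fill in detail.
tokens = np.random.randint(0, VOCAB_SIZE, size=(NUM_CODEBOOKS, frames))
print(tokens.shape)  # (4, 24): 4 codebooks x 24 frames for 2 s of audio
```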
Model Series
Qwen3-TTS is offered at two parameter scales to suit different deployment needs:
1.7B Model Series
Ultimate Performance, Powerful Control
Qwen3-TTS-12Hz-1.7B-VoiceDesign
- Performs voice design based on user-provided natural language descriptions
- Can freely define acoustic attributes, persona, and background information
- Creates unique customized voices
Qwen3-TTS-12Hz-1.7B-CustomVoice
- Provides style control over target voices through user instructions
- Supports 9 premium voices covering various combinations of gender, age, language, and dialect
- Can flexibly control voice, emotion, prosody, and other multi-dimensional acoustic attributes through instructions
Qwen3-TTS-12Hz-1.7B-Base
- Base model supporting rapid voice cloning from a 3-second user-provided audio sample
- Can serve as a starting point for fine-tuning custom models
- Provides maximum flexibility and customization space
0.6B Model Series
Balanced Performance and Efficiency
Qwen3-TTS-12Hz-0.6B-CustomVoice
- Supports 9 premium voices
- Significantly reduces resource consumption while maintaining good results
- Suitable for deployment on resource-constrained edge devices or mobile devices
Qwen3-TTS-12Hz-0.6B-Base
- Base model supporting 3-second rapid voice cloning
- Lower computational resource requirements
- Suitable for high-concurrency deployment scenarios
Core Features
3-Second Rapid Voice Cloning
The voice cloning capability is particularly notable:
- Only 3 seconds of reference audio needed to achieve high-fidelity zero-shot voice replication
- Cloned voices support seamless cross-language migration
- A voice cloned from Chinese audio can directly speak English, Japanese, Korean, and the other supported languages
- The original voice's characteristics are preserved throughout
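As a workflow sketch, cloning might look like the following. The `Qwen3TTS` class below is a placeholder stub, not the released API; consult the GitHub repository for the actual interface:

```python
import numpy as np
import soundfile as sf  # pip install soundfile

class Qwen3TTS:
    """Placeholder stub standing in for the real model class."""
    def __init__(self, model_id: str):
        self.model_id = model_id  # the real class would load weights here

    def clone(self, reference_wav: str, text: str, language: str = "auto"):
        """Stub: the real call returns `text` spoken in the cloned voice."""
        sr = 16000
        return np.zeros(sr * 2, dtype=np.float32), sr  # 2 s of silence

tts = Qwen3TTS("Qwen/Qwen3-TTS-12Hz-1.7B-Base")  # repo id assumed from naming
# A 3-second reference clip of the target speaker, then cross-lingual output:
audio, sr = tts.clone("speaker_ref_3s.wav", "Hello! Nice to meet you.", language="en")
sf.write("cloned_en.wav", audio, sr)
```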
Cross-Language/Dialect Zero-Loss Migration
- Supports 10 major languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
- Supports multiple Chinese dialects: Sichuan dialect, Beijing dialect, etc.
- Accurately reproduces accent and vocal character
- Opens new possibilities for multilingual content creation and localization applications
Natural Language Voice Design
The Voice Design feature lets users customize voices through natural-language instructions:
- “Use a gentle, encouraging mature female voice to tell stories”
- “Use an excited, high-pitched young male voice to commentate games”
- Model automatically adjusts intonation, emotion, and rhythm
- Generates highly personalized expressions
This “what you imagine is what you hear” control is particularly useful in audiobook production: a single narrator can voice multiple characters, with emotional shifts and dialect switching handled by the model.
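In code, the intended workflow might look like this sketch; `design_voice` is a placeholder standing in for the real VoiceDesign inference call:

```python
import numpy as np
import soundfile as sf  # pip install soundfile

def design_voice(description: str, text: str, sr: int = 16000) -> np.ndarray:
    """Stub: the real model maps (voice description, text) -> waveform."""
    return np.zeros(sr, dtype=np.float32)  # 1 s of silence as a placeholder

desc = "A gentle, encouraging mature female voice, warm and unhurried"
audio = design_voice(desc, "Once upon a time, in a quiet village...")
sf.write("designed_voice.wav", audio, 16000)
```

The key point is the interface: a free-form description replaces a fixed menu of style tags.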
Intelligent Context Understanding
The model has strong semantic understanding of input text:
- Can automatically adjust tone, rhythm, and emotion based on input text
- Adapts to different scenario needs
- Significantly improved robustness to input text noise
- Achieves humanized natural expression
Performance
Content Consistency (WER)
The model performs strongly on content-consistency evaluation, measured by word error rate (WER; lower is better):
- Chinese: WER 0.77
- English: WER 1.24
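For context, a TTS content-consistency WER is typically obtained by transcribing the synthesized audio with an ASR model and scoring the transcript against the input text. The scoring step can be reproduced with the `jiwer` package:

```python
import jiwer  # pip install jiwer

reference = "the quick brown fox jumps over the lazy dog"    # input text
hypothesis = "the quick brown fox jumped over the lazy dog"  # ASR transcript

# WER = (substitutions + insertions + deletions) / reference word count
print(f"WER = {jiwer.wer(reference, hypothesis) * 100:.2f}%")  # 11.11% here
```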
Controllable Speech Generation
Qwen3-TTS-12Hz-1.7B-CustomVoice shows strong performance in the following metrics:
- APS (Audio Prosody Similarity): High prosody similarity
- DSD (Duration Similarity Distance): Precise duration control
- RP (Rhythm Preservation): Excellent rhythm preservation
Voice Design
Qwen3-TTS-12Hz-1.7B-VoiceDesign achieves SOTA (State-of-the-Art) level in voice design tasks.
Speech Encoder
Qwen3-TTS-Tokenizer-12Hz performs strongly on the following reconstruction metrics:
- PESQ: Perceptual Evaluation of Speech Quality
- STOI: Short-Time Objective Intelligibility
- UTMOS: automatically predicted Mean Opinion Score
- SIM: speaker similarity
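These are standard codec-evaluation metrics: decode the tokens back to audio and compare against the original. PESQ and STOI can be reproduced with common open-source packages; the file paths below are placeholders for mono 16 kHz clips:

```python
import soundfile as sf   # pip install soundfile
from pesq import pesq    # pip install pesq
from pystoi import stoi  # pip install pystoi

ref, fs = sf.read("original.wav")        # ground-truth audio
deg, _ = sf.read("reconstructed.wav")    # tokenizer round-trip output
n = min(len(ref), len(deg))
ref, deg = ref[:n], deg[:n]              # align lengths for comparison

print("PESQ:", pesq(fs, ref, deg, "wb"))  # perceptual quality, wideband mode
print("STOI:", stoi(ref, deg, fs))        # intelligibility, range 0-1
```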
Application Scenarios
Intelligent Voice Assistants
- Provide natural voice interaction for smart home devices and in-car systems
- Support multiple languages and dialects
- Enhance user experience
Content Creation
- Quickly convert text to natural speech
- Support multiple voices and emotional expressions
- Suitable for audiobooks and video dubbing
- One person plays multiple roles, producing high-quality audio content
Education
- Provide multilingual, multi-voice speech output for language learning and online teaching
- Enhance learning effectiveness
- Support dialect teaching
Gaming and Entertainment
- Generate personalized voices for game characters
- Support emotion and tone adjustment
- Enhance game immersion
Customer Service
- Provide natural, friendly voice interaction for intelligent customer service
- Support real-time dialogue
- Reduce customer service costs
Live Streaming Interaction
- Ultra-low latency meets real-time interaction needs
- Support multilingual live streaming
- Enhance audience experience
Technical Advantages
End-to-End Architecture
- Avoids information bottlenecks of traditional cascaded architectures
- Reduces cascading errors
- Improves overall performance
Lightweight and Efficient
- The non-DiT architecture improves computational efficiency while preserving high-fidelity reconstruction
- 0.6B model suitable for edge device deployment
- 1.7B model pursues ultimate performance
Open Source Friendly
- The complete series is open-sourced on GitHub, Hugging Face, and ModelScope
- Supports full parameter fine-tuning
- Developers can easily build brand-specific voice personas
Open Source and Availability
Qwen3-TTS full series models are completely open-source, supporting:
- Free commercial use
- Local deployment
- Secondary development
- API calls
Access
- GitHub Repository: https://github.com/QwenLM/Qwen3-TTS
- HuggingFace Model Library: https://huggingface.co/collections/Qwen/qwen3-tts
- ModelScope: https://www.modelscope.cn/collections/Qwen/Qwen3-TTS
- Qwen API: available directly through the official API
- Qwen Official Blog: https://qwenlm.github.io/blog/qwen3-tts/
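As one concrete starting point, the weights can be fetched with `huggingface_hub`; the repository id below is inferred from the model naming in this article and should be checked against the actual collection:

```python
from huggingface_hub import snapshot_download  # pip install huggingface_hub

# Download a full model snapshot to the local cache and print its path.
local_dir = snapshot_download(repo_id="Qwen/Qwen3-TTS-12Hz-0.6B-Base")
print("model files downloaded to:", local_dir)
```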
Technical Significance
The open-sourcing of Qwen3-TTS brings multiple breakthroughs to the voice synthesis field:
- Ultra-Low Latency: 97ms end-to-end latency approaches human conversation response speed
- High-Fidelity Cloning: Voice cloning achievable with 3 seconds of audio
- Cross-Language Capability: Single voice supports seamless switching across 10 languages
- Natural Language Control: Voice design achievable through text descriptions
- Open Source Ecosystem: the full series is openly released for commercial use, deployment, and fine-tuning
With Qwen3-TTS now open source, the barrier to real-time, personalized, multilingual voice AI has dropped significantly for content creators, developers, and enterprise applications alike.