
Alibaba Qwen Releases Qwen3-TTS - 97ms Ultra-Low Latency Voice Synthesis Model

On January 22, 2026, the Alibaba Qwen team officially open-sourced the Qwen3-TTS series of voice generation models: a voice synthesis system that supports voice cloning, voice design, high-quality human-like speech generation, and natural-language voice control. The release is regarded as a major breakthrough in the voice synthesis field.

Core Innovation

Dual-Track Modeling

The core innovation of Qwen3-TTS is its Dual-Track hybrid streaming generation mechanism. Combined with a discrete multi-codebook language model, it models speech end to end, avoiding the information bottlenecks of traditional cascaded architectures (such as LM+DiT).

This innovative architecture achieves:

  • Ultra-Low Latency: End-to-end synthesis latency as low as 97ms
  • Instant Response: The first audio packet can be emitted after just one character of input
  • Dual-Mode Support: A single model handles both streaming and non-streaming generation

This response speed approaches that of human conversation, making the model a strong fit for latency-sensitive scenarios such as live interaction, real-time translation, and AI customer service.
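To make the dual-mode design concrete, here is a minimal Python sketch of how the two generation modes might be consumed. MockQwen3TTS, its method names, and the 80ms packet size are illustrative placeholders, not the official API; the real interface is defined in the Qwen3-TTS repository.

```python
import time
from typing import Iterator

import numpy as np


class MockQwen3TTS:
    """Stand-in model: yields short audio packets as they are 'generated'."""

    SAMPLE_RATE = 24_000
    PACKET = int(0.08 * SAMPLE_RATE)  # one ~80ms audio packet

    def stream(self, text: str) -> Iterator[np.ndarray]:
        # Streaming mode: emit the first packet as soon as it is ready.
        for _ in text:
            yield np.zeros(self.PACKET, dtype=np.float32)

    def synthesize(self, text: str) -> np.ndarray:
        # Non-streaming mode: same model, full waveform in one call.
        return np.concatenate(list(self.stream(text)))


tts = MockQwen3TTS()
t0 = time.perf_counter()
first_packet = next(tts.stream("Hello"))  # time-to-first-audio
print(f"first packet after {(time.perf_counter() - t0) * 1e3:.2f} ms")
full_audio = tts.synthesize("Hello")      # blocks until the whole clip is ready
```

The streaming path is what makes the 97ms figure meaningful: playback can begin after the first packet, long before the full clip exists.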

Qwen3-TTS-Tokenizer-12Hz

The model relies on the Qwen3-TTS-Tokenizer-12Hz multi-codebook speech encoder, which compresses speech signals efficiently while preserving a rich representation:

  • Fully preserves paralinguistic information (such as intonation, rhythm, and emotion)
  • Preserves acoustic-environment characteristics
  • Reconstructs speech quickly and with high fidelity through a lightweight non-DiT architecture (a rough compression estimate follows the list)
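To see why a 12Hz token rate is so compact, here is a back-of-envelope estimate; the codebook count and codebook size are assumptions made purely for illustration, as the release defines the real values.

```python
# Rough compression estimate for a 12Hz multi-codebook tokenizer.
FRAME_RATE_HZ = 12        # stated token frame rate
NUM_CODEBOOKS = 4         # assumption, for illustration only
BITS_PER_TOKEN = 10       # assumption: 2**10 = 1024 entries per codebook

token_bitrate = FRAME_RATE_HZ * NUM_CODEBOOKS * BITS_PER_TOKEN  # 480 bit/s
pcm_bitrate = 24_000 * 16                                       # 384,000 bit/s
print(f"~{pcm_bitrate / token_bitrate:.0f}x smaller than 24 kHz 16-bit PCM")
```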

Discrete Multi-Codebook LM Architecture

Qwen3-TTS adopts a discrete multi-codebook language model (LM) architecture that models the full speech signal end to end:

  • Avoids the information bottlenecks of traditional LM+DiT pipelines
  • Eliminates cascading errors between stages
  • Significantly raises versatility, generation efficiency, and the performance ceiling (a conceptual decode loop is sketched below)
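A conceptual sketch of such a decode loop follows. Every name, shape, and size below is an illustrative assumption rather than the released implementation; the point is that the LM emits discrete token stacks that go straight to a lightweight decoder, with no intermediate continuous latent for a separate DiT stage to re-model.

```python
import numpy as np

NUM_CODEBOOKS, VOCAB_SIZE = 4, 1024  # assumptions, for illustration only


def lm_step(history: list[np.ndarray]) -> np.ndarray:
    """Stand-in for the LM: predicts one token per codebook for the next frame."""
    return np.random.randint(0, VOCAB_SIZE, size=NUM_CODEBOOKS)


frames: list[np.ndarray] = []
for _ in range(36):                 # 3 s of speech at a 12Hz frame rate
    frames.append(lm_step(frames))  # autoregressive over whole token stacks

token_grid = np.stack(frames)       # shape (36, 4), fed directly to the decoder
```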

Model Series

Qwen3-TTS provides two parameter scales to meet different scenario needs:

1.7B Model Series

Ultimate Performance, Powerful Control

Qwen3-TTS-12Hz-1.7B-VoiceDesign

  • Performs voice design based on user-provided natural language descriptions
  • Can freely define acoustic attributes, persona, and background information
  • Creates unique customized voices

Qwen3-TTS-12Hz-1.7B-CustomVoice

  • Provides style control over target voices through user instructions
  • Supports 9 premium voices covering various combinations of gender, age, language, and dialect
  • Can flexibly control voice, emotion, prosody, and other multi-dimensional acoustic attributes through instructions

Qwen3-TTS-12Hz-1.7B-Base

  • Base model supporting rapid voice cloning from user-provided 3-second audio
  • Can serve as a base for fine-tuning custom models
  • Provides maximum flexibility and customization space

0.6B Model Series

Balanced Performance and Efficiency

Qwen3-TTS-12Hz-0.6B-CustomVoice

  • Supports 9 premium voices
  • Significantly reduces resource consumption while maintaining good results
  • Suitable for deployment on resource-constrained edge devices or mobile devices

Qwen3-TTS-12Hz-0.6B-Base

  • Base model supporting 3-second rapid voice cloning
  • Lower computational resource requirements
  • Suitable for high-concurrency deployment scenarios

Core Features

3-Second Rapid Voice Cloning

The voice cloning capability is particularly impressive (an illustrative flow follows the list):

  • Only 3 seconds of reference audio are needed for high-fidelity zero-shot voice replication
  • Cloned voices migrate seamlessly across languages
  • A voice cloned from Chinese audio can directly speak English, Japanese, Korean, and the other supported languages
  • The original speaker's voice characteristics are preserved throughout
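The stub below illustrates the cloning flow described above: embed a roughly 3-second reference clip, then synthesize in another language with the captured voice. Every class and method name is a hypothetical stand-in; consult the official repository for the real entry points.

```python
import numpy as np


class StubCloningModel:
    """Mock of a cloning-capable base checkpoint; shows the flow, not the API."""

    sample_rate = 24_000

    def clone(self, ref_audio: np.ndarray, sample_rate: int) -> np.ndarray:
        return np.zeros(256)  # pretend speaker embedding

    def speak(self, text: str, voice: np.ndarray, language: str) -> np.ndarray:
        return np.zeros(self.sample_rate)  # placeholder waveform


model = StubCloningModel()
ref_audio = np.zeros(3 * model.sample_rate)  # ~3 s reference clip
voice = model.clone(ref_audio, sample_rate=model.sample_rate)
wav = model.speak("Bonjour tout le monde !", voice=voice, language="fr")
```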

Cross-Language/Dialect Zero-Loss Migration

  • Supports 10 major languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
  • Supports multiple Chinese dialects: Sichuan dialect, Beijing dialect, etc.
  • Accurately restores accents and vocal character
  • Opens new possibilities for multilingual content creation and localization applications

Natural Language Voice Design

The Voice Design feature allows users to customize voices through natural-language instructions:

  • “Use a gentle, encouraging mature female voice to tell stories”
  • “Use an excited, high-pitched young male voice to commentate games”
  • Model automatically adjusts intonation, emotion, and rhythm
  • Generates highly personalized expressions

This “what you imagine is what you hear” control is particularly useful in audiobook production: a single narrator can voice multiple roles, complete with emotional shifts and dialect switching.
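As a hedged sketch of what this interface amounts to, the description string is the entire control surface. design_voice and VoicePreset below are hypothetical stand-ins for the VoiceDesign checkpoint's real API; the prompts mirror the examples above.

```python
from dataclasses import dataclass


@dataclass
class VoicePreset:
    description: str  # the model derives timbre, emotion, and rhythm from this


def design_voice(description: str) -> VoicePreset:
    return VoicePreset(description)  # stub; the real model returns a usable voice


narrator = design_voice("gentle, encouraging mature female voice for storytelling")
commentator = design_voice("excited, high-pitched young male voice for game commentary")
```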

Intelligent Context Understanding

The model has strong semantic understanding of input text:

  • Can automatically adjust tone, rhythm, and emotion based on input text
  • Adapts to different scenario needs
  • Significantly improved robustness to input text noise
  • Achieves humanized natural expression

Performance

Content Consistency (WER)

Qwen3-TTS performs strongly on content consistency, measured as word error rate (WER, lower is better); the sketch after the list shows how such a figure is computed:

  • Chinese: WER 0.77
  • English: WER 1.24
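For context, a WER figure like these is typically produced by transcribing the synthesized audio with an ASR system and comparing the transcript against the input text. The snippet below uses jiwer, a real and widely used Python package for this calculation; the strings are toy data.

```python
import jiwer

reference = "the quick brown fox jumps over the lazy dog"    # input text
hypothesis = "the quick brown fox jumped over the lazy dog"  # ASR transcript

print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")  # 1 sub / 9 words ≈ 11.11%
```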

Controllable Speech Generation

Qwen3-TTS-12Hz-1.7B-CustomVoice shows strong performance in the following metrics:

  • APS (Audio Prosody Similarity): High prosody similarity
  • DSD (Duration Similarity Distance): Precise duration control
  • RP (Rhythm Preservation): Excellent rhythm preservation

Voice Design

Qwen3-TTS-12Hz-1.7B-VoiceDesign achieves state-of-the-art (SOTA) performance on voice design tasks.

Speech Encoder

Qwen3-TTS-Tokenizer-12Hz shows excellent performance on the following metrics (the first two can be reproduced as sketched below):

  • PESQ: Perceptual Evaluation of Speech Quality
  • STOI: Short-Time Objective Intelligibility
  • UTMOS: Automatically predicted Mean Opinion Score
  • SIM: Speaker similarity
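The first two metrics can be reproduced locally with standard packages (pesq and pystoi are real Python libraries); the waveforms below are placeholders standing in for a reference clip and its tokenize-then-decode reconstruction.

```python
import numpy as np
from pesq import pesq
from pystoi import stoi

fs = 16_000                                   # PESQ wideband mode expects 16 kHz
t = np.arange(3 * fs) / fs
ref = np.sin(2 * np.pi * 220.0 * t)           # stand-in reference waveform
deg = ref + 0.01 * np.random.randn(ref.size)  # stand-in reconstruction

print("PESQ:", pesq(fs, ref, deg, "wb"))  # higher is better, max ≈ 4.64
print("STOI:", stoi(ref, deg, fs))        # intelligibility score in [0, 1]
```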

Application Scenarios

Intelligent Voice Assistants

  • Provide natural voice interaction for smart home devices and in-car systems
  • Support multiple languages and dialects
  • Enhance user experience

Content Creation

  • Quickly convert text to natural speech
  • Support multiple voices and emotional expressions
  • Suitable for audiobooks and video dubbing
  • One person plays multiple roles, producing high-quality audio content

Education

  • Provide multilingual, multi-voice speech output for language learning and online teaching
  • Enhance learning effectiveness
  • Support dialect teaching

Gaming and Entertainment

  • Generate personalized voices for game characters
  • Support emotion and tone adjustment
  • Enhance game immersion

Customer Service

  • Provide natural, friendly voice interaction for intelligent customer service
  • Support real-time dialogue
  • Reduce customer service costs

Live Streaming Interaction

  • Ultra-low latency meets real-time interaction needs
  • Support multilingual live streaming
  • Enhance audience experience

Technical Advantages

End-to-End Architecture

  • Avoids information bottlenecks of traditional cascaded architectures
  • Reduces cascading errors
  • Improves overall performance

Lightweight and Efficient

  • Non-DiT architecture improves computational efficiency while preserving high-fidelity reconstruction
  • 0.6B models are suited to edge-device deployment
  • 1.7B models target maximum quality

Open Source Friendly

  • Complete series open-sourced on GitHub and Hugging Face
  • Supports full-parameter fine-tuning
  • Developers can readily build brand-specific voice personas

Open Source and Availability

The full Qwen3-TTS model series is completely open-source, supporting the following (a minimal weight-download sketch follows the list):

  • Free commercial use
  • Local deployment
  • Secondary development
  • API calls
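For local deployment, the first step is fetching the weights. snapshot_download is the real huggingface_hub API, but the repository id below is an assumption; check the official release page for the exact name.

```python
from huggingface_hub import snapshot_download

# Assumed repo id, for illustration only; verify against the official release.
local_dir = snapshot_download(repo_id="Qwen/Qwen3-TTS-12Hz-0.6B-Base")
print("checkpoint downloaded to", local_dir)
```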


Technical Significance

The open-sourcing of Qwen3-TTS brings multiple breakthroughs to the voice synthesis field:

  1. Ultra-Low Latency: 97ms end-to-end latency approaches human conversational response times
  2. High-Fidelity Cloning: Voice cloning achievable with 3 seconds of audio
  3. Cross-Language Capability: Single voice supports seamless switching across 10 languages
  4. Natural Language Control: Voice design achievable through text descriptions
  5. Open Source Ecosystem: Significantly lowers barriers for real-time, personalized, multilingual voice AI

With the open-sourcing of Qwen3-TTS, the barriers to real-time, personalized, multilingual voice AI have dropped significantly. Content creators, developers, and enterprises alike stand to benefit from this new wave of voice interaction.