
Alibaba Qwen Releases Qwen3-TTS - 97ms Ultra-Low Latency Voice Synthesis Model

On January 22, 2026, the Alibaba Qwen team officially open-sourced the Qwen3-TTS series of voice generation models: a voice synthesis system that supports voice cloning, voice design, high-quality human-like speech generation, and natural-language voice control. The release is regarded as a major breakthrough in the voice synthesis field.

Core Innovation

Dual-Track Modeling

The core innovation of Qwen3-TTS is its Dual-Track hybrid streaming generation mechanism. Combined with a discrete multi-codebook language model, it models speech end to end, avoiding the information bottlenecks of traditional cascaded architectures (such as LM+DiT).

This innovative architecture achieves:

  • Ultra-Low Latency: End-to-end synthesis latency as low as 97ms
  • Instant Response: The first audio packet can be emitted after just one character of input
  • Dual-Mode Support: A single model handles both streaming and non-streaming generation

This response speed approaches that of human conversation, making the model a strong fit for latency-sensitive scenarios such as live interaction, real-time translation, and AI customer service.
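To make the dual-mode design concrete, here is a minimal Python sketch of how the two generation modes might be consumed. MockQwen3TTS, its method names, and the 80ms packet size are illustrative placeholders, not the official API; the real interface is defined in the Qwen3-TTS repository.

```python
import time
from typing import Iterator

import numpy as np


class MockQwen3TTS:
    """Stand-in model: yields short audio packets as they are 'generated'."""

    SAMPLE_RATE = 24_000
    PACKET = int(0.08 * SAMPLE_RATE)  # one ~80ms audio packet

    def stream(self, text: str) -> Iterator[np.ndarray]:
        # Streaming mode: emit the first packet as soon as it is ready.
        for _ in text:
            yield np.zeros(self.PACKET, dtype=np.float32)

    def synthesize(self, text: str) -> np.ndarray:
        # Non-streaming mode: same model, full waveform in one call.
        return np.concatenate(list(self.stream(text)))


tts = MockQwen3TTS()
t0 = time.perf_counter()
first_packet = next(tts.stream("Hello"))  # time-to-first-audio
print(f"first packet after {(time.perf_counter() - t0) * 1e3:.2f} ms")
full_audio = tts.synthesize("Hello")      # blocks until the whole clip is ready
```

The streaming path is what makes the 97ms figure meaningful: playback can begin after the first packet, long before the full clip exists.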

Qwen3-TTS-Tokenizer-12Hz

The model relies on the Qwen3-TTS-Tokenizer-12Hz multi-codebook speech encoder, which compresses speech signals efficiently while preserving a rich representation:

  • Fully preserves paralinguistic information (such as intonation, rhythm, and emotion)
  • Preserves acoustic-environment characteristics
  • Reconstructs speech quickly and with high fidelity through a lightweight non-DiT architecture (a rough compression estimate follows the list)
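To see why a 12Hz token rate is so compact, here is a back-of-envelope estimate; the codebook count and codebook size are assumptions made purely for illustration, as the release defines the real values.

```python
# Rough compression estimate for a 12Hz multi-codebook tokenizer.
FRAME_RATE_HZ = 12        # stated token frame rate
NUM_CODEBOOKS = 4         # assumption, for illustration only
BITS_PER_TOKEN = 10       # assumption: 2**10 = 1024 entries per codebook

token_bitrate = FRAME_RATE_HZ * NUM_CODEBOOKS * BITS_PER_TOKEN  # 480 bit/s
pcm_bitrate = 24_000 * 16                                       # 384,000 bit/s
print(f"~{pcm_bitrate / token_bitrate:.0f}x smaller than 24 kHz 16-bit PCM")
```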

Discrete Multi-Codebook LM Architecture

Qwen3-TTS adopts a discrete multi-codebook language model (LM) architecture that models the full speech signal end to end:

  • Avoids the information bottlenecks of traditional LM+DiT pipelines
  • Eliminates cascading errors between stages
  • Significantly raises versatility, generation efficiency, and the performance ceiling (a conceptual decode loop is sketched below)
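A conceptual sketch of such a decode loop follows. Every name, shape, and size below is an illustrative assumption rather than the released implementation; the point is that the LM emits discrete token stacks that go straight to a lightweight decoder, with no intermediate continuous latent for a separate DiT stage to re-model.

```python
import numpy as np

NUM_CODEBOOKS, VOCAB_SIZE = 4, 1024  # assumptions, for illustration only


def lm_step(history: list[np.ndarray]) -> np.ndarray:
    """Stand-in for the LM: predicts one token per codebook for the next frame."""
    return np.random.randint(0, VOCAB_SIZE, size=NUM_CODEBOOKS)


frames: list[np.ndarray] = []
for _ in range(36):                 # 3 s of speech at a 12Hz frame rate
    frames.append(lm_step(frames))  # autoregressive over whole token stacks

token_grid = np.stack(frames)       # shape (36, 4), fed directly to the decoder
```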

Model Series

Qwen3-TTS provides two parameter scales to meet different scenario needs:

1.7B Model Series

Ultimate Performance, Powerful Control

Qwen3-TTS-12Hz-1.7B-VoiceDesign

  • Performs voice design based on user-provided natural language descriptions
  • Can freely define acoustic attributes, persona, and background information
  • Creates unique customized voices

Qwen3-TTS-12Hz-1.7B-CustomVoice

  • Provides style control over target voices through user instructions
  • Supports 9 premium voices covering various combinations of gender, age, language, and dialect
  • Can flexibly control voice, emotion, prosody, and other multi-dimensional acoustic attributes through instructions

Qwen3-TTS-12Hz-1.7B-Base

  • Base model supporting rapid voice cloning from user-provided 3-second audio
  • Can serve as a base for fine-tuning custom models
  • Provides maximum flexibility and customization space

0.6B Model Series

Balanced Performance and Efficiency

Qwen3-TTS-12Hz-0.6B-CustomVoice

  • Supports 9 premium voices
  • Significantly reduces resource consumption while maintaining good results
  • Suitable for deployment on resource-constrained edge devices or mobile devices

Qwen3-TTS-12Hz-0.6B-Base

  • Base model supporting 3-second rapid voice cloning
  • Lower computational resource requirements
  • Suitable for high-concurrency deployment scenarios

Core Features

3-Second Rapid Voice Cloning

The voice cloning capability is particularly impressive (an illustrative flow follows the list):

  • Only 3 seconds of reference audio are needed for high-fidelity zero-shot voice replication
  • Cloned voices migrate seamlessly across languages
  • A voice cloned from Chinese audio can directly speak English, Japanese, Korean, and the other supported languages
  • The original speaker's voice characteristics are preserved throughout
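The stub below illustrates the cloning flow described above: embed a roughly 3-second reference clip, then synthesize in another language with the captured voice. Every class and method name is a hypothetical stand-in; consult the official repository for the real entry points.

```python
import numpy as np


class StubCloningModel:
    """Mock of a cloning-capable base checkpoint; shows the flow, not the API."""

    sample_rate = 24_000

    def clone(self, ref_audio: np.ndarray, sample_rate: int) -> np.ndarray:
        return np.zeros(256)  # pretend speaker embedding

    def speak(self, text: str, voice: np.ndarray, language: str) -> np.ndarray:
        return np.zeros(self.sample_rate)  # placeholder waveform


model = StubCloningModel()
ref_audio = np.zeros(3 * model.sample_rate)  # ~3 s reference clip
voice = model.clone(ref_audio, sample_rate=model.sample_rate)
wav = model.speak("Bonjour tout le monde !", voice=voice, language="fr")
```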

Cross-Language/Dialect Zero-Loss Migration

  • Supports 10 major languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian
  • Supports multiple Chinese dialects: Sichuan dialect, Beijing dialect, etc.
  • Accurately restores accents and vocal character
  • Opens new possibilities for multilingual content creation and localization applications

Natural Language Voice Design

The Voice Design feature allows users to customize voices through natural-language instructions:

  • “Use a gentle, encouraging mature female voice to tell stories”
  • “Use an excited, high-pitched young male voice to commentate games”
  • Model automatically adjusts intonation, emotion, and rhythm
  • Generates highly personalized expressions

This “what you imagine is what you hear” control is particularly useful in audiobook production: a single narrator can voice multiple roles, complete with emotional shifts and dialect switching.
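As a hedged sketch of what this interface amounts to, the description string is the entire control surface. design_voice and VoicePreset below are hypothetical stand-ins for the VoiceDesign checkpoint's real API; the prompts mirror the examples above.

```python
from dataclasses import dataclass


@dataclass
class VoicePreset:
    description: str  # the model derives timbre, emotion, and rhythm from this


def design_voice(description: str) -> VoicePreset:
    return VoicePreset(description)  # stub; the real model returns a usable voice


narrator = design_voice("gentle, encouraging mature female voice for storytelling")
commentator = design_voice("excited, high-pitched young male voice for game commentary")
```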

Intelligent Context Understanding

The model has strong semantic understanding of input text:

  • Can automatically adjust tone, rhythm, and emotion based on input text
  • Adapts to different scenario needs
  • Significantly improved robustness to input text noise
  • Achieves humanized natural expression

Performance

Content Consistency (WER)

Qwen3-TTS performs strongly on content consistency, measured as word error rate (WER, lower is better); the sketch after the list shows how such a figure is computed:

  • Chinese: WER 0.77
  • English: WER 1.24
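For context, a WER figure like these is typically produced by transcribing the synthesized audio with an ASR system and comparing the transcript against the input text. The snippet below uses jiwer, a real and widely used Python package for this calculation; the strings are toy data.

```python
import jiwer

reference = "the quick brown fox jumps over the lazy dog"    # input text
hypothesis = "the quick brown fox jumped over the lazy dog"  # ASR transcript

print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")  # 1 sub / 9 words ≈ 11.11%
```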

Controllable Speech Generation

Qwen3-TTS-12Hz-1.7B-CustomVoice shows strong performance in the following metrics:

  • APS (Audio Prosody Similarity): High prosody similarity
  • DSD (Duration Similarity Distance): Precise duration control
  • RP (Rhythm Preservation): Excellent rhythm preservation

Voice Design

Qwen3-TTS-12Hz-1.7B-VoiceDesign achieves state-of-the-art (SOTA) performance on voice design tasks.

Speech Encoder

Qwen3-TTS-Tokenizer-12Hz shows excellent performance on the following metrics (the first two can be reproduced as sketched below):

  • PESQ: Perceptual Evaluation of Speech Quality
  • STOI: Short-Time Objective Intelligibility
  • UTMOS: Automatically predicted Mean Opinion Score
  • SIM: Speaker similarity
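The first two metrics can be reproduced locally with standard packages (pesq and pystoi are real Python libraries); the waveforms below are placeholders standing in for a reference clip and its tokenize-then-decode reconstruction.

```python
import numpy as np
from pesq import pesq
from pystoi import stoi

fs = 16_000                                   # PESQ wideband mode expects 16 kHz
t = np.arange(3 * fs) / fs
ref = np.sin(2 * np.pi * 220.0 * t)           # stand-in reference waveform
deg = ref + 0.01 * np.random.randn(ref.size)  # stand-in reconstruction

print("PESQ:", pesq(fs, ref, deg, "wb"))  # higher is better, max ≈ 4.64
print("STOI:", stoi(ref, deg, fs))        # intelligibility score in [0, 1]
```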

Application Scenarios

Intelligent Voice Assistants

  • Provide natural voice interaction for smart home devices and in-car systems
  • Support multiple languages and dialects
  • Enhance user experience

Content Creation

  • Quickly convert text to natural speech
  • Support multiple voices and emotional expressions
  • Suitable for audiobooks and video dubbing
  • One person plays multiple roles, producing high-quality audio content

Education

  • Provide multilingual, multi-voice speech output for language learning and online teaching
  • Enhance learning effectiveness
  • Support dialect teaching

Gaming and Entertainment

  • Generate personalized voices for game characters
  • Support emotion and tone adjustment
  • Enhance game immersion

Customer Service

  • Provide natural, friendly voice interaction for intelligent customer service
  • Support real-time dialogue
  • Reduce customer service costs

Live Streaming Interaction

  • Ultra-low latency meets real-time interaction needs
  • Support multilingual live streaming
  • Enhance audience experience

Technical Advantages

End-to-End Architecture

  • Avoids information bottlenecks of traditional cascaded architectures
  • Reduces cascading errors
  • Improves overall performance

Lightweight and Efficient

  • Non-DiT architecture improves computational efficiency while preserving high-fidelity reconstruction
  • 0.6B models are suited to edge-device deployment
  • 1.7B models target maximum quality

Open Source Friendly

  • Complete series open-sourced on GitHub and Hugging Face
  • Supports full-parameter fine-tuning
  • Developers can readily build brand-specific voice personas

Open Source and Availability

The full Qwen3-TTS model series is completely open-source, supporting the following (a minimal weight-download sketch follows the list):

  • Free commercial use
  • Local deployment
  • Secondary development
  • API calls
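For local deployment, the first step is fetching the weights. snapshot_download is the real huggingface_hub API, but the repository id below is an assumption; check the official release page for the exact name.

```python
from huggingface_hub import snapshot_download

# Assumed repo id, for illustration only; verify against the official release.
local_dir = snapshot_download(repo_id="Qwen/Qwen3-TTS-12Hz-0.6B-Base")
print("checkpoint downloaded to", local_dir)
```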


Technical Significance

The open-sourcing of Qwen3-TTS brings multiple breakthroughs to the voice synthesis field:

  1. Ultra-Low Latency: 97ms end-to-end latency approaches human conversational response times
  2. High-Fidelity Cloning: Voice cloning achievable with 3 seconds of audio
  3. Cross-Language Capability: Single voice supports seamless switching across 10 languages
  4. Natural Language Control: Voice design achievable through text descriptions
  5. Open Source Ecosystem: Significantly lowers barriers for real-time, personalized, multilingual voice AI

With the open-sourcing of Qwen3-TTS, the barriers to real-time, personalized, multilingual voice AI have dropped significantly. Content creators, developers, and enterprises alike stand to benefit from this new wave of voice interaction.