IndexTTS 1.5 Release: High-Quality Chinese and English Text-to-Speech Model
The IndexTTS team recently released IndexTTS 1.5, a GPT-style text-to-speech (TTS) model. The new version significantly improves model stability and English speech synthesis, giving users more fluent and natural-sounding speech.
Key Features
IndexTTS 1.5 includes the following core features:
- Chinese Pronunciation Optimization: Supports using pinyin to correct pronunciation of Chinese characters, ensuring accuracy of synthesized speech
- Flexible Pause Control: Precisely control pauses at any position in speech through punctuation marks
- High-Quality Audio: Integrates BigVGAN2 technology to optimize audio quality and voice timbre similarity
- Bilingual Support: Supports both Chinese and English speech synthesis, with significantly improved English performance in the new version
- Voice Cloning: Supports zero-shot voice cloning, requiring only 5-10 seconds of reference audio to achieve voice replication
Performance Results
IndexTTS 1.5 demonstrates excellent performance across multiple benchmark tests:
Word Error Rate (WER) Testing
On the seed-test dataset, IndexTTS 1.5 achieved the best results (error rate in %, lower is better):
- Chinese test: 0.821 (human baseline: 1.26)
- English test: 1.606 (human baseline: 2.14)
- Hard test: 6.565
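For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between a reference transcript and the recognized transcript of the synthesized audio, divided by the number of reference words. The sketch below is a generic illustration of that formula, not the seed-test evaluation pipeline: the function name and the whitespace tokenization are assumptions, and Chinese evaluations typically score characters rather than whitespace-separated words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
rate = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{100 * rate:.1f}%")
```

Multiplying the result by 100 gives a percentage comparable in form to the figures listed above.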
Speaker Similarity Scores
In a subjective evaluation of voice cloning, IndexTTS achieved the highest scores in prosody (3.79), timbre (4.20), and quality (4.05), for an average score of 4.01.
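As a quick arithmetic check, the reported average is consistent with an unweighted mean of the three category scores. That the average was computed this way is an assumption; the variable names are illustrative only.

```python
from statistics import mean

# Category scores as reported in the text.
scores = {"prosody": 3.79, "timbre": 4.20, "quality": 4.05}

# Assumption: the overall figure is a simple unweighted mean, rounded to 2 places.
average = round(mean(scores.values()), 2)
print(average)
```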
ComfyUI Integration
Users can easily use IndexTTS through ComfyUI:
- Search for “IndexTTS” in the ComfyUI node manager for installation
- Download the model files to the models/TTS/Index-TTS directory
- Upload a reference audio file of 5-10 seconds
- Input the text to be synthesized to generate speech
The plugin requires approximately 8GB of VRAM, suitable for most consumer-grade graphics cards.
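Before uploading, it can help to confirm that the reference clip falls within the suggested 5-10 second range. The following is a minimal sketch using only the Python standard library; it handles uncompressed PCM WAV files only, and the function names and default limits are illustrative assumptions rather than part of the ComfyUI plugin.

```python
import wave

MIN_SECONDS = 5.0
MAX_SECONDS = 10.0


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_valid_reference_clip(
    path: str,
    min_seconds: float = MIN_SECONDS,
    max_seconds: float = MAX_SECONDS,
) -> bool:
    """Check that the clip length lies within the suggested range."""
    return min_seconds <= wav_duration_seconds(path) <= max_seconds
```

For example, `is_valid_reference_clip("speaker.wav")` (a hypothetical file name) returns `False` for a 3-second clip, signaling that a longer recording is needed.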
Online Experience
You can try IndexTTS online through its Hugging Face Space: https://huggingface.co/spaces/IndexTeam/IndexTTS
Technical Architecture
IndexTTS builds on XTTS and Tortoise, using a Conformer conditioning encoder and a BigVGAN2 speech decoder. The model is trained on tens of thousands of hours of speech data.
For Chinese, the team introduced a hybrid character-pinyin modeling approach that lets users quickly correct mispronounced characters, an important capability for Chinese TTS applications.
Development Timeline
- May 14, 2025: Released IndexTTS 1.5, significantly improving model stability and English performance
- March 25, 2025: Released the IndexTTS 1.0 model parameters and inference code
- February 12, 2025: Submitted the paper to arXiv and released demos and test sets
IndexTTS is developed by a team dedicated to advancing speech synthesis technology. The open-source nature of this project provides strong support for research and application development in the speech synthesis field.