IndexTTS 1.5 Release: High-Quality Chinese and English Text-to-Speech Model

The IndexTTS team recently released IndexTTS 1.5, a new version of its GPT-style text-to-speech (TTS) model. The release significantly improves model stability and English speech synthesis, giving users more fluent and natural-sounding speech.

Key Features

IndexTTS 1.5 includes the following core features:

Chinese Pronunciation Optimization: Supports using pinyin to correct pronunciation of Chinese characters, ensuring accuracy of synthesized speech
Flexible Pause Control: Precisely control pauses at any position in speech through punctuation marks
High-Quality Audio: Integrates BigVGAN2 technology to optimize audio quality and voice timbre similarity
Bilingual Support: Supports both Chinese and English speech synthesis, with significantly improved English performance in the new version
Voice Cloning: Supports zero-shot voice cloning, requiring only 5-10 seconds of reference audio to achieve voice replication

Performance Results

The published results cover word error rate on a benchmark test set and a subjective evaluation of voice cloning:

Word Error Rate (WER) Testing

On the seed-test dataset, IndexTTS 1.5 achieved the best results (lower is better):

Chinese test: 0.821 (compared to human baseline 1.26)
English test: 1.606 (compared to human baseline 2.14)
Hard test: 6.565
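
For readers unfamiliar with the metric, the following is a minimal sketch of how word error rate is conventionally computed: the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. It is not the evaluation pipeline used for the numbers above. The function name word_error_rate is hypothetical, and whitespace tokenization is an assumption that suits English; Chinese text is usually scored per character instead.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word-level edit distance divided by the reference length.

    Illustrative only: tokenization is plain whitespace splitting, which is
    an assumption and not how any particular benchmark normalizes text.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current_row.append(
                min(
                    previous_row[j] + 1,                      # deletion
                    current_row[j - 1] + 1,                   # insertion
                    previous_row[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous_row = current_row

    return previous_row[-1] / len(ref_words)


if __name__ == "__main__":
    # One word ("the") is missing from a six-word reference.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Multiplying the result by 100 gives the rate as a percentage.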

Subjective Voice Cloning Scores

In subjective evaluation of voice cloning, IndexTTS achieved the highest scores in prosody (3.79), timbre (4.20), and quality (4.05), with an average score of 4.01.
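
As a quick check, the reported average is consistent with a plain arithmetic mean of the three component scores. The snippet below only reproduces that arithmetic; equal weighting of the three scores is an assumption.

```python
# Component scores as quoted in the text above.
scores = {"prosody": 3.79, "timbre": 4.20, "quality": 4.05}

# Unweighted mean (equal weighting is an assumption).
average = sum(scores.values()) / len(scores)

print(f"{average:.2f}")  # prints 4.01
```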

ComfyUI Integration

Users can easily use IndexTTS through ComfyUI:

Search for "IndexTTS" in the ComfyUI node manager for installation
Download model files to the models/TTS/Index-TTS directory
Upload a reference audio file that is 5-10 seconds long
Input the text to be synthesized to generate speech

The plugin requires approximately 8 GB of VRAM, which most consumer-grade graphics cards can provide.
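
Before uploading, it can be useful to confirm that a reference clip falls inside the 5-10 second range mentioned above. The sketch below uses only Python's standard wave module and assumes the clip is an uncompressed WAV file. The helper names wav_duration_seconds and is_suitable_reference are hypothetical and are not part of IndexTTS or ComfyUI.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def is_suitable_reference(
    path: str, min_seconds: float = 5.0, max_seconds: float = 10.0
) -> bool:
    """Check whether a clip's length lies within the recommended range.

    The default bounds mirror the 5-10 second guidance in the article;
    the check itself is an illustration, not a rule enforced by the plugin.
    """
    return min_seconds <= wav_duration_seconds(path) <= max_seconds
```

Other formats such as MP3 or FLAC would need a different reader, since the wave module handles WAV only.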

Online Experience

You can try IndexTTS online at: https://huggingface.co/spaces/IndexTeam/IndexTTS

Technical Architecture

IndexTTS builds on XTTS and Tortoise, using a Conformer conditioning encoder and a BigVGAN2 speech decoder. The model is trained on tens of thousands of hours of speech data.

For Chinese, the team introduced a hybrid character-pinyin modeling approach that lets users quickly correct mispronounced characters. This matters for Chinese TTS because many characters have more than one reading, and the correct one depends on context.
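
To illustrate the idea of mixing characters and pinyin in one input, here is a minimal sketch that replaces selected characters with a pinyin string while leaving the rest of the text unchanged. Everything in it is an assumption made for illustration: the function name mix_characters_and_pinyin, the position-based override format, and the "hang2" style of pinyin with a tone digit. It does not reproduce the model's tokenizer or its actual input syntax.

```python
def mix_characters_and_pinyin(text: str, overrides: dict[int, str]) -> list[str]:
    """Build a token list where chosen characters are swapped for pinyin.

    `overrides` maps a character position in `text` to the pinyin that
    should replace it. Both the mapping format and the pinyin notation
    are hypothetical.
    """
    for index in overrides:
        if not 0 <= index < len(text):
            raise IndexError(f"override position {index} is outside the text")
    return [overrides.get(index, char) for index, char in enumerate(text)]


if __name__ == "__main__":
    # The character at position 3 has several readings; force one of them.
    print(mix_characters_and_pinyin("他在银行工作", {3: "hang2"}))
```

Keeping unmodified characters as they are means a user only has to annotate the few characters that were mispronounced, not the whole sentence.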

Development Timeline

May 14, 2025: Released IndexTTS 1.5, significantly improving model stability and English performance
March 25, 2025: Released IndexTTS 1.0 model parameters and inference code
February 12, 2025: Submitted the paper to arXiv and released demos and test sets

IndexTTS is developed by a team dedicated to advancing speech synthesis technology. The open-source nature of this project provides strong support for research and application development in the speech synthesis field.