
ThinkSound: A New Paradigm for Multimodal Audio Generation and Editing

ThinkSound is an open-source multimodal audio generation and editing framework from Tongyi Lab, and the first to bring Chain-of-Thought (CoT) reasoning into audio generation. The framework generates and edits audio from video, text, and audio inputs, offering high fidelity, strong audiovisual synchronization, and interactive control, so that AI can “think and create sound like a human sound designer.”

Key Features

  • Any2Audio: Supports audio generation from any modality input, including video, text, and audio (see the interface sketch after this list).
  • Chain-of-Thought Reasoning: Uses multimodal large language models (MLLMs) for step-by-step reasoning, improving temporal and semantic consistency between sound, visuals, and text.
  • Interactive Object-Level Editing: Enables refinement or editing of specific sound events through video object clicking or text instructions.
  • Unified Framework: A single model supports generation, refinement, editing, and interactive workflows.
  • High Fidelity and Strong Synchronization: Strong results on standard video-to-audio (V2A) and film sound-effect benchmarks.
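
To make “Any2Audio” concrete, the sketch below shows what a single entry point accepting any combination of modalities could look like. Every name here is a hypothetical placeholder for illustration, not ThinkSound’s actual API.

```python
# Hypothetical interface sketch; class and function names are illustrative
# and do NOT reflect ThinkSound's actual API.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AudioRequest:
    video_path: Optional[str] = None   # optional video conditioning
    text_prompt: Optional[str] = None  # optional text conditioning
    audio_path: Optional[str] = None   # optional reference audio to edit


def generate_audio(request: AudioRequest) -> List[float]:
    """Route any combination of input modalities to one generation model."""
    if not (request.video_path or request.text_prompt or request.audio_path):
        raise ValueError("at least one input modality is required")
    # A real system would encode each present modality, fuse the features,
    # and decode a waveform; here we return a silent placeholder.
    return [0.0] * 16000  # one second of silence at 16 kHz
```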

Technical Highlights and Workflow

ThinkSound divides audio generation and editing into three stages (a code sketch of the full pipeline follows the list):

  1. Overall Soundscape Generation: Generates basic soundscape from video, ensuring semantic and temporal alignment.
  2. Object-Level Refinement: Focuses on specific sound source areas in the video to generate dedicated sounds.
  3. Instruction-Level Editing: Interactively edits audio content based on user natural language instructions.
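
The following is a minimal sketch of how the three stages could chain together; every function is a hypothetical placeholder standing in for ThinkSound’s components, not the project’s real code.

```python
# Illustrative three-stage pipeline; all functions below are hypothetical
# placeholders, not ThinkSound's actual implementation.
from typing import List, Tuple

Audio = List[float]  # placeholder waveform type


def generate_soundscape(video_path: str) -> Audio:
    """Stage 1: base soundscape, semantically and temporally aligned to video."""
    return [0.0] * 16000  # placeholder: silence


def refine_object(audio: Audio, video_path: str,
                  region: Tuple[int, int, int, int]) -> Audio:
    """Stage 2: re-generate sound for one clicked object region in the video."""
    return audio  # placeholder: unchanged mix


def edit_with_instruction(audio: Audio, instruction: str) -> Audio:
    """Stage 3: apply a natural-language edit to the current audio."""
    return audio  # placeholder: unchanged mix


# Hypothetical usage mirroring the three stages above:
base = generate_soundscape("street_scene.mp4")
refined = refine_object(base, "street_scene.mp4", region=(120, 80, 200, 160))
final = edit_with_instruction(refined, "make the car horn louder, add light rain")
```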

Figure: Method overview. ThinkSound supports audio generation from any modality input with interactive editing capabilities.

Figure: Technical architecture. Multimodal large language models work in conjunction with a flow-matching audio generation model.
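
To make the flow-matching half of that pairing concrete, here is a generic Euler-integration sampler over a learned velocity field. This is a minimal sketch of flow-matching sampling in general, with the trained network replaced by a dummy function; it assumes nothing about ThinkSound’s actual architecture, and all names are illustrative.

```python
import numpy as np


def velocity_field(x: np.ndarray, t: float, cond: np.ndarray) -> np.ndarray:
    """Stand-in for a learned flow-matching network v_theta(x, t | cond).
    Dummy dynamics: a linear pull toward the conditioning vector."""
    return cond - x  # placeholder, NOT a trained model


def sample_flow_matching(cond: np.ndarray, steps: int = 50) -> np.ndarray:
    """Integrate dx/dt = v(x, t) from Gaussian noise (t=0) to a sample (t=1)
    with simple Euler steps -- the generic flow-matching sampling recipe."""
    x = np.random.randn(*cond.shape)  # start from noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        x = x + dt * velocity_field(x, t, cond)
    return x


# Hypothetical usage: `cond` stands in for fused MLLM-derived features.
cond = np.random.randn(256)
latent_audio = sample_flow_matching(cond)
```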

Dataset and Open Source

Tongyi Lab has built AudioCoT, a multimodal audio dataset with chain-of-thought annotations. It covers a wide range of real-world scenarios, including animals, machinery, and environmental sounds, emphasizes data quality, and supports both object-level and instruction-level interactive editing.
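
As an illustration only, a chain-of-thought annotation record might look roughly like the following; the field names and values are invented for this sketch and are not AudioCoT’s actual schema.

```python
# Invented example record; NOT AudioCoT's real format.
audiocot_record = {
    "video": "farm_clip_0042.mp4",
    "caption": "A rooster crows while a tractor idles nearby",
    "chain_of_thought": [
        "Identify sources: rooster (foreground), tractor engine (background)",
        "Order events: crows at ~0.5s and ~2.1s; engine hum is continuous",
        "Plan mix: loop low-frequency engine bed, overlay sharp crow transients",
    ],
    "object_edits": [
        {"object": "rooster", "instruction": "raise volume slightly"},
    ],
}
```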

Evaluation and Applications

ThinkSound significantly outperforms mainstream methods (such as MMAudio, V2A-Mapper, V-AURA, and MovieGen Audio) on core metrics of standard benchmarks including VGGSound and MovieGen Audio Bench, demonstrating broad application potential in film sound effects, gaming, virtual reality, and other fields.

Images and content are partially referenced from the official project page and paper, for technical introduction and learning exchange only. Please contact the original authors for any inquiries.