Tencent Releases HunyuanCustom Multimodal Video Generation System
Tencent has unveiled HunyuanCustom, a multimodal video customization framework capable of maintaining subject consistency while supporting various input conditions, including text, images, audio, and video. The technology has been open-sourced along with its models and code, bringing new possibilities to video content creation.
Technical Innovation
Built upon the HunyuanVideo generation framework, HunyuanCustom addresses two major challenges in current video generation technology: maintaining identity consistency and supporting only a limited range of input modalities. The framework introduces several key innovations:
- Text-Image Fusion Module: builds on LLaVA to enhance multimodal understanding of the text and image inputs
- Image ID Enhancement Module: leverages temporal concatenation to reinforce identity features across frames (see the sketch after this list)
- Modality-Specific Condition Injection Mechanisms:
  - AudioNet Module: achieves hierarchical alignment via spatial cross-attention
  - Video-Driven Injection Module: integrates conditional video through a patchify-based feature-alignment network
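To make the identity-reinforcement idea concrete, here is a minimal PyTorch sketch of temporal concatenation, assuming the reference image is encoded by the same VAE as the video and its latent is prepended along the temporal axis. The function name and tensor shapes are illustrative assumptions, not Tencent's implementation.

```python
import torch

def concat_identity_latent(video_latents: torch.Tensor,
                           image_latent: torch.Tensor) -> torch.Tensor:
    """Prepend the reference-image latent along the temporal axis.

    video_latents: (B, C, T, H, W) latents of the video being denoised
    image_latent:  (B, C, H, W)    latent of the subject reference image
    """
    # Treat the image as a one-frame "video" so the diffusion transformer
    # can attend to identity features at every denoising step.
    identity_frame = image_latent.unsqueeze(2)                 # (B, C, 1, H, W)
    return torch.cat([identity_frame, video_latents], dim=2)   # (B, C, T+1, H, W)

# Example with made-up latent dimensions:
video_latents = torch.randn(1, 16, 33, 45, 80)
image_latent = torch.randn(1, 16, 45, 80)
print(concat_identity_latent(video_latents, image_latent).shape)
# torch.Size([1, 16, 34, 45, 80])
```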
These innovations enable HunyuanCustom to significantly outperform existing open- and closed-source methods in identity consistency, realism, and text-video alignment.
Multimodal Video Customization Capabilities
HunyuanCustom supports various forms of input, specifically including:
- Text and Image Input: accepts single or multiple images for customized video generation of one or more subjects
- Audio Input: an additional audio track can drive the subject to speak the given content (see the cross-attention sketch after this list)
- Video Input: specific objects in a source video can be replaced with the subject from a provided image
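For illustration, the audio-driven path can be pictured as cross-attention in which video tokens query audio features. The sketch below is a toy single-layer version under assumed shapes; the class name AudioCrossAttention is hypothetical, and the actual AudioNet performs hierarchical alignment across multiple layers.

```python
import torch
import torch.nn as nn

class AudioCrossAttention(nn.Module):
    """Toy spatial cross-attention: video tokens attend to audio features."""

    def __init__(self, dim: int, audio_dim: int, heads: int = 8):
        super().__init__()
        self.to_kv = nn.Linear(audio_dim, dim)  # project audio into token space
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens: torch.Tensor,
                audio_feats: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, N, dim)      flattened spatial tokens
        # audio_feats:  (B, M, audio_dim) audio embeddings
        kv = self.to_kv(audio_feats)
        out, _ = self.attn(video_tokens, kv, kv)
        return video_tokens + out  # residual injection of audio information

block = AudioCrossAttention(dim=1024, audio_dim=768)
out = block(torch.randn(1, 4096, 1024), torch.randn(1, 50, 768))
print(out.shape)  # torch.Size([1, 4096, 1024])
```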
Application Scenarios
The multimodal capabilities of HunyuanCustom support various downstream tasks:
- Virtual Human Advertisements: Creating product showcase videos by inputting multiple images
- Virtual Try-On: Generating videos of people wearing specific clothing
- Singing Avatars: Creating virtual characters that sing by combining image and audio
- Video Editing: Using image and video as inputs to replace subjects in videos
Performance Comparison
HunyuanCustom was compared with state-of-the-art video customization methods including VACE, Skyreels, Pika, Vidu, Keling, and Hailuo. The evaluation focused on face/subject consistency, video-text alignment, and overall video quality.
In terms of key metrics, HunyuanCustom demonstrated significant advantages:
- Face Similarity (Face-Sim): 0.627 (Ranked 1st)
- DINO Similarity: 0.593 (Ranked 1st)
- Temporal Consistency: 0.958 (Close to the best)
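For context on these metrics: DINO similarity is commonly computed as the mean cosine similarity between DINO ViT features of the reference subject image and each generated frame. The sketch below follows that common recipe using the public DINO ViT-S/16 checkpoint; HunyuanCustom's exact evaluation protocol may differ.

```python
import torch
import torch.nn.functional as F

# Public self-supervised DINO ViT-S/16 backbone from the official repo.
model = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
model.eval()

@torch.no_grad()
def dino_similarity(reference: torch.Tensor, frames: torch.Tensor) -> float:
    """Mean cosine similarity between the reference image and each frame.

    reference: (1, 3, 224, 224) normalized reference subject image
    frames:    (T, 3, 224, 224) normalized generated frames
    """
    ref_feat = F.normalize(model(reference), dim=-1)   # (1, D)
    frame_feats = F.normalize(model(frames), dim=-1)   # (T, D)
    return (frame_feats @ ref_feat.T).mean().item()
```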
Open-Source Plan
Tencent released the inference code and model weights of HunyuanCustom on May 8, 2025. According to the open-source plan, the team will progressively release:
- Single-Subject Video Customization
  - Inference code (already released)
  - Model checkpoints (already released)
  - ComfyUI plugin
- Audio-Driven Video Customization
- Video-Driven Video Customization
- Multi-Subject Video Customization
System Requirements
The system requirements for generating videos with the HunyuanCustom model are as follows:
| Model | Setting (height × width × frames) | GPU Peak Memory |
|---|---|---|
| HunyuanCustom | 720px × 1280px × 129f | 80GB |
| HunyuanCustom | 512px × 896px × 129f | 60GB |
- Minimum requirement: a GPU with at least 24GB of VRAM can generate 720p video, but very slowly
- Recommended configuration: a GPU with 80GB of memory for better generation quality
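As a practical aid, the helper below, grounded in the table above, suggests a generation setting based on detected GPU memory. The function and thresholds are illustrative, not part of the official tooling.

```python
import torch

def pick_setting() -> tuple[int, int, int]:
    """Return a (height, width, frames) setting for the detected GPU."""
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if total_gb >= 80:
        return (720, 1280, 129)  # 80GB peak memory: full 720p setting
    if total_gb >= 60:
        return (512, 896, 129)   # 60GB peak memory: reduced resolution
    # Below 60GB, fall back to the smaller setting; per the notes above,
    # even ~24GB of VRAM can generate 720p video, but very slowly.
    return (512, 896, 129)

if torch.cuda.is_available():
    print("Suggested setting:", pick_setting())
```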