
Tencent Releases HunyuanCustom Multimodal Video Generation System

Tencent has recently unveiled HunyuanCustom, a multimodal video customization framework that maintains subject consistency while supporting a range of input conditions, including text, images, audio, and video. The technology has been open-sourced along with the relevant models and code, bringing new possibilities to video content creation.

(Figure: HunyuanCustom overall architecture)

Technical Innovation

Built upon the Hunyuan Video generation framework, HunyuanCustom focuses on addressing two major challenges in current video generation technology: identity consistency and limited input modalities. The technology introduces several key innovations:

  1. Text-Image Fusion Module: Based on LLaVA technology, enhancing multimodal understanding capabilities
  2. Image ID Enhancement Module: Leveraging temporal concatenation to reinforce identity features across frames (see the sketch after this list)
  3. Modality-Specific Condition Injection Mechanisms:
    • AudioNet Module: Achieving hierarchical audio-visual alignment via spatial cross-attention (a minimal sketch follows the paragraph below)
    • Video-Driven Injection Module: Integrating conditional video through a patchify-based feature-alignment network
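
To make the temporal-concatenation idea in the Image ID Enhancement Module concrete, here is a minimal PyTorch sketch. It is an illustration under assumed latent shapes, not the released implementation; the function name and tensor sizes are invented for the example.

```python
import torch

def inject_identity_latent(video_latent: torch.Tensor,
                           ref_image_latent: torch.Tensor) -> torch.Tensor:
    """Prepend a reference-image latent to the video latents along the
    temporal axis so identity features are visible at every denoising step.

    video_latent:     (B, C, T, H, W) noisy video latents
    ref_image_latent: (B, C, 1, H, W) clean latent of the subject image
    """
    # Temporal concatenation: the identity frame becomes "frame 0"
    # of the sequence the diffusion transformer denoises.
    return torch.cat([ref_image_latent, video_latent], dim=2)

# Illustrative shapes only
video = torch.randn(1, 16, 33, 45, 80)
ref = torch.randn(1, 16, 1, 45, 80)
print(inject_identity_latent(video, ref).shape)  # torch.Size([1, 16, 34, 45, 80])
```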

These innovations enable HunyuanCustom to significantly outperform existing open-source and closed-source methods in identity consistency, realism, and text-video alignment.
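
The AudioNet injection described above can be pictured as frame-aligned cross-attention, where each frame's spatial tokens query the audio features for that frame. The sketch below is one plausible reading under assumed shapes; the module name, dimensions, and hierarchical alignment details in HunyuanCustom may differ.

```python
import torch
import torch.nn as nn

class AudioCrossAttention(nn.Module):
    """Illustrative frame-aligned audio injection via cross-attention."""
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, video_tokens, audio_tokens):
        # video_tokens: (B*T, N, D) spatial tokens, one frame per row
        # audio_tokens: (B*T, M, D) audio features aligned to that frame
        attended, _ = self.attn(query=self.norm(video_tokens),
                                key=audio_tokens, value=audio_tokens)
        return video_tokens + attended  # residual injection

layer = AudioCrossAttention()
v = torch.randn(2, 256, 512)  # 2 frames, 256 spatial tokens each
a = torch.randn(2, 32, 512)   # 32 audio tokens per frame
print(layer(v, a).shape)      # torch.Size([2, 256, 512])
```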

Multimodal Video Customization Capabilities

HunyuanCustom supports various forms of input, specifically including:

  • Text and Image Input: Can handle single or multiple image inputs to enable customized video generation for one or more subjects
  • Audio Input: Can accept additional audio input so that the subject speaks the corresponding audio content
  • Video Input: Supports video input, allowing for the replacement of specific objects in the video with subjects from a given image
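
As a sketch of how these conditions combine into a single generation request, the structure below groups them by modality. Every name here (the class, its fields, the file paths) is hypothetical, chosen only to mirror the list above; it is not the released API.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CustomVideoRequest:
    prompt: str                          # text condition
    subject_images: List[str]            # one or more identity images
    audio_path: Optional[str] = None     # drives the subject's speech
    source_video: Optional[str] = None   # video whose subject is replaced
    height: int = 720
    width: int = 1280
    num_frames: int = 129

# Single-subject, audio-driven example (hypothetical paths)
request = CustomVideoRequest(
    prompt="A woman presents a skincare product to the camera",
    subject_images=["assets/subject.png"],
    audio_path="assets/voiceover.wav",
)
print(request)
```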

(Figure: HunyuanCustom multimodal capabilities)

Application Scenarios

The multimodal capabilities of HunyuanCustom support various downstream tasks:

  • Virtual Human Advertisements: Creating product showcase videos by inputting multiple images
  • Virtual Try-On: Generating videos of people wearing specific clothing
  • Singing Avatars: Creating virtual characters that sing by combining image and audio
  • Video Editing: Using image and video as inputs to replace subjects in videos

(Figure: HunyuanCustom application scenarios)

Performance Comparison

HunyuanCustom was compared with state-of-the-art video customization methods including VACE, Skyreels, Pika, Vidu, Keling, and Hailuo. The evaluation focused on face/subject consistency, video-text alignment, and overall video quality.

In terms of key metrics, HunyuanCustom demonstrated significant advantages:

  • Face Similarity (Face-Sim): 0.627 (Ranked 1st)
  • DINO Similarity: 0.593 (Ranked 1st)
  • Temporal Consistency: 0.958 (Close to the best)
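
For readers who want to reproduce a DINO-style consistency score, the sketch below averages cosine similarity between DINO ViT features of the reference image and each generated frame. The checkpoint and pooling choice are assumptions for illustration; the paper's exact evaluation protocol may differ.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dino-vits16")
model = AutoModel.from_pretrained("facebook/dino-vits16").eval()

@torch.no_grad()
def dino_embedding(image: Image.Image) -> torch.Tensor:
    inputs = processor(images=image, return_tensors="pt")
    feats = model(**inputs).last_hidden_state[:, 0]  # [CLS] descriptor
    return torch.nn.functional.normalize(feats, dim=-1)

@torch.no_grad()
def dino_similarity(reference: Image.Image, frames: list) -> float:
    ref = dino_embedding(reference)
    sims = [(dino_embedding(f) @ ref.T).item() for f in frames]
    return sum(sims) / len(sims)  # higher = better subject consistency
```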

Open-Source Plan

Tencent released the inference code and model weights of HunyuanCustom on May 8, 2025. According to the open-source plan, the team will progressively release:

  • Single-Subject Video Customization
    • Inference code (already released)
    • Model checkpoints (already released)
    • ComfyUI plugin
  • Audio-Driven Video Customization
  • Video-Driven Video Customization
  • Multi-Subject Video Customization

System Requirements

The system requirements for generating videos with the HunyuanCustom model are as follows:

Model            Setting (height × width × frames)    GPU Peak Memory
HunyuanCustom    720px × 1280px × 129 frames          80GB
HunyuanCustom    512px × 896px × 129 frames           60GB
  • Minimum requirement: At least 24GB of VRAM is needed to generate 720p video, though generation will be very slow
  • Recommended configuration: An 80GB GPU is recommended for better generation quality
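
As a practical sketch, the helper below picks a generation setting from the table based on local VRAM. The thresholds mirror the published peak-memory figures; whether and how the inference code offloads to CPU below 60GB is assumed here, not confirmed.

```python
import torch

def pick_setting() -> dict:
    if not torch.cuda.is_available():
        raise RuntimeError("HunyuanCustom inference requires a CUDA GPU")
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if vram_gb >= 80:
        return {"height": 720, "width": 1280, "frames": 129, "offload": False}
    if vram_gb >= 60:
        return {"height": 512, "width": 896, "frames": 129, "offload": False}
    if vram_gb >= 24:
        # Below 60GB, 720p generation is reported to work only with
        # aggressive (and slow) CPU offloading.
        return {"height": 720, "width": 1280, "frames": 129, "offload": True}
    raise RuntimeError(f"{vram_gb:.0f}GB VRAM is below the 24GB minimum")

print(pick_setting())
```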