Tencent Releases HunyuanCustom Multimodal Video Generation System
Tencent has unveiled HunyuanCustom, a multimodal video customization framework capable of maintaining subject consistency while supporting various input conditions, including text, images, audio, and video. The technology has been open-sourced along with its models and code, bringing new possibilities to video content creation.
Technical Innovation
Built upon the HunyuanVideo generation framework, HunyuanCustom addresses two major challenges in current video generation technology: maintaining identity consistency and supporting only a limited range of input modalities. The framework introduces several key innovations:
- Text-Image Fusion Module: builds on LLaVA to enhance multimodal understanding of the text and image inputs
- Image ID Enhancement Module: leverages temporal concatenation to reinforce identity features across frames (see the sketch after this list)
- Modality-Specific Condition Injection Mechanisms:
  - AudioNet Module: achieves hierarchical alignment via spatial cross-attention
  - Video-Driven Injection Module: integrates conditional video through a patchify-based feature-alignment network
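To make the identity-reinforcement idea concrete, here is a minimal PyTorch sketch of temporal concatenation, assuming the reference image is encoded by the same VAE as the video and its latent is prepended along the temporal axis. The function name and tensor shapes are illustrative assumptions, not Tencent's implementation.

```python
import torch

def concat_identity_latent(video_latents: torch.Tensor,
                           image_latent: torch.Tensor) -> torch.Tensor:
    """Prepend the reference-image latent along the temporal axis.

    video_latents: (B, C, T, H, W) latents of the video being denoised
    image_latent:  (B, C, H, W)    latent of the subject reference image
    """
    # Treat the image as a one-frame "video" so the diffusion transformer
    # can attend to identity features at every denoising step.
    identity_frame = image_latent.unsqueeze(2)                 # (B, C, 1, H, W)
    return torch.cat([identity_frame, video_latents], dim=2)   # (B, C, T+1, H, W)

# Example with made-up latent dimensions:
video_latents = torch.randn(1, 16, 33, 45, 80)
image_latent = torch.randn(1, 16, 45, 80)
print(concat_identity_latent(video_latents, image_latent).shape)
# torch.Size([1, 16, 34, 45, 80])
```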
These innovations enable HunyuanCustom to significantly outperform existing open- and closed-source methods in identity consistency, realism, and text-video alignment.
Multimodal Video Customization Capabilities
HunyuanCustom supports various forms of input, specifically including:
- Text and Image Input: accepts single or multiple images for customized video generation of one or more subjects
- Audio Input: an additional audio track can drive the subject to speak the given content (see the cross-attention sketch after this list)
- Video Input: specific objects in a source video can be replaced with the subject from a provided image
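For illustration, the audio-driven path can be pictured as cross-attention in which video tokens query audio features. The sketch below is a toy single-layer version under assumed shapes; the class name AudioCrossAttention is hypothetical, and the actual AudioNet performs hierarchical alignment across multiple layers.

```python
import torch
import torch.nn as nn

class AudioCrossAttention(nn.Module):
    """Toy spatial cross-attention: video tokens attend to audio features."""

    def __init__(self, dim: int, audio_dim: int, heads: int = 8):
        super().__init__()
        self.to_kv = nn.Linear(audio_dim, dim)  # project audio into token space
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens: torch.Tensor,
                audio_feats: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, N, dim)      flattened spatial tokens
        # audio_feats:  (B, M, audio_dim) audio embeddings
        kv = self.to_kv(audio_feats)
        out, _ = self.attn(video_tokens, kv, kv)
        return video_tokens + out  # residual injection of audio information

block = AudioCrossAttention(dim=1024, audio_dim=768)
out = block(torch.randn(1, 4096, 1024), torch.randn(1, 50, 768))
print(out.shape)  # torch.Size([1, 4096, 1024])
```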
Application Scenarios
The multimodal capabilities of HunyuanCustom support various downstream tasks:
- Virtual Human Advertisements: Creating product showcase videos by inputting multiple images
- Virtual Try-On: Generating videos of people wearing specific clothing
- Singing Avatars: Creating virtual characters that sing by combining image and audio
- Video Editing: Using image and video as inputs to replace subjects in videos
Performance Comparison
HunyuanCustom was compared with state-of-the-art video customization methods including VACE, Skyreels, Pika, Vidu, Keling, and Hailuo. The evaluation focused on face/subject consistency, video-text alignment, and overall video quality.
In terms of key metrics, HunyuanCustom demonstrated significant advantages:
- Face Similarity (Face-Sim): 0.627 (Ranked 1st)
- DINO Similarity: 0.593 (Ranked 1st)
- Temporal Consistency: 0.958 (Close to the best)
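For context on these metrics: DINO similarity is commonly computed as the mean cosine similarity between DINO ViT features of the reference subject image and each generated frame. The sketch below follows that common recipe using the public DINO ViT-S/16 checkpoint; HunyuanCustom's exact evaluation protocol may differ.

```python
import torch
import torch.nn.functional as F

# Public self-supervised DINO ViT-S/16 backbone from the official repo.
model = torch.hub.load("facebookresearch/dino:main", "dino_vits16")
model.eval()

@torch.no_grad()
def dino_similarity(reference: torch.Tensor, frames: torch.Tensor) -> float:
    """Mean cosine similarity between the reference image and each frame.

    reference: (1, 3, 224, 224) normalized reference subject image
    frames:    (T, 3, 224, 224) normalized generated frames
    """
    ref_feat = F.normalize(model(reference), dim=-1)   # (1, D)
    frame_feats = F.normalize(model(frames), dim=-1)   # (T, D)
    return (frame_feats @ ref_feat.T).mean().item()
```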
Open-Source Plan
Tencent released the inference code and model weights of HunyuanCustom on May 8, 2025. According to the open-source plan, the team will progressively release:
- Single-Subject Video Customization
  - Inference code (already released)
  - Model checkpoints (already released)
  - ComfyUI plugin
- Audio-Driven Video Customization
- Video-Driven Video Customization
- Multi-Subject Video Customization
System Requirements
The system requirements for generating videos with the HunyuanCustom model are as follows:
| Model | Setting (height × width × frames) | GPU Peak Memory |
|---|---|---|
| HunyuanCustom | 720px × 1280px × 129f | 80GB |
| HunyuanCustom | 512px × 896px × 129f | 60GB |
- Minimum requirement: a GPU with at least 24GB of VRAM can generate 720p video, but very slowly
- Recommended configuration: a GPU with 80GB of memory for better generation quality
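As a practical aid, the helper below, grounded in the table above, suggests a generation setting based on detected GPU memory. The function and thresholds are illustrative, not part of the official tooling.

```python
import torch

def pick_setting() -> tuple[int, int, int]:
    """Return a (height, width, frames) setting for the detected GPU."""
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if total_gb >= 80:
        return (720, 1280, 129)  # 80GB peak memory: full 720p setting
    if total_gb >= 60:
        return (512, 896, 129)   # 60GB peak memory: reduced resolution
    # Below 60GB, fall back to the smaller setting; per the notes above,
    # even ~24GB of VRAM can generate 720p video, but very slowly.
    return (512, 896, 129)

if torch.cuda.is_available():
    print("Suggested setting:", pick_setting())
```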