HunyuanVideo Text-to-Video Workflow Guide and Examples
This tutorial provides detailed instructions on how to use Tencent's HunyuanVideo model in ComfyUI for text-to-video generation. We'll walk through the process step by step, starting with environment setup.
1. Hardware Requirements
Before getting started, please ensure your system meets these minimum requirements:
- GPU: NVIDIA GPU with CUDA support
- Minimum: 60GB VRAM (for generating 720×1280 video at 129 frames)
- Recommended: 80GB VRAM (for better generation quality)
- Minimum usable: 45GB VRAM (for generating 544×960 video at 129 frames)
- Operating System: Linux (official test environment)
- CUDA Version: CUDA 11.8 or 12.0+ recommended
Hardware requirements source: https://huggingface.co/tencent/HunyuanVideo
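Before downloading roughly 35GB of models, you can confirm what your GPU actually offers with a quick PyTorch check. A minimal sketch, assuming torch with CUDA support is installed; the thresholds mirror the list above:

```python
# Minimal sketch: check whether the local GPU meets the VRAM guidance above.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"{props.name}: {vram_gb:.1f} GB VRAM")
    if vram_gb >= 60:
        print("OK for 720x1280, 129-frame generation")
    elif vram_gb >= 45:
        print("OK for 544x960, 129-frame generation")
    else:
        print("Below the minimum usable threshold for this workflow")
else:
    print("No CUDA device detected")
```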
2. Install and Update ComfyUI to the Latest Version
If you haven't installed ComfyUI yet, please refer to these sections:
- ComfyUI Installation Guide
- ComfyUI Update Guide
You’ll need to install and update ComfyUI to the latest version to access the ‘EmptyHunyuanLatentVideo’ node.
3. Model Download and Installation
HunyuanVideo requires the following model files:
3.1 Main Model File
Download the following file from HunyuanVideo Main Model Download Page:
| Filename | Size | Directory |
| --- | --- | --- |
| hunyuan_video_t2v_720p_bf16.safetensors | ~25.6GB | ComfyUI/models/diffusion_models |
3.2 Text Encoder Files
Download the following files from HunyuanVideo Text Encoder Download Page:
| Filename | Size | Directory |
| --- | --- | --- |
| clip_l.safetensors | ~246MB | ComfyUI/models/text_encoders |
| llava_llama3_fp8_scaled.safetensors | ~9.09GB | ComfyUI/models/text_encoders |
3.3 VAE Model File
Download the following file from HunyuanVideo VAE Download Page:
| Filename | Size | Directory |
| --- | --- | --- |
| hunyuan_video_vae_bf16.safetensors | ~493MB | ComfyUI/models/vae |
Model Directory Structure Reference
ComfyUI/
├── models/
│ ├── diffusion_models/
│ │ └── hunyuan_video_t2v_720p_bf16.safetensors # Main model file
│ ├── text_encoders/
│ │ ├── clip_l.safetensors # CLIP text encoder
│ │ └── llava_llama3_fp8_scaled.safetensors # LLaVA text encoder
│ └── vae/
│ └── hunyuan_video_vae_bf16.safetensors # VAE model file
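If you prefer scripting the downloads, the sketch below uses the huggingface_hub package. The repo id (Comfy-Org/HunyuanVideo_repackaged) and the split_files/... paths are assumptions based on the repackaged ComfyUI releases, so verify them against the download pages linked above:

```python
# Sketch: download the four model files and place them in the layout above.
# Repo id and in-repo paths are assumptions; verify before running.
import shutil
from pathlib import Path
from huggingface_hub import hf_hub_download

REPO = "Comfy-Org/HunyuanVideo_repackaged"  # assumed repo id
FILES = {
    "split_files/diffusion_models/hunyuan_video_t2v_720p_bf16.safetensors": "ComfyUI/models/diffusion_models",
    "split_files/text_encoders/clip_l.safetensors": "ComfyUI/models/text_encoders",
    "split_files/text_encoders/llava_llama3_fp8_scaled.safetensors": "ComfyUI/models/text_encoders",
    "split_files/vae/hunyuan_video_vae_bf16.safetensors": "ComfyUI/models/vae",
}

for rel_path, target_dir in FILES.items():
    cached = hf_hub_download(repo_id=REPO, filename=rel_path)  # goes to the HF cache
    dest = Path(target_dir) / Path(rel_path).name
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(cached, dest)  # copy into the directory ComfyUI scans
```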
4. Workflow File Download
Workflow file source: HunyuanVideo Workflow Download
Basic Video Generation Workflow
HunyuanVideo supports the following resolution settings:
| Resolution | 9:16 Ratio | 16:9 Ratio | 4:3 Ratio | 3:4 Ratio | 1:1 Ratio |
| --- | --- | --- | --- | --- | --- |
| 540p | 544×960×129f | 960×544×129f | 624×832×129f | 832×624×129f | 720×720×129f |
| 720p (Recommended) | 720×1280×129f | 1280×720×129f | 1104×832×129f | 832×1104×129f | 960×960×129f |
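All presets above generate 129 frames; at the default 24 fps used later in SaveAnimatedWEBP, that works out to roughly 5.4 seconds of video:

```python
# Clip duration for the 129-frame presets at the default 24 fps.
frames, fps = 129, 24
print(f"{frames / fps:.1f} seconds")  # -> 5.4 seconds
```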
5. Workflow Node Explanation
5.1 Model Loading Nodes
- UNETLoader
  - Purpose: Load the main diffusion model
  - Parameters:
    - Model: `hunyuan_video_t2v_720p_bf16.safetensors`
    - Weight Type: `default` (choose an fp8 type if VRAM is insufficient)
- DualCLIPLoader
  - Purpose: Load the two text encoder models
  - Parameters:
    - CLIP 1: `clip_l.safetensors`
    - CLIP 2: `llava_llama3_fp8_scaled.safetensors`
    - Text Encoder Type: `hunyuan_video`
- VAELoader
  - Purpose: Load the VAE model
  - Parameters:
    - VAE Model: `hunyuan_video_vae_bf16.safetensors`
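If you script ComfyUI through its HTTP API instead of the graph editor, these three loaders correspond to API-prompt entries roughly like the sketch below. The input names (unet_name, weight_dtype, clip_name1, type, vae_name) match current ComfyUI builds, but treat them as assumptions and check your own install:

```python
# Sketch: the three loader nodes expressed as an API-format prompt fragment.
# Node ids ("1", "2", "3") are arbitrary labels used for wiring.
loaders = {
    "1": {"class_type": "UNETLoader",
          "inputs": {"unet_name": "hunyuan_video_t2v_720p_bf16.safetensors",
                     "weight_dtype": "default"}},
    "2": {"class_type": "DualCLIPLoader",
          "inputs": {"clip_name1": "clip_l.safetensors",
                     "clip_name2": "llava_llama3_fp8_scaled.safetensors",
                     "type": "hunyuan_video"}},
    "3": {"class_type": "VAELoader",
          "inputs": {"vae_name": "hunyuan_video_vae_bf16.safetensors"}},
}
```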
5.2 Key Video Generation Nodes
- EmptyHunyuanLatentVideo
  - Purpose: Create the empty latent that the video is generated into (see the shape sketch after this list)
  - Parameters:
    - Width: video width in pixels (e.g., 848)
    - Height: video height in pixels (e.g., 480)
    - Frame Count: number of frames (e.g., 73)
    - Batch Size: batch size (default 1)
- CLIPTextEncode
  - Purpose: Encode the text prompt
  - Parameters:
    - Text: positive prompt describing what you want to generate; detailed English descriptions are recommended
- FluxGuidance
  - Purpose: Control the generation guidance strength
  - Parameters:
    - Guidance Scale: guidance strength (default 6.0); higher values follow the prompt more closely but may reduce video quality
- KSamplerSelect
  - Purpose: Select the sampler
  - Parameters:
    - Sampler: sampling method (default `euler`); other options include `euler_ancestral`, `dpmpp_2m`, etc.
- BasicScheduler
  - Purpose: Configure the sampling schedule
  - Parameters:
    - Scheduler: scheduling method (default `simple`)
    - Steps: sampling steps (20-30 recommended)
    - Denoise: denoising strength (default 1.0)
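A note on what EmptyHunyuanLatentVideo actually allocates: the HunyuanVideo VAE compresses 8× spatially and 4× temporally into 16 latent channels, so the example settings above map to a fairly small tensor. A sketch of the arithmetic (the compression factors follow the HunyuanVideo design; verify against the model card if in doubt):

```python
# Rough shape math for EmptyHunyuanLatentVideo's output latent.
# Compression factors (8x spatial, 4x temporal, 16 channels) are the
# HunyuanVideo VAE design values; not ComfyUI's actual code.
def hunyuan_latent_shape(width, height, frames, batch_size=1):
    t = (frames - 1) // 4 + 1   # temporal compression: 4 frames per latent step
    return (batch_size, 16, t, height // 8, width // 8)

print(hunyuan_latent_shape(848, 480, 73))  # -> (1, 16, 19, 60, 106)
```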
5.3 Video Decoding and Saving Nodes
- VAEDecodeTiled
  - Purpose: Decode the latent video into pixel frames
  - Parameters:
    - Tile Size: 256 (reduce if VRAM is insufficient)
    - Overlap: 64 (reduce if VRAM is insufficient)
  - Note: Prefer VAEDecodeTiled over VAEDecode, as it is more memory-efficient
- SaveAnimatedWEBP
  - Purpose: Save the generated frames as an animated WebP
  - Parameters:
    - Filename Prefix: prefix for the output file name
    - FPS: frame rate (default 24)
    - Lossless: whether to save losslessly (default false)
    - Quality: quality (0-100, default 80)
    - Method: encoding method (default `default`)
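Once the graph runs in the editor, you can export it with "Save (API Format)" and queue it programmatically through ComfyUI's HTTP API. A minimal sketch, assuming a default local server at 127.0.0.1:8188 and an exported file named workflow_api.json:

```python
# Minimal sketch: queue an exported API-format workflow on a local ComfyUI server.
import json
import urllib.request

with open("workflow_api.json") as f:
    prompt = json.load(f)

req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": prompt}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode())  # response includes a prompt_id
```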
6. Parameter Optimization Tips
6.1 Memory Optimization
If you run into out-of-memory errors:
- Choose an fp8 weight type in UNETLoader (see the sketch after this list)
- Reduce the tile_size and overlap parameters in VAEDecodeTiled
- Use a lower video resolution and frame count
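In API-prompt terms, the first suggestion is a one-line change to the UNETLoader entry. The exact option names are an assumption; check the weight_dtype dropdown in your build:

```python
# Sketch: load the diffusion model with fp8 weights to reduce VRAM use.
# "fp8_e4m3fn" is an assumed option name; verify it in your UNETLoader dropdown.
unet_loader = {
    "class_type": "UNETLoader",
    "inputs": {"unet_name": "hunyuan_video_t2v_720p_bf16.safetensors",
               "weight_dtype": "fp8_e4m3fn"},
}
```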
6.2 Generation Quality Optimization
- Prompt Optimization
  - Recommended structure: `[Subject Description], [Action Description], [Scene Description], [Style Description], [Quality Requirements]` (a small helper sketch follows this list)
  - Example: anime style anime girl with massive fennec ears and one big fluffy tail, she has blonde hair long hair blue eyes wearing a pink sweater and a long blue skirt walking in a beautiful outdoor scenery with snow mountains in the background
- Parameter Adjustments
  - Increase sampling steps for better quality
  - Increase the Guidance Scale moderately for better prompt adherence
  - Adjust FPS and video quality parameters as needed
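As referenced above, here is a tiny, purely illustrative helper that assembles prompts in the recommended bracketed order (the function name and fields are hypothetical, not part of ComfyUI):

```python
# Illustrative helper: join prompt parts in the recommended order,
# dropping any part that is left empty.
def build_prompt(subject, action="", scene="", style="", quality=""):
    parts = [subject, action, scene, style, quality]
    return ", ".join(p for p in parts if p)

print(build_prompt(
    subject="anime girl with massive fennec ears and one big fluffy tail",
    action="walking",
    scene="beautiful outdoor scenery with snow mountains in the background",
    style="anime style",
    quality="high detail",
))
```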
7. Common Issues
- Insufficient Memory
  - Apply the suggestions from the memory optimization section (6.1)
  - Close other memory-hungry programs
  - Use lower video resolution settings
- Slow Generation Speed
  - This is normal; video generation takes time
  - Reduce sampling steps and frame count to speed things up
  - Use a lower resolution to increase speed
- Quality Issues
  - Refine the prompt description
  - Increase sampling steps
  - Adjust the Guidance Scale
  - Try different samplers (see the sweep sketch after this list)
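For the last point, one practical approach is to queue the same exported workflow once per sampler and compare the results. A minimal sketch, assuming a local ComfyUI server at the default 127.0.0.1:8188 and a workflow_api.json exported with "Save (API Format)":

```python
# Sketch: queue the same exported workflow once per sampler for comparison.
import json
import urllib.request

with open("workflow_api.json") as f:
    prompt = json.load(f)

# Find the KSamplerSelect node(s) regardless of node id.
sampler_nodes = [n for n in prompt.values()
                 if n.get("class_type") == "KSamplerSelect"]

for name in ["euler", "euler_ancestral", "dpmpp_2m"]:
    for node in sampler_nodes:
        node["inputs"]["sampler_name"] = name
    data = json.dumps({"prompt": prompt}).encode("utf-8")
    req = urllib.request.Request("http://127.0.0.1:8188/prompt", data=data,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)
```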