Wan2.2-S2V Audio-Driven Video Generation: ComfyUI Workflow and Tutorial
Wan2.2-S2V represents a significant advancement in AI video generation technology, capable of creating dynamic video content from static images and audio inputs. This innovative model excels at producing synchronized videos with natural lip-sync, making it particularly valuable for content creators working on dialogue scenes, musical performances, and character-driven narratives.
Model Highlights
- Audio-Driven Video Generation: Transforms static images and audio into synchronized videos with natural lip-sync and expressions
- Cinematic-Grade Quality: Generates film-quality videos with authentic facial expressions, body movements, and camera language
- Minute-Level Generation: Supports long-form video creation up to minute-level duration in a single generation
- Multi-Format Support: Works with real people, cartoons, animals, digital humans, and supports portrait, half-body, and full-body formats
- Enhanced Motion Control: Generates actions and environments from text instructions with AdaIN and CrossAttention control mechanisms (see the AdaIN sketch after this list)
- High Performance Metrics: Achieves FID 15.66, CSIM 0.677, and SSIM 0.734 for superior video quality and identity consistency
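For readers unfamiliar with AdaIN, here is a minimal, generic PyTorch sketch of the Adaptive Instance Normalization operation itself. It illustrates only the mechanism named above, not Wan2.2-S2V's actual implementation:

```python
import torch

def adain(content: torch.Tensor, condition: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Generic AdaIN: re-normalize content features with the channel-wise
    statistics of a conditioning signal. Expected shapes are
    (batch, channels, ...) with at least one trailing dim, e.g. (B, C, T, H, W)."""
    dims = tuple(range(2, content.dim()))  # all non-batch, non-channel dims
    c_mean = content.mean(dim=dims, keepdim=True)
    c_std = content.std(dim=dims, keepdim=True) + eps
    k_mean = condition.mean(dim=dims, keepdim=True)
    k_std = condition.std(dim=dims, keepdim=True) + eps
    # Normalize the content features, then shift/scale them
    # toward the condition statistics.
    return k_std * (content - c_mean) / c_std + k_mean
```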
Wan2.2 S2V ComfyUI Native Workflow
1. Download Workflow File
Download the following workflow file and drag it into ComfyUI to load it.
Download the following image and audio files to use as inputs:
2. Model Links
You can find the models in our repo:
- diffusion_models
- audio_encoders
- vae
- text_encoders

Save the model files in the following directory structure:
ComfyUI/
├───📂 models/
│ ├───📂 diffusion_models/
│ │ ├─── wan2.2_s2v_14B_fp8_scaled.safetensors
│ │ └─── wan2.2_s2v_14B_bf16.safetensors
│ ├───📂 text_encoders/
│ │ └─── umt5_xxl_fp8_e4m3fn_scaled.safetensors
│ ├───📂 audio_encoders/ # Create one if you can't find this folder
│ │ └─── wav2vec2_large_english_fp16.safetensors
│ └───📂 vae/
│ └─── wan_2.1_vae.safetensors
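If you prefer to script the downloads, here is a minimal sketch using huggingface_hub. The repo id and in-repo paths are assumptions based on Comfy-Org's usual repackaging layout; verify them against the model links above before running.

```python
from huggingface_hub import hf_hub_download

REPO = "Comfy-Org/Wan_2.2_ComfyUI_Repackaged"  # assumed repo id -- verify against the links above

# (in-repo path, target ComfyUI models subfolder) pairs; paths are assumptions
FILES = [
    ("split_files/diffusion_models/wan2.2_s2v_14B_fp8_scaled.safetensors", "diffusion_models"),
    ("split_files/audio_encoders/wav2vec2_large_english_fp16.safetensors", "audio_encoders"),
    ("split_files/vae/wan_2.1_vae.safetensors", "vae"),
    ("split_files/text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors", "text_encoders"),
]

for repo_path, target in FILES:
    # Downloads into the local Hugging Face cache and returns the file path.
    local = hf_hub_download(repo_id=REPO, filename=repo_path)
    print(f"downloaded {local} -> copy into ComfyUI/models/{target}/")
```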
3. Workflow Instructions
3.1 Lightning LoRA (Optional, for Acceleration)
Lightning LoRA cuts generation from 20 steps to 4, greatly reducing generation time, but it may affect quality. Use it for quick previews, and disable it for final output.
3.1.1 Audio Preprocessing Tips
Voice Separation for Better Results: Since ComfyUI core doesn’t include voice separation nodes, we recommend using external tools to separate vocals from background music before processing. This is especially important for dialogue and lip-sync generation, as clean vocal tracks produce significantly better results than mixed audio with background music or noise.
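As one illustration (Demucs is our suggestion here, not a tool the tutorial prescribes), the sketch below calls the open-source Demucs separator from Python to extract a clean vocal stem; the file path is illustrative.

```python
# Sketch: separate vocals from background music with Demucs
# (pip install demucs) before feeding the audio into the workflow.
import subprocess

mixed_track = "performance_with_bgm.mp3"  # illustrative input file

# "--two-stems vocals" writes vocals.wav and no_vocals.wav under
# ./separated/<model name>/<track name>/ by default.
subprocess.run(["demucs", "--two-stems", "vocals", mixed_track], check=True)

# Load the resulting vocals.wav in the LoadAudio node for cleaner lip-sync.
```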
3.2 About fp8_scaled and bf16 Models
Both models are linked in the Model Links section above. The template uses wan2.2_s2v_14B_fp8_scaled.safetensors for lower VRAM usage; try wan2.2_s2v_14B_bf16.safetensors for better quality.
3.3 Step-by-Step Operation Instructions
Step 1: Load Models
- Load Diffusion Model: Load wan2.2_s2v_14B_fp8_scaled.safetensors or wan2.2_s2v_14B_bf16.safetensors
  - The workflow uses wan2.2_s2v_14B_fp8_scaled.safetensors for lower VRAM requirements
  - Use wan2.2_s2v_14B_bf16.safetensors for better quality output
- Load CLIP: Load umt5_xxl_fp8_e4m3fn_scaled.safetensors
- Load VAE: Load wan_2.1_vae.safetensors
- AudioEncoderLoader: Load wav2vec2_large_english_fp16.safetensors
- LoraLoaderModelOnly: Load wan2.2_t2v_lightx2v_4steps_lora_v1.1_high_noise.safetensors (Lightning LoRA)
  - This LoRA reduces generation time but may affect quality
  - Disable it if output quality is insufficient
Step 2: Upload Input Files
- LoadAudio: Upload the provided audio file or your own audio
- Load Image: Upload the reference image
Step 3: Configure Generation Settings
- Batch sizes: Set according to the number of Video S2V Extend subgraph nodes
  - Each Video S2V Extend subgraph adds 77 frames to the output
  - Example: 2 Video S2V Extend subgraphs = batch size 3
- Chunk Length: Keep the default value of 77
- Sampler Settings: Choose based on Lightning LoRA usage
  - With the 4-step Lightning LoRA: steps: 4, cfg: 1.0
  - Without Lightning LoRA: steps: 20, cfg: 6.0
- Size Settings: Set the output video dimensions
- Video S2V Extend: Video extension subgraph nodes (see the sketch below this step)
  - Each extension generates 77 ÷ 16 = 4.8125 seconds of video
  - Calculate the nodes needed: audio length (seconds) × 16 ÷ 77, rounded up
  - Example: 14 s of audio = 224 frames; 224 ÷ 77 ≈ 2.9, so 3 extension nodes
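To make the frame math above concrete, here is a small hypothetical helper that applies it: 16 fps output, 77-frame chunks, extension count rounded up, and a batch size of extensions + 1 per the earlier example.

```python
import math

FPS = 16    # Wan2.2 S2V output frame rate
CHUNK = 77  # frames per chunk (the default Chunk Length)

def plan_s2v(audio_seconds: float) -> tuple[int, int]:
    """Return (extend_nodes, batch_size) for a clip of the given length.

    extend_nodes follows the rule above: audio length (s) * 16 / 77,
    rounded up; batch size is extend_nodes + 1, matching the earlier
    example (2 extends -> batch size 3)."""
    extend_nodes = math.ceil(audio_seconds * FPS / CHUNK)
    return extend_nodes, extend_nodes + 1

print(plan_s2v(14))  # 14 s -> 224 frames -> (3, 4)
```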
Step 4: Run the Workflow
- Use Ctrl-Enter or click the Run button to execute the workflow
Related Links
- Wan2.2 S2V Code: GitHub
- Wan2.2 S2V Model: Hugging Face