ComfyUI Wiki

Wan2.2-S2V Audio-Driven Video Generation ComfyUI workflow and tutorial

Wan2.2-S2V represents a significant advancement in AI video generation technology, capable of creating dynamic video content from static images and audio inputs. This innovative model excels at producing synchronized videos with natural lip-sync, making it particularly valuable for content creators working on dialogue scenes, musical performances, and character-driven narratives.

Model Highlights

  • Audio-Driven Video Generation: Transforms static images and audio into synchronized videos with natural lip-sync and expressions
  • Cinematic-Grade Quality: Generates film-quality videos with authentic facial expressions, body movements, and camera language
  • Minute-Level Generation: Supports long-form video creation up to minute-level duration in a single generation
  • Multi-Format Support: Works with real people, cartoons, animals, digital humans, and supports portrait, half-body, and full-body formats
  • Enhanced Motion Control: Generates actions and environments from text instructions with AdaIN and CrossAttention control mechanisms
  • High Performance Metrics: Achieves FID 15.66, CSIM 0.677, and SSIM 0.734 for superior video quality and identity consistency

Wan2.2 S2V ComfyUI Native Workflow

1. Download Workflow File

Download the following workflow file and drag it into ComfyUI to load the workflow.

Download the following image and audio files to use as input.

2. Download Models

You can find all of the models in our repo:

  • diffusion_models
  • audio_encoders
  • vae
  • text_encoders

Save the files to the following locations:

ComfyUI/
├───📂 models/
│   ├───📂 diffusion_models/
│   │   ├─── wan2.2_s2v_14B_fp8_scaled.safetensors
│   │   └─── wan2.2_s2v_14B_bf16.safetensors
│   ├───📂 text_encoders/
│   │   └─── umt5_xxl_fp8_e4m3fn_scaled.safetensors 
│   ├───📂 audio_encoders/ # Create one if you can't find this folder
│   │   └─── wav2vec2_large_english_fp16.safetensors 
│   └───📂 vae/
│       └─── wan_2.1_vae.safetensors
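After downloading, you can sanity-check that the files landed in the right folders with a short script. This is an optional helper, not part of the workflow; it assumes the default `ComfyUI/models` path shown in the tree above and checks for the fp8_scaled diffusion model (if you downloaded the bf16 variant instead, adjust the first entry).

```python
from pathlib import Path

# Relative paths under ComfyUI/models, matching the tree above.
# Only one of the fp8_scaled/bf16 diffusion models is needed; this
# checks for the fp8_scaled default used by the template workflow.
REQUIRED = [
    "diffusion_models/wan2.2_s2v_14B_fp8_scaled.safetensors",
    "text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors",
    "audio_encoders/wav2vec2_large_english_fp16.safetensors",
    "vae/wan_2.1_vae.safetensors",
]

def missing_models(models_root="ComfyUI/models"):
    """Return the relative paths of any required model files not found."""
    root = Path(models_root)
    return [rel for rel in REQUIRED if not (root / rel).exists()]

if __name__ == "__main__":
    for rel in missing_models():
        print("missing:", rel)
```

Run it from the directory that contains your `ComfyUI` folder; it prints one line per missing file and nothing when everything is in place.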

3. Workflow Instructions

3.1 Lightning LoRA (Optional, for Acceleration)

Lightning LoRA cuts sampling from 20 steps to 4, greatly reducing generation time, but may reduce output quality. Use it for quick previews and disable it for final output.

3.1.1 Audio Preprocessing Tips

Voice Separation for Better Results: Since ComfyUI core doesn’t include voice separation nodes, we recommend using external tools to separate vocals from background music before processing. This is especially important for dialogue and lip-sync generation, as clean vocal tracks produce significantly better results than mixed audio with background music or noise.

3.2 About fp8_scaled and bf16 Models

Both models are available in our repo. The template workflow uses wan2.2_s2v_14B_fp8_scaled.safetensors for lower VRAM usage; switch to wan2.2_s2v_14B_bf16.safetensors if you want better quality and have the VRAM for it.

3.3 Step-by-Step Operation Instructions

Step 1: Load Models

  1. Load Diffusion Model: Load wan2.2_s2v_14B_fp8_scaled.safetensors or wan2.2_s2v_14B_bf16.safetensors
    • The workflow uses wan2.2_s2v_14B_fp8_scaled.safetensors for lower VRAM requirements
    • Use wan2.2_s2v_14B_bf16.safetensors for better quality output
  2. Load CLIP: Load umt5_xxl_fp8_e4m3fn_scaled.safetensors
  3. Load VAE: Load wan_2.1_vae.safetensors
  4. AudioEncoderLoader: Load wav2vec2_large_english_fp16.safetensors
  5. LoraLoaderModelOnly: Load wan2.2_t2v_lightx2v_4steps_lora_v1.1_high_noise.safetensors (Lightning LoRA)
    • This LoRA reduces generation time but may affect quality
    • Disable if output quality is insufficient
  6. LoadAudio: Upload the provided audio file or your own audio
  7. Load Image: Upload reference image
  8. Batch sizes: Set according to the number of Video S2V Extend subgraph nodes
    • Each Video S2V Extend subgraph adds 77 frames to the output
    • Example: 2 Video S2V Extend subgraphs = batch size 3
    • Chunk Length: Keep default value of 77
  9. Sampler Settings: Choose based on Lightning LoRA usage
    • With 4-step Lightning LoRA: steps: 4, cfg: 1.0
    • Without Lightning LoRA: steps: 20, cfg: 6.0
  10. Size Settings: Set the output video dimensions
  11. Video S2V Extend: Video extension subgraph nodes
    • Each extension generates 77 / 16 = 4.8125 seconds of video
    • Calculate nodes needed: audio length (seconds) × 16 ÷ 77, rounded up
    • Example: 14s audio = 224 frames; 224 ÷ 77 ≈ 2.91, so 3 extension nodes
  12. Use Ctrl-Enter or click the Run button to execute the workflow
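The frame arithmetic in steps 8 and 11 can be sketched as a small helper. This is an illustrative calculation following the tutorial's formulas, not a ComfyUI API: it assumes the 16 fps output rate and 77-frame chunk length given above, and derives batch size from the "2 extend subgraphs = batch size 3" example (extend nodes + 1).

```python
import math

FPS = 16     # output frame rate of the workflow
CHUNK = 77   # frames per chunk (the default Chunk Length)

def plan_s2v(audio_seconds):
    """Apply the tutorial's formulas: total frames, extend nodes, batch size."""
    frames = math.ceil(audio_seconds * FPS)    # e.g. 14 s -> 224 frames
    extend_nodes = math.ceil(frames / CHUNK)   # audio length * 16 / 77, rounded up
    batch_size = extend_nodes + 1              # per the batch-size example in step 8
    return frames, extend_nodes, batch_size

print(plan_s2v(14))  # -> (224, 3, 4): the 14 s example needs 3 extension nodes
```

Each chunk contributes 77 / 16 = 4.8125 seconds of video, so plan for enough extension nodes to cover the full length of your audio track.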