
ByteDance Open Sources LatentSync - High-Precision Lip Sync Technology Based on Diffusion Model

ByteDance recently open-sourced LatentSync, a lip synchronization tool, on GitHub. LatentSync is an end-to-end lip sync framework built on an audio-conditioned latent diffusion model. It targets high-precision audio-visual synchronization and addresses the frame jittering common in traditional methods.

Technical Innovations

LatentSync’s main technical innovations include:

  1. End-to-End Latent Space Diffusion Model

    • No intermediate motion representations needed
    • Direct modeling of complex audio-visual relationships in latent space
    • Builds on the generative capabilities of Stable Diffusion
  2. Temporal Consistency Optimization

    • Introduces Temporal Representation Alignment (TREPA)
    • Uses large-scale self-supervised video models for temporal feature extraction
    • Effectively improves temporal coherence in generated videos
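
The idea behind temporal representation alignment can be illustrated with a minimal sketch: extract temporal features from a generated clip and from a reference clip, then penalize the distance between them. This is not LatentSync's implementation. The function names are hypothetical, and the toy encoder below (plain frame-to-frame differences) is an assumption standing in for the large self-supervised video model mentioned above.

```python
from typing import List, Sequence

# A flattened frame; real frames would be image tensors.
Frame = Sequence[float]


def toy_temporal_features(frames: Sequence[Frame]) -> List[List[float]]:
    """Hypothetical stand-in for a frozen self-supervised video encoder.

    Returns frame-to-frame differences, a crude description of motion over time.
    """
    return [
        [b - a for a, b in zip(prev, curr)]
        for prev, curr in zip(frames, frames[1:])
    ]


def temporal_alignment_loss(
    generated: Sequence[Frame], reference: Sequence[Frame]
) -> float:
    """Mean squared distance between the temporal features of two clips."""
    gen_feats = toy_temporal_features(generated)
    ref_feats = toy_temporal_features(reference)
    squared = [
        (g - r) ** 2
        for gen_step, ref_step in zip(gen_feats, ref_feats)
        for g, r in zip(gen_step, ref_step)
    ]
    return sum(squared) / len(squared)


reference = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]  # smooth motion
jittery = [[0.0, 0.0], [2.0, 2.0], [2.0, 2.0]]    # same endpoints, uneven motion

print(temporal_alignment_loss(reference, reference))  # identical clips: 0.0
print(temporal_alignment_loss(jittery, reference))    # uneven motion: 1.0
```

In this toy setup, a clip whose motion is uneven scores a higher loss than one that matches the reference, even when individual frames look plausible. That is the intuition for why a temporal term can reduce jitter.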

Complete Toolchain

LatentSync provides a comprehensive video processing toolchain:

  • Preprocessing Tools

    • Video frame rate resampling (25 fps)
    • Audio resampling (16,000 Hz)
    • Scene detection and segmentation
    • Face detection and alignment
  • Quality Assurance

    • Face size and count verification
    • Audio-visual sync confidence assessment
    • HyperIQA image quality scoring
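
As a rough illustration of the resampling steps listed above, the following sketch builds an ffmpeg command that converts a clip to 25 fps video and 16,000 Hz audio. It is not taken from the LatentSync repository. The function names and file names are hypothetical, and it assumes ffmpeg is installed and on the PATH.

```python
import shutil
import subprocess
from typing import List

TARGET_FPS = 25                 # video frame rate named in the article
TARGET_SAMPLE_RATE_HZ = 16000   # audio sample rate named in the article


def build_resample_command(src: str, dst: str) -> List[str]:
    """Build an ffmpeg command that resamples both video and audio."""
    return [
        "ffmpeg",
        "-y",                                # overwrite the output if it exists
        "-i", src,                           # input file
        "-r", str(TARGET_FPS),               # output video frame rate
        "-ar", str(TARGET_SAMPLE_RATE_HZ),   # output audio sample rate
        dst,
    ]


def resample(src: str, dst: str) -> None:
    """Run the resampling command; raises if ffmpeg is missing or fails."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    subprocess.run(build_resample_command(src, dst), check=True)


# Hypothetical file names, printed rather than executed.
print(" ".join(build_resample_command("input.mp4", "resampled.mp4")))
```

Keeping the command builder separate from the function that runs it makes the parameters easy to inspect or log before any video is processed.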

Wide Applicability

LatentSync demonstrates excellent versatility:

  • Real Person Videos: Accurately captures and reproduces real human lip movements
  • Animated Characters: Equally applicable to lip syncing for animated characters
  • Low Resource Requirements: Inference requires only about 6.5 GB of VRAM

Open Source and Community

The project is open-sourced on GitHub, providing:

  • Inference code and pre-trained models
  • Complete data processing pipeline
  • Training code and configuration files

Application Prospects

LatentSync’s release brings new possibilities to video production:

  • Video post-production
  • Multilingual dubbing localization
  • Virtual presenter content generation
  • Educational video production

References