OpenMOSS Releases MOVA - Open-Source Synchronized Video and Audio Generation Model

On January 29, 2026, the OpenMOSS team from Shanghai Chuangzhi Academy, in collaboration with MOSI Intelligence, officially released MOVA (MOSS Video and Audio), an end-to-end video and audio generation model. MOVA generates video and audio synchronously in a single inference pass, avoiding the error accumulation of cascaded pipelines and achieving strong performance in lip-sync and environmental sound effects.

Model Positioning

MOVA is a foundation model designed to address the audio gap in open-source video generation. Through end-to-end modality fusion, the model generates high-fidelity video and synchronized audio simultaneously in a single inference process, ensuring tight audio-visual alignment.

Technical Architecture

Asymmetric Dual-Tower Architecture

MOVA adopts an asymmetric dual-tower architecture, fusing pre-trained video and audio towers through a bidirectional cross-attention mechanism. This design enables the model to maintain tight synchronization between video and audio during generation.
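The released repository defines the exact layer layout; as a purely conceptual sketch of how one bidirectional cross-attention step can fuse two token streams (single head, learned projections omitted; all names are illustrative, not MOVA's API):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_tokens, kv_tokens, d):
    """Single-head cross-attention: queries come from one modality,
    keys/values from the other (projection matrices omitted)."""
    scores = q_tokens @ kv_tokens.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ kv_tokens

def bidirectional_fusion(video_tokens, audio_tokens):
    """Each tower attends to the other and adds the result as a
    residual, so the two streams condition each other every layer."""
    d = video_tokens.shape[-1]
    video_out = video_tokens + cross_attention(video_tokens, audio_tokens, d)
    audio_out = audio_tokens + cross_attention(audio_tokens, video_tokens, d)
    return video_out, audio_out

# Toy shapes: 16 video tokens and 32 audio tokens, shared width 64.
v = np.random.randn(16, 64)
a = np.random.randn(32, 64)
v2, a2 = bidirectional_fusion(v, a)
print(v2.shape, a2.shape)  # (16, 64) (32, 64)
```

Note that each modality keeps its own sequence length; only the channel width needs to agree at the fusion points.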

Model Versions

The project open-sources two resolution versions:

  • MOVA-360p: Suitable for fast inference and resource-constrained environments
  • MOVA-720p: Provides higher resolution video generation

Both versions support generating up to 8 seconds of video-audio content.

Core Features

Native Bimodal Generation

MOVA generates high-fidelity video and synchronized audio in a single inference pass, avoiding error accumulation and synchronization issues of traditional cascaded methods.

Precise Lip-Sync

The model demonstrates strong multilingual lip synchronization. On the Verse-Bench Set3 evaluation:

  • LSE-D: 7.094 (with dual CFG enabled)
  • LSE-C: 7.452
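The announcement does not spell out MOVA's dual-CFG formulation; a common way to apply classifier-free guidance over two conditions (the InstructPix2Pix-style composition of two guidance terms, shown here only as a hedged illustration) is:

```python
import numpy as np

def dual_cfg(eps_uncond, eps_c1, eps_c1c2, s1, s2):
    """Two-condition classifier-free guidance: push the denoiser
    output first toward condition 1 (scale s1), then toward the
    joint condition (c1, c2) on top of it (scale s2)."""
    return (eps_uncond
            + s1 * (eps_c1 - eps_uncond)
            + s2 * (eps_c1c2 - eps_c1))

# Toy denoiser outputs for one latent, to make the arithmetic visible.
shape = (4, 8)
eps_u  = np.zeros(shape)          # unconditional prediction
eps_1  = np.ones(shape)           # conditioned on c1 only
eps_12 = 2 * np.ones(shape)       # conditioned on both c1 and c2
out = dual_cfg(eps_u, eps_1, eps_12, s1=3.0, s2=1.5)
print(out[0, 0])  # 0 + 3.0*(1-0) + 1.5*(2-1) = 4.5
```

With s1 = s2 = 1 the formula collapses to the fully conditioned prediction, which is a quick sanity check for any implementation.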

Environment-Aware Sound Effects

The model can generate corresponding environmental sound effects based on video content, including:

  • Physical interaction sounds (such as vehicle engine sounds, wind sounds)
  • Environmental ambient sounds (such as street reverberation, equipment friction sounds)
  • Spatial and textural sound feedback

Performance

Verse-Bench Evaluation

The model was comprehensively evaluated on the Verse-Bench benchmark:

  • Audio-Video Alignment: Evaluated on all subsets
  • Lip-Sync: Evaluated on Set3
  • Speech Quality: Evaluated on Set3
  • ASR Accuracy: Evaluated on multi-speaker subset

Human Evaluation

The project provides Elo scores and win rate data comparing MOVA with existing open-source models.

Inference Performance

For generating an 8-second 360p video, the project reports benchmarks under several offloading strategies, covering peak VRAM usage, host RAM usage, and per-step time.

Actual performance varies with the chosen offloading strategy and hardware configuration.
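Offloading of this kind is typically implemented by keeping only a window of transformer blocks resident in VRAM and paging the rest to host RAM. The sketch below is pure-Python bookkeeping that illustrates the trade-off (it is not MOVA's implementation; all names are hypothetical):

```python
from collections import OrderedDict

class BlockOffloader:
    """Illustrative layer-wise offloading: keep at most `window`
    transformer blocks resident on the GPU, page the rest to host
    RAM, and count how many block moves a strategy costs."""
    def __init__(self, window):
        self.window = window
        self.resident = OrderedDict()  # block id -> "gpu"
        self.transfers = 0             # host <-> GPU block moves

    def fetch(self, block_id):
        if block_id in self.resident:          # already on GPU
            self.resident.move_to_end(block_id)
            return
        if len(self.resident) >= self.window:  # move LRU block to host
            self.resident.popitem(last=False)
            self.transfers += 1
        self.resident[block_id] = "gpu"        # copy block host -> GPU
        self.transfers += 1

# Each denoising step sweeps all blocks in order; a smaller window
# lowers peak VRAM but pays for it with extra transfers per step.
off = BlockOffloader(window=8)
for step in range(2):
    for block in range(40):
        off.fetch(block)
print(len(off.resident), off.transfers)
```

Because a denoising step visits blocks sequentially, a window smaller than the block count misses on every fetch, which is exactly why aggressive offloading trades step time for VRAM.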

LoRA Fine-tuning Support

MOVA provides complete LoRA fine-tuning scripts, supporting multiple training modes:

Training Configurations (360p, 8-second video)

  • Low-resource LoRA: Reduces VRAM requirements
  • Accelerate LoRA: Improves training speed
  • Accelerate + FSDP LoRA: Distributed training support

For each mode, the project reports peak VRAM per GPU, host RAM usage, and step time.
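The LoRA idea behind all three modes is standard: freeze the base weights and train only a low-rank correction. A minimal NumPy sketch (illustrative, not MOVA's training code) shows why this cuts memory:

```python
import numpy as np

def lora_update(W, A, B, alpha):
    """Low-rank adaptation: the frozen weight W is adjusted by a
    rank-r product B @ A, scaled by alpha / r."""
    r = A.shape[0]
    return W + (alpha / r) * (B @ A)

d_out, d_in, r = 1024, 1024, 16
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))    # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))                  # zero-init: training starts at W
W_eff = lora_update(W, A, B, alpha=32)

full_params = d_out * d_in                # what full fine-tuning trains
lora_params = r * (d_in + d_out)          # what LoRA trains
print(lora_params / full_params)          # 0.03125, about 3% of the weights
```

Only A and B receive gradients, so optimizer state shrinks by the same factor; that is the main source of the VRAM savings in the low-resource mode.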

Application Scenarios

MOVA is suitable for the following scenarios:

  • Video-Audio Content Creation: Generate video content with synchronized audio
  • Lip Synchronization: Add precise speech synchronization to videos
  • Sound Effect Generation: Generate environment-aware sound effects for videos
  • Multilingual Dubbing: Support multilingual lip-sync generation

Fully Open-Source

MOVA uses the Apache-2.0 open-source license, fully releasing:

  • Model Weights: Both 360p and 720p versions
  • Inference Code: Complete inference implementation
  • Training Pipeline: End-to-end training process
  • LoRA Fine-tuning Scripts: Support for custom fine-tuning

This full-stack open-source strategy enables the community to collaboratively improve the model and advance video-audio generation technology.

Technical Significance

With leading systems such as Sora 2 and Veo 3 remaining closed-source, MOVA's release fills a gap in open-source video-audio generation foundation models. By releasing complete model weights and training code, MOVA gives the community a foundation for improving and customizing video-audio generation capabilities.