OpenMOSS Releases MOVA - Open-Source Synchronized Video and Audio Generation Model
On January 29, 2026, the OpenMOSS team from Shanghai Chuangzhi Academy, in collaboration with MOSI Intelligence, officially released MOVA (MOSS Video and Audio), an end-to-end video and audio generation model. MOVA generates synchronized video and audio in a single inference pass, avoiding the error accumulation of cascaded pipelines and achieving strong performance in lip-sync and environmental sound effects.
Model Positioning
MOVA is a foundation model designed to close the audio gap in open-source video generation. Through end-to-end modality fusion, it generates high-fidelity video and synchronized audio in a single inference process, keeping the two modalities tightly aligned.
Technical Architecture
Asymmetric Dual-Tower Architecture
MOVA adopts an asymmetric dual-tower architecture, fusing pre-trained video and audio towers through a bidirectional cross-attention mechanism. This design enables the model to maintain tight synchronization between video and audio during generation.
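As an illustration of this fusion pattern, the following is a minimal PyTorch sketch of a bidirectional cross-attention block joining a video tower and an audio tower. All class names, dimensions, and design details here are assumptions for exposition, not MOVA's actual implementation.

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Illustrative fusion block: video tokens attend over audio tokens
    and audio tokens attend over video tokens. Hypothetical sketch, not
    MOVA's actual architecture."""

    def __init__(self, d_video: int, d_audio: int, n_heads: int = 8):
        super().__init__()
        # Video queries attend over audio keys/values (and vice versa);
        # kdim/vdim let the two towers keep different hidden sizes.
        self.v2a = nn.MultiheadAttention(d_video, n_heads, kdim=d_audio,
                                         vdim=d_audio, batch_first=True)
        self.a2v = nn.MultiheadAttention(d_audio, n_heads, kdim=d_video,
                                         vdim=d_video, batch_first=True)
        self.norm_v = nn.LayerNorm(d_video)
        self.norm_a = nn.LayerNorm(d_audio)

    def forward(self, video: torch.Tensor, audio: torch.Tensor):
        # video: (B, T_v, d_video); audio: (B, T_a, d_audio)
        v_ctx, _ = self.v2a(self.norm_v(video), audio, audio)
        a_ctx, _ = self.a2v(self.norm_a(audio), video, video)
        # Residual connections preserve each tower's own features while
        # mixing in cross-modal context.
        return video + v_ctx, audio + a_ctx
```

Interleaving blocks like this throughout both towers is one way pre-trained unimodal backbones can be coupled tightly enough for frame-level audio-video synchronization.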
Model Versions
The project open-sources two resolution versions:
- MOVA-360p: Suitable for fast inference and resource-constrained environments
- MOVA-720p: Provides higher resolution video generation
Both versions support generating up to 8 seconds of video-audio content.
Core Features
Native Bimodal Generation
MOVA generates high-fidelity video and synchronized audio in a single inference pass, avoiding the error accumulation and synchronization problems of traditional cascaded methods.
Precise Lip-Sync
The model demonstrates excellent multilingual lip synchronization. On the Verse-Bench Set3 evaluation, with dual CFG enabled (the guidance scheme is sketched below):
- LSE-D (lip-sync error distance, lower is better): 7.094
- LSE-C (lip-sync confidence, higher is better): 7.452
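The "dual CFG" setting presumably refers to classifier-free guidance applied along two conditioning axes. The sketch below shows one common way such a combination is computed in conditional diffusion models; the nesting order, weight values, and names are assumptions, not MOVA's documented formulation.

```python
import torch

def dual_cfg(eps_uncond: torch.Tensor,
             eps_cond1: torch.Tensor,
             eps_cond12: torch.Tensor,
             w1: float = 5.0,
             w2: float = 3.0) -> torch.Tensor:
    """Hypothetical dual classifier-free guidance.

    eps_uncond: denoiser output with all conditions dropped
    eps_cond1:  output conditioned on the first signal only
    eps_cond12: output conditioned on both signals
    Each weight pushes the prediction further along its own conditioning
    direction; w1 = w2 = 1 recovers the plain fully-conditioned output.
    """
    return (eps_uncond
            + w1 * (eps_cond1 - eps_uncond)
            + w2 * (eps_cond12 - eps_cond1))
```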
Environment-Aware Sound Effects
The model generates environmental sound effects matched to the video content, including:
- Physical interaction sounds (such as vehicle engines and wind)
- Ambient environmental sounds (such as street reverberation and equipment friction)
- Spatial and textural sound feedback
Performance
Verse-Bench Evaluation
The model was comprehensively evaluated on the Verse-Bench benchmark:
- Audio-Video Alignment: Evaluated on all subsets
- Lip-Sync: Evaluated on Set3
- Speech Quality: Evaluated on Set3
- ASR Accuracy: Evaluated on the multi-speaker subset
Human Evaluation
The project reports Elo scores and win rates comparing MOVA against existing open-source models.
Inference Performance
For generating an 8-second 360p clip, the project reports benchmarks under several offloading strategies, covering peak VRAM usage, host RAM usage, and per-step time on the tested hardware. More aggressive offloading trades longer step times for lower VRAM requirements (see the sketch below); actual performance varies with hardware configuration.
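The mechanism behind these strategies is general: weights parked in host RAM must be copied onto the GPU before each use, so peak VRAM drops while step time grows. Below is a generic PyTorch sketch of per-block inference-time offloading using forward hooks; it illustrates the trade-off only and is not MOVA's offloading code.

```python
import torch
import torch.nn as nn

def attach_block_offload(block: nn.Module, device: str = "cuda") -> None:
    """Move a block to the GPU just before its forward pass and back to
    CPU right after (inference only). Generic sketch of the offloading
    trade-off, not MOVA's implementation."""

    def pre_hook(module, args):
        module.to(device)          # pay a host-to-device copy per step

    def post_hook(module, args, output):
        module.to("cpu")           # free this block's VRAM immediately
        torch.cuda.empty_cache()
        return output

    block.register_forward_pre_hook(pre_hook)
    block.register_forward_hook(post_hook)

# Example: offload every block of a hypothetical transformer backbone.
# for blk in model.transformer_blocks:
#     attach_block_offload(blk)
```

More offloaded blocks mean lower peak VRAM but more host-device transfers per step, which is why VRAM, host RAM, and step time are all reported per strategy.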
LoRA Fine-tuning Support
MOVA provides complete LoRA fine-tuning scripts supporting multiple training modes; a minimal sketch of the LoRA mechanism follows the list below.
Training Configurations (360p, 8-second video)
- Low-resource LoRA: Reduces VRAM requirements
- Accelerate LoRA: Improves training speed
- Accelerate + FSDP LoRA: Adds distributed multi-GPU training support
For each mode, the project reports peak VRAM per GPU, host RAM usage, and step time.
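LoRA's resource savings come from freezing the base weights and training only small low-rank adapters, so gradients and optimizer states exist only for the adapter parameters. Here is a minimal, generic LoRA linear layer in PyTorch; the rank and alpha values are illustrative and are not taken from MOVA's scripts.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA adapter: frozen base weight W plus a trainable
    low-rank update (alpha / r) * B(A(x)). Generic sketch, not MOVA's
    fine-tuning code."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False         # only adapters are trained
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # start as an identity update
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```

Only lora_a and lora_b receive gradients, which is what lets the low-resource mode fit in far less VRAM than full fine-tuning.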
Application Scenarios
MOVA is suitable for the following scenarios:
- Video-Audio Content Creation: Generate video content with synchronized audio
- Lip Synchronization: Add precise speech synchronization to videos
- Sound Effect Generation: Generate environment-aware sound effects for videos
- Multilingual Dubbing: Support multilingual lip-sync generation
Fully Open-Source
MOVA is released under the Apache-2.0 license, with all of the following open-sourced:
- Model Weights: Both 360p and 720p versions
- Inference Code: Complete inference implementation
- Training Pipeline: End-to-end training process
- LoRA Fine-tuning Scripts: Support for custom fine-tuning
This full-stack open-source strategy enables the community to collaboratively improve the model and advance video-audio generation technology.
Technical Significance
Against the backdrop of top systems like Sora 2 and Veo 3 moving toward closed source, MOVA's release fills the gap in open-source video-audio generation foundation models. By shipping complete model weights and training code, it gives the community a foundation for improving and customizing video-audio generation capabilities.
Related Links
- GitHub Repository: https://github.com/OpenMOSS/MOVA
- HuggingFace Model: https://huggingface.co/OpenMOSS/MOVA
- Project Homepage: https://openmoss.github.io/MOVA/