
Sand AI Releases MAGI-1: Autoregressive Video Generation at Scale


The Sand AI team officially open-sourced the MAGI-1 video generation model on April 21, with plans to release a 4.5B parameter version by the end of April. MAGI-1 is a world model that predicts sequences of video chunks autoregressively, and it supports Text-to-Video (T2V), Image-to-Video (I2V), and Video-to-Video (V2V) generation.

Technical Innovations

MAGI-1 employs multiple technical innovations that give it unique advantages in the field of video generation:

Transformer-based VAE

  • Uses a Transformer-based variational autoencoder with 8x spatial and 4x temporal compression (see the shape sketch after this list)
  • Features the fastest average decoding time while maintaining high-quality reconstruction
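
For intuition, here is a minimal sketch of what 8x spatial and 4x temporal compression means for tensor shapes. The stride values come from the figures above; the latent channel count is a hypothetical placeholder, and the VAE's Transformer internals are not modeled:

```python
def latent_shape(frames, height, width, t_stride=4, s_stride=8, latent_dim=16):
    """Map a raw video shape to its compressed latent shape.

    t_stride / s_stride reflect the 4x temporal / 8x spatial compression
    described above; latent_dim is a hypothetical channel count, not a
    published MAGI-1 value.
    """
    return (latent_dim, frames // t_stride, height // s_stride, width // s_stride)

# A 96-frame 720p clip compresses to a much smaller latent volume:
print(latent_shape(96, 720, 1280))  # (16, 24, 90, 160)
```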

Autoregressive Denoising Algorithm

MAGI-1 generates videos in an autoregressive, chunk-by-chunk manner rather than all at once. Each chunk (24 frames) is denoised holistically, and the generation of the next chunk begins as soon as the current one reaches a certain level of denoising. This design enables concurrent processing of up to four chunks for efficient video generation.
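
The scheduling idea can be illustrated with a small simulation. Everything below is a hedged sketch: the step count per chunk and the lag before a successor chunk starts are illustrative placeholders, and the actual model call is replaced by a progress counter:

```python
TOTAL_STEPS = 8   # denoising steps per 24-frame chunk (illustrative)
LAG = 2           # steps a chunk completes before its successor may start (assumed)
MAX_ACTIVE = 4    # up to four chunks in flight, per the article

def generate(num_chunks: int) -> None:
    progress = [0] * num_chunks   # denoising steps completed per chunk
    started = 1                   # chunk 0 begins immediately
    while any(p < TOTAL_STEPS for p in progress):
        active = [i for i in range(started) if progress[i] < TOTAL_STEPS]
        for i in active:
            progress[i] += 1      # one denoising step (model call omitted)
        # Admit the next chunk once its predecessor is partially denoised
        # and the concurrency budget allows it.
        if (started < num_chunks and len(active) < MAX_ACTIVE
                and progress[started - 1] >= LAG):
            started += 1
        print(progress)

generate(6)  # later chunks start while earlier ones are still denoising
```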

Autoregressive denoising algorithm

Diffusion Model Architecture

MAGI-1 is built upon the Diffusion Transformer, incorporating several key innovations to enhance training efficiency and stability at scale. These advancements include Block-Causal Attention, Parallel Attention Block, QK-Norm and GQA, Sandwich Normalization in FFN, SwiGLU, and Softcap Modulation.
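
Of these, block-causal attention is the piece most specific to chunk-wise autoregression: tokens attend bidirectionally within their own chunk but only causally to earlier chunks. The mask below is a plausible reading of that idea, not Sand AI's exact implementation:

```python
import torch

def block_causal_mask(num_chunks: int, tokens_per_chunk: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend): full attention inside a
    chunk, causal attention across chunks."""
    n = num_chunks * tokens_per_chunk
    chunk_ids = torch.arange(n) // tokens_per_chunk
    return chunk_ids[:, None] >= chunk_ids[None, :]

print(block_causal_mask(3, 2).int())
# tensor([[1, 1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0, 0],
#         [1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 1, 1],
#         [1, 1, 1, 1, 1, 1]])
```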

Diffusion model architecture

Distillation Algorithm

The model adopts a shortcut distillation approach that trains a single velocity-based model to support variable inference budgets. By enforcing a self-consistency constraint that equates one large step with two smaller steps, the model learns to approximate flow-matching trajectories across multiple step sizes. During training, inference budgets are cyclically sampled from {64, 32, 16, 8} steps, and classifier-free guidance distillation is incorporated to preserve conditional alignment. This enables efficient inference with minimal loss in fidelity.
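
The self-consistency constraint can be written down compactly. The sketch below assumes a velocity model conditioned on step size, in the spirit of shortcut distillation; the `model(x, t, dt)` signature is hypothetical, not Sand AI's training code:

```python
import torch

def self_consistency_loss(model, x, t, dt):
    """One big step of size dt should match the composition of two
    half-steps of size dt/2 (hypothetical velocity net `model(x, t, dt)`)."""
    with torch.no_grad():
        v1 = model(x, t, dt / 2)               # first half-step velocity
        x_mid = x + v1 * (dt / 2)
        v2 = model(x_mid, t + dt / 2, dt / 2)  # second half-step velocity
        v_target = (v1 + v2) / 2               # average velocity over the full step
    v_big = model(x, t, dt)                    # single large step
    return torch.mean((v_big - v_target) ** 2)

# Smoke test with a dummy velocity net that ignores conditioning:
dummy = lambda x, t, dt: -x
print(self_consistency_loss(dummy, torch.randn(2, 4), 0.0, 1.0 / 8))
```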

Model Versions

Sand AI provides pre-trained weights for multiple versions of MAGI-1, including 24B and 4.5B models, as well as corresponding distilled and quantized models:

| Model                        | Recommended Hardware           |
| ---------------------------- | ------------------------------ |
| MAGI-1-24B                   | H100/H800 × 8                  |
| MAGI-1-24B-distill           | H100/H800 × 8                  |
| MAGI-1-24B-distill+fp8_quant | H100/H800 × 4 or RTX 4090 × 8  |
| MAGI-1-4.5B                  | RTX 4090 × 1                   |

Performance Evaluation

Physical Evaluation

Thanks to the natural advantages of its autoregressive architecture, MAGI-1 achieves far higher precision in predicting physical behavior through video continuation on the Physics-IQ benchmark.

In Physics-IQ scoring, MAGI-1's Video-to-Video (V2V) mode achieves 56.02 points and its Image-to-Video (I2V) mode reaches 30.23 points, significantly outperforming other open-source and closed-source commercial models such as VideoPoet, Kling 1.6, and Sora.

How to Run

MAGI-1 can be run either through a Docker environment (recommended) or from source. Users can flexibly control input and output by adjusting the parameters in the run.sh script to meet different requirements (a sample invocation follows the parameter list):

  • --mode: Specifies the operation mode (t2v, i2v, or v2v)
  • --prompt: The text prompt used for video generation
  • --image_path: Path to the image file (used only in i2v mode)
  • --prefix_video_path: Path to the prefix video file (used only in v2v mode)
  • --output_path: Path where the generated video file will be saved
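
A hypothetical image-to-video invocation might look like the following; the flag names come from the list above, while the prompt and file paths are placeholders:

```bash
bash run.sh \
  --mode i2v \
  --prompt "A sailboat drifting across a calm bay at sunset" \
  --image_path ./assets/sailboat.png \
  --output_path ./outputs/sailboat.mp4
```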