StepFun Releases Step-Video-T2V: A 30-Billion-Parameter Text-to-Video Model

On February 17, 2025, StepFun officially released its latest text-to-video model, Step-Video-T2V. The model has 30 billion parameters and can generate high-quality videos of up to 204 frames, making it one of the largest text-to-video models available to the open-source community.

Model Features

Large Parameter Count: With 30 billion parameters, the model generates videos of up to 204 frames
High Compression Ratio: A deep-compression VAE achieves 16x16 spatial compression and 8x temporal compression
Bilingual Support: Built-in text encoders handle both Chinese and English prompts, so Chinese prompts can be used directly
Open-Source License: The model is released under the MIT license, which permits commercial use
Optimization Technique: Direct Preference Optimization (DPO) is applied to improve the quality of generated videos
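To give a sense of what the quoted compression ratios mean in practice, the following minimal sketch computes the size of the latent grid for a given video size. The function names and the round-up rule are assumptions made for illustration; the actual VAE may pad or handle the first frame differently, so treat the result as an estimate rather than the model's exact latent shape.

```python
import math

# Compression factors quoted in the feature list above.
SPATIAL_FACTOR = 16   # applied to both height and width
TEMPORAL_FACTOR = 8   # applied to the frame axis


def latent_grid(height, width, frames):
    """Return the (frames, height, width) size of the latent grid.

    Assumption: sizes that are not exact multiples of the factor are
    rounded up. The real VAE may pad or split frames differently.
    """
    return (
        math.ceil(frames / TEMPORAL_FACTOR),
        math.ceil(height / SPATIAL_FACTOR),
        math.ceil(width / SPATIAL_FACTOR),
    )


def position_reduction():
    """Number of pixel positions that map to one latent position."""
    return SPATIAL_FACTOR * SPATIAL_FACTOR * TEMPORAL_FACTOR


print(latent_grid(544, 992, 204))  # (26, 34, 62) under the round-up assumption
print(position_reduction())        # 2048
```

Under these assumptions, a 544 x 992 video with 204 frames is represented by roughly 2048 times fewer positions in latent space than in pixel space, which is what makes generating long clips tractable.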

Hardware Requirements

StepFun recommends a GPU with 80GB of memory for the best generation quality. The reported memory requirements by output size are:

544px x 992px, 204 frames: requires 77.64GB of memory
544px x 992px, 136 frames: requires 72.48GB of memory
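As a quick planning aid, the figures above can be turned into a small lookup that reports which of the two quoted configurations fit within a given amount of GPU memory. This is a sketch only: the table and the function name `configs_that_fit` are hypothetical, and the check covers just the two data points listed in this article, not other resolutions or frame counts.

```python
# Memory figures quoted above: (height, width, frames) -> GB of GPU memory.
REPORTED_MEMORY_GB = {
    (544, 992, 204): 77.64,
    (544, 992, 136): 72.48,
}


def configs_that_fit(available_gb):
    """Return the quoted configurations that fit, longest video first."""
    fitting = [
        config
        for config, needed_gb in REPORTED_MEMORY_GB.items()
        if needed_gb <= available_gb
    ]
    # Sort by frame count so the longest feasible video comes first.
    return sorted(fitting, key=lambda config: config[2], reverse=True)


for gb in (80.0, 75.0, 48.0):
    print(gb, configs_that_fit(gb))
```

With 80GB available both configurations fit, with 75GB only the 136-frame one does, and with 48GB neither does, which is consistent with the 80GB recommendation.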

Online Demo

Step-Video-T2V is currently available for public trial on the Yuewen Video platform. The platform generates smooth 8-second videos, though requests may have to wait in a queue.

Open-Source Resources

Model Download: Hugging Face
Technical Report: arXiv:2502.10248

The StepFun team stated that the model's code will be integrated into Hugging Face's official Diffusers library and that it will continue to improve the model's performance and user experience. For users who want to deploy the model locally, the team also provides detailed installation and usage documentation.
