
StepFun releases Step-Video-T2V: A 30 Billion Parameter Text-to-Video Model

On February 17, 2025, StepFun released its latest text-to-video model, Step-Video-T2V. The model has 30 billion parameters and can generate videos of up to 204 frames, which makes it one of the largest text-to-video models in the open-source community.

Model Features

  • Large-Scale Parameters: With 30 billion parameters, it can generate videos of up to 204 frames
  • High Compression Ratio: It uses a deep compression VAE to achieve 16x16 spatial compression and 8x temporal compression
  • Bilingual Support: It has built-in Chinese and English text encoders, so prompts can be written directly in either language
  • Open Source License: It is released under the MIT license, which permits commercial use
  • Optimization Technology: It uses Direct Preference Optimization (DPO) to improve the quality of generated videos
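To make the compression figures above concrete, here is a minimal arithmetic sketch. It only applies the quoted 16x16 spatial and 8x temporal factors to a video size. The function names are hypothetical, and rounding up on each axis is an assumption: the real VAE may pad or treat the first frame differently, and the latent channel count is not covered here.

```python
import math

# Compression factors quoted in the feature list above.
SPATIAL_FACTOR = 16   # applied to both height and width
TEMPORAL_FACTOR = 8   # applied to the frame axis


def latent_grid(height, width, frames):
    """Approximate latent grid as (frames, height, width).

    Rounding up on each axis is an assumption made for illustration.
    """
    return (
        math.ceil(frames / TEMPORAL_FACTOR),
        math.ceil(height / SPATIAL_FACTOR),
        math.ceil(width / SPATIAL_FACTOR),
    )


def position_reduction():
    """Pixel positions represented by one latent position."""
    return SPATIAL_FACTOR * SPATIAL_FACTOR * TEMPORAL_FACTOR


print(latent_grid(544, 992, 204))
print(position_reduction())
```

Under these assumptions, each latent position stands in for 2048 pixel positions, which is why such a long clip can be processed at all.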

Hardware Requirements

The official recommendation is to run the model on a GPU with 80GB of memory for the best generation quality. The listed memory requirements are as follows:

  • 544px × 992px × 204 frames: Requires 77.64GB of GPU memory
  • 544px × 992px × 136 frames: Requires 72.48GB of GPU memory
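As a rough planning aid, the figures listed above can be turned into a small lookup. This is only a sketch: the table holds just the two configurations quoted here, the helper name `configs_that_fit` is hypothetical, and the memory actually usable on a given card is usually a little below its nominal capacity.

```python
# Memory figures quoted above: (height, width, frames) -> required GB.
MEMORY_GB = {
    (544, 992, 204): 77.64,
    (544, 992, 136): 72.48,
}


def configs_that_fit(available_gb):
    """Return the listed configurations that fit in available_gb, longest first."""
    fitting = [cfg for cfg, need in MEMORY_GB.items() if need <= available_gb]
    return sorted(fitting, key=lambda cfg: cfg[2], reverse=True)


print(configs_that_fit(80.0))  # both listed configurations fit
print(configs_that_fit(75.0))  # only the 136-frame configuration fits
print(configs_that_fit(48.0))  # neither listed configuration fits
```

An empty result does not prove the model cannot run on a smaller card. It only means that neither of the two listed configurations fits as stated.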

Online Experience

Step-Video-T2V is currently available for public trial on the Yuewen Video Platform. The platform can generate smooth videos up to 8 seconds long, though users may have to wait in a queue.

Open Source Release

The StepFun team stated that the model's code will be integrated into Hugging Face's official Diffusers library, and that they will continue to optimize the model's performance and user experience. For users who want to deploy the model locally, the team also provides detailed installation and usage documentation.