title: “BAGEL: ByteDance Open Sources a Unified Multimodal Foundation Model for Text, Image, and Video Understanding and Generation”
description: “ByteDance releases BAGEL, an open-source multimodal foundation model with 7B active parameters, supporting understanding and generation across text, image, and video, and achieving strong results on public benchmarks.”
tag: open-source, bytedance
date: 2025-05-22

BAGEL: ByteDance Open Sources a Unified Multimodal Foundation Model for Text, Image, and Video Understanding and Generation

BAGEL is a unified multimodal foundation model open-sourced by ByteDance, with 7B active parameters (14B in total). It can process and generate text, images, and videos within a single model. BAGEL reports leading results among open-source models on major public benchmarks and supports high-quality text-to-image generation, advanced image editing, and world modeling.

Key Features

  • Unified Multimodal Modeling: BAGEL accepts text, image, and video inputs together and can output text, images, or a combination of both. It suits multi-turn dialogue, image generation, and video understanding.
  • Powerful Generation and Editing: Supports high-fidelity image and video frame generation, advanced image editing (such as style transfer, 3D-animation style, and plush-toy style), and flexible visual manipulation.
  • World Modeling and Navigation: Trained on large-scale video and web data, BAGEL learns dynamic knowledge of the real world, supporting multi-view synthesis and world navigation tasks.
  • Multi-turn Interaction and Reasoning: Enables multi-turn multimodal dialogue and features Chain-of-Thought (CoT) reasoning, turning short prompts into detailed, logically consistent outputs.

Technical Architecture

BAGEL adopts a Mixture-of-Transformer-Experts (MoT) architecture and uses two independent visual encoders to capture pixel-level and semantic-level image features. The overall framework follows a “next group of token prediction” paradigm, in which the model learns to predict the next group of language or visual tokens. Training proceeds through pretraining, continued training, and supervised finetuning on large-scale interleaved multimodal data, which gives the model strong understanding and generation capabilities.
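The routing idea behind a mixture of transformer experts can be illustrated with a toy sketch. This is not BAGEL's implementation: the token kinds `"und"` and `"gen"`, the `make_expert` helper, and the scalar "experts" are hypothetical stand-ins, and the sketch assumes each token is processed by exactly one expert selected by its type. Under that assumption, only one expert's weights are used per token, which is one way a model can have fewer active parameters (7B) than total parameters (14B).

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical token representation: (kind, value).
# "und" stands for understanding-side tokens and "gen" for generation-side
# tokens; this split is an assumption made for the sketch.
Token = Tuple[str, float]
Expert = Callable[[float], float]


def make_expert(scale: float, bias: float) -> Expert:
    """Return a toy 'expert' standing in for a full transformer block."""
    def expert(x: float) -> float:
        return scale * x + bias
    return expert


def mot_layer(tokens: List[Token], experts: Dict[str, Expert]) -> List[float]:
    """Route each token to exactly one expert, chosen by its kind."""
    return [experts[kind](value) for kind, value in tokens]


experts = {
    "und": make_expert(2.0, 0.0),  # toy understanding expert
    "gen": make_expert(1.0, 1.0),  # toy generation expert
}
sequence = [("und", 1.0), ("gen", 1.0), ("und", 3.0)]
print(mot_layer(sequence, experts))  # [2.0, 2.0, 6.0]
```

In a real model the experts would be full transformer parameter sets operating on one shared token sequence; the sketch only shows the per-token routing step.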

  • Visual Understanding: Uses a ViT encoder to convert images into tokens that carry semantic-level features for visual understanding.
  • Visual Generation: Integrates the variational autoencoder (VAE) from FLUX.1-schnell to provide pixel-level features for high-quality image generation.
  • Generalized Causal Attention: Lets text and visual tokens interact efficiently within a single sequence, improving context consistency in reasoning and generation.
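One plausible reading of generalized causal attention is a block-wise mask over an interleaved sequence: every token sees all earlier groups, text tokens attend causally within their own group, and image tokens attend bidirectionally within theirs. The sketch below builds such a mask in plain Python. The within-group rules, the group labels, and the function name `build_group_causal_mask` are assumptions for illustration, not details confirmed by this article.

```python
from typing import List, Tuple


def build_group_causal_mask(groups: List[Tuple[str, int]]) -> List[List[bool]]:
    """Build a boolean attention mask; mask[q][k] is True if q may attend to k.

    groups: consecutive (kind, length) spans, kind is "text" or "image".
    Assumed rules: earlier groups are fully visible, text is causal inside
    its own group, image tokens see their whole group.
    """
    group_of: List[int] = []
    kind_of: List[str] = []
    for index, (kind, length) in enumerate(groups):
        group_of.extend([index] * length)
        kind_of.extend([kind] * length)

    n = len(group_of)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if group_of[k] < group_of[q]:
                mask[q][k] = True        # earlier group: always visible
            elif group_of[k] == group_of[q]:
                if kind_of[q] == "image":
                    mask[q][k] = True    # bidirectional inside an image group
                else:
                    mask[q][k] = k <= q  # causal inside a text group
    return mask


# Two text tokens, then a two-token image, then one more text token.
mask = build_group_causal_mask([("text", 2), ("image", 2), ("text", 1)])
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Each printed row is one query token; a `1` marks a key position it may attend to. No token ever attends to a later group, so the sequence stays causal at the group level.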

Performance

BAGEL demonstrates strong results on public benchmarks:

  • Visual Understanding: Outperforms similar open-source models on MME, MMBench, MM-Vet, MathVista, and other benchmarks.
  • Text-to-Image Generation: Achieves an overall GenEval score of 0.88, surpassing FLUX.1-dev, SD3-Medium, and Janus-Pro-7B.
  • Image Editing: Excels on GEdit-Bench-EN and IntelligentBench, with higher semantic consistency (SC) and perceptual quality (PQ) scores than Step1X-Edit.
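
| Task | Metric/Benchmark | BAGEL Score | Comparison Model |
|---|---|---|---|
| Visual Understanding | MME | 2388 | Qwen2.5-VL-7B: 2347 |
| Visual Understanding | MMBench | 85.0 | Janus-Pro-7B: 79.2 |
| Visual Understanding | MM-Vet | 67.2 | Qwen2.5-VL-7B: 67.1 |
| Text-to-Image | GenEval | 0.88 | FLUX.1-dev: 0.82 |
| Image Editing | GEdit-Bench-EN SC | 7.36 | Step1X-Edit: 7.09 |
| Image Editing | IntelligentBench | 44.0 | Step1X-Edit: 14.9 |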
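
Because the benchmarks use different scales, the gaps are easier to compare as relative margins. The short script below is a convenience sketch, not part of the BAGEL release: it copies the scores from the table above and computes the absolute and percentage difference for each row. The `margin` helper is a name introduced here, and the script assumes higher is better on every benchmark listed.

```python
from typing import List, Tuple

# (benchmark, BAGEL score, comparison model, comparison score), from the table.
ROWS: List[Tuple[str, float, str, float]] = [
    ("MME", 2388, "Qwen2.5-VL-7B", 2347),
    ("MMBench", 85.0, "Janus-Pro-7B", 79.2),
    ("MM-Vet", 67.2, "Qwen2.5-VL-7B", 67.1),
    ("GenEval", 0.88, "FLUX.1-dev", 0.82),
    ("GEdit-Bench-EN SC", 7.36, "Step1X-Edit", 7.09),
    ("IntelligentBench", 44.0, "Step1X-Edit", 14.9),
]


def margin(bagel: float, baseline: float) -> Tuple[float, float]:
    """Return (absolute difference, difference as a percent of the baseline)."""
    diff = bagel - baseline
    return diff, 100.0 * diff / baseline


for name, bagel, other, baseline in ROWS:
    diff, pct = margin(bagel, baseline)
    print(f"{name:<18} +{diff:.2f} ({pct:.1f}%) vs {other}")
```

The relative view makes it clear that the MM-Vet gap is marginal while the IntelligentBench gap is large.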

Emerging Abilities

As pretraining scales up, BAGEL's abilities emerge in stages: multimodal understanding and generation appear early, basic image editing appears mid-stage, and complex intelligent editing, flexible visual manipulation, and world modeling appear later. Ablation studies show that combining VAE and ViT features significantly improves intelligent editing, highlighting the importance of visual-semantic context for advanced multimodal reasoning.

Application Scenarios

  • AI image generation and editing
  • Multimodal dialogue and Q&A
  • Video understanding and world modeling
  • Cross-modal content creation and assistance

Open Source and License

BAGEL is open-sourced under the Apache 2.0 license. Model weights, code, and documentation are available via the links below. The model builds on and is finetuned from Qwen2.5-7B-Instruct, siglip-so400m-14-384-flash-attn2, and the FLUX.1-schnell VAE.


Sources