title: "BAGEL: ByteDance Open Sources a Unified Multimodal Foundation Model for Text, Image, and Video Understanding and Generation"
description: "ByteDance releases BAGEL, an open-source multimodal foundation model with 7B active parameters, supporting understanding and generation across text, image, and video, and achieving strong results on public benchmarks."
tag: open-source, bytedance
date: 2025-05-22
BAGEL: ByteDance Open Sources a Unified Multimodal Foundation Model for Text, Image, and Video Understanding and Generation
BAGEL is a unified multimodal foundation model open-sourced by ByteDance, featuring 7B active parameters (14B in total). It can process and generate text, images, and videos, enabling comprehensive multimodal understanding and creation. BAGEL achieves leading results on major public benchmarks and supports high-quality text-to-image generation, advanced image editing, and world modeling capabilities.
Key Features
- Unified Multimodal Modeling: BAGEL can handle text, image, and video inputs simultaneously, and outputs can be text, images, or a combination. It is suitable for multi-turn dialogue, image generation, and video understanding scenarios.
- Powerful Generation and Editing: Supports high-fidelity image and video frame generation, advanced image editing (such as style transfer, 3D animation, plush toy style), and flexible visual manipulation.
- World Modeling and Navigation: Trained on large-scale video and web data, BAGEL learns dynamic knowledge of the real world, supporting multi-view synthesis and world navigation tasks.
- Multi-turn Interaction and Reasoning: Enables multi-turn multimodal dialogue and features Chain-of-Thought (CoT) reasoning, turning short prompts into detailed, logically consistent outputs.
Technical Architecture
BAGEL adopts a Mixture-of-Transformer-Experts (MoT) architecture, combining two independent visual encoders to capture pixel-level and semantic-level features. The overall framework is based on a “next group of token prediction” paradigm, with pretraining, continued training, and supervised finetuning on large-scale interleaved multimodal data, resulting in strong understanding and generation capabilities.
- Visual Understanding: Uses a ViT encoder to convert images into tokens, enhancing visual content understanding.
- Visual Generation: Integrates the FLUX.1-schnell variational autoencoder (VAE) for high-quality image generation.
- Generalized Causal Attention: Efficiently interacts with multimodal tokens, improving context consistency in reasoning and generation.
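To make the "next group of token prediction" idea concrete, the sketch below builds a toy generalized-causal attention mask: tokens attend causally across groups, while tokens inside the same group (e.g. all patches of one image) attend to each other bidirectionally. The function name and the example group layout are hypothetical illustrations, not BAGEL's actual implementation.

```python
# Hypothetical sketch of a "generalized causal" attention mask:
# attention is causal ACROSS groups but bidirectional WITHIN a group,
# so all tokens of one image can see each other while still only
# seeing earlier context. Pure-Python for clarity; a real model would
# build this as a boolean tensor.

def generalized_causal_mask(group_ids):
    """Return an n x n mask: mask[i][j] is True if token i may attend to j.

    group_ids[k] is the (non-decreasing) group index of token k.
    Token i sees token j iff j's group is not later than i's group.
    """
    n = len(group_ids)
    return [
        [group_ids[j] <= group_ids[i] for j in range(n)]
        for i in range(n)
    ]

# Example: two text tokens (groups 0 and 1) followed by a 3-token image group (2)
mask = generalized_causal_mask([0, 1, 2, 2, 2])
```

Within the image group, every token sees every other (mask rows 2–4 are all True for columns 2–4), while the text tokens remain strictly causal with respect to each other.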
Performance
BAGEL demonstrates strong results on public benchmarks:
- Visual Understanding: Outperforms similar open-source models on MME, MMBench, MM-Vet, MathVista, and other benchmarks.
- Text-to-Image Generation: Achieves an overall GenEval score of 0.88, surpassing FLUX-1-dev, SD3-Medium, and Janus-Pro-7B.
- Image Editing: Excels on GEdit-Bench-EN and IntelligentBench, scoring higher than mainstream models on semantic consistency (SC) and perceptual quality.
| Task | Metric/Benchmark | BAGEL Score | Comparison Model |
|---|---|---|---|
| Visual Understanding | MME | 2388 | Qwen2.5-VL-7B: 2347 |
| Visual Understanding | MMBench | 85.0 | Janus-Pro-7B: 79.2 |
| Visual Understanding | MM-Vet | 67.2 | Qwen2.5-VL-7B: 67.1 |
| Text-to-Image | GenEval | 0.88 | FLUX-1-dev: 0.82 |
| Image Editing | GEdit-Bench-EN (SC) | 7.36 | Step1X-Edit: 7.09 |
| Image Editing | IntelligentBench | 44.0 | Step1X-Edit: 14.9 |
Emerging Abilities
As pretraining scales up, BAGEL exhibits staged emergence of abilities: early multimodal understanding and generation, mid-stage basic image editing, and later complex intelligent editing, flexible visual manipulation, and world modeling. Studies show that combining VAE and ViT features significantly enhances intelligent editing, highlighting the importance of visual-semantic context for advanced multimodal reasoning.
Application Scenarios
- AI image generation and editing
- Multimodal dialogue and Q&A
- Video understanding and world modeling
- Cross-modal content creation and assistance
Open Source and License
BAGEL is open-sourced under the Apache 2.0 license. Model weights, code, and documentation are available via the links below. The model builds on Qwen2.5-7B-Instruct (language backbone), siglip-so400m-14-384-flash-attn2 (vision encoder), and the FLUX.1-schnell VAE (image generation).
Related Links
- BAGEL Official Website
- BAGEL Paper (arXiv)
- BAGEL GitHub Repository
- Hugging Face Model Page
- BAGEL Online Demo
Sources