title: "BAGEL: ByteDance Open Sources a Unified Multimodal Foundation Model for Text, Image, and Video Understanding and Generation"
description: "ByteDance releases BAGEL, an open-source multimodal foundation model with 7B active parameters, supporting understanding and generation across text, image, and video, and achieving strong results on public benchmarks."
tag: open-source, bytedance
date: 2025-05-22
BAGEL: ByteDance Open Sources a Unified Multimodal Foundation Model for Text, Image, and Video Understanding and Generation
BAGEL is a unified multimodal foundation model open-sourced by ByteDance, featuring 7B active parameters (14B in total). It can process and generate text, images, and videos, enabling comprehensive multimodal understanding and creation. BAGEL achieves leading results on major public benchmarks and supports high-quality text-to-image generation, advanced image editing, and world modeling capabilities.
Key Features
- Unified Multimodal Modeling: BAGEL can handle text, image, and video inputs simultaneously, and outputs can be text, images, or a combination. It is suitable for multi-turn dialogue, image generation, and video understanding scenarios.
- Powerful Generation and Editing: Supports high-fidelity image and video frame generation, advanced image editing (such as style transfer, 3D animation, plush toy style), and flexible visual manipulation.
- World Modeling and Navigation: Trained on large-scale video and web data, BAGEL learns dynamic knowledge of the real world, supporting multi-view synthesis and world navigation tasks.
- Multi-turn Interaction and Reasoning: Enables multi-turn multimodal dialogue and features Chain-of-Thought (CoT) reasoning, turning short prompts into detailed, logically consistent outputs.
Technical Architecture
BAGEL adopts a Mixture-of-Transformer-Experts (MoT) architecture, combining two independent visual encoders to capture pixel-level and semantic-level features. The overall framework is based on a “next group of token prediction” paradigm, with pretraining, continued training, and supervised finetuning on large-scale interleaved multimodal data, resulting in strong understanding and generation capabilities.
- Visual Understanding: Uses a ViT encoder to convert images into tokens, enhancing visual content understanding.
- Visual Generation: Integrates the FLUX.1-schnell variational autoencoder (VAE) for high-quality image generation.
- Generalized Causal Attention: Efficiently interacts with multimodal tokens, improving context consistency in reasoning and generation.
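To make the "next group of token prediction" idea concrete, the sketch below builds a toy generalized-causal attention mask: tokens attend causally across groups, while tokens inside the same group (e.g. all patches of one image) attend to each other bidirectionally. The function name and the example group layout are hypothetical illustrations, not BAGEL's actual implementation.

```python
# Hypothetical sketch of a "generalized causal" attention mask:
# attention is causal ACROSS groups but bidirectional WITHIN a group,
# so all tokens of one image can see each other while still only
# seeing earlier context. Pure-Python for clarity; a real model would
# build this as a boolean tensor.

def generalized_causal_mask(group_ids):
    """Return an n x n mask: mask[i][j] is True if token i may attend to j.

    group_ids[k] is the (non-decreasing) group index of token k.
    Token i sees token j iff j's group is not later than i's group.
    """
    n = len(group_ids)
    return [
        [group_ids[j] <= group_ids[i] for j in range(n)]
        for i in range(n)
    ]

# Example: two text tokens (groups 0 and 1) followed by a 3-token image group (2)
mask = generalized_causal_mask([0, 1, 2, 2, 2])
```

Within the image group, every token sees every other (mask rows 2–4 are all True for columns 2–4), while the text tokens remain strictly causal with respect to each other.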
Performance
BAGEL demonstrates strong results on public benchmarks:
- Visual Understanding: Outperforms similar open-source models on MME, MMBench, MM-Vet, MathVista, and other benchmarks.
- Text-to-Image Generation: Achieves an overall GenEval score of 0.88, surpassing FLUX-1-dev, SD3-Medium, and Janus-Pro-7B.
- Image Editing: Excels on GEdit-Bench-EN and IntelligentBench, scoring higher than mainstream models on semantic consistency (SC) and perceptual quality.
| Task | Metric/Benchmark | BAGEL Score | Comparison Model |
|---|---|---|---|
| Visual Understanding | MME | 2388 | Qwen2.5-VL-7B: 2347 |
| Visual Understanding | MMBench | 85.0 | Janus-Pro-7B: 79.2 |
| Visual Understanding | MM-Vet | 67.2 | Qwen2.5-VL-7B: 67.1 |
| Text-to-Image | GenEval | 0.88 | FLUX-1-dev: 0.82 |
| Image Editing | GEdit-Bench-EN (SC) | 7.36 | Step1X-Edit: 7.09 |
| Image Editing | IntelligentBench | 44.0 | Step1X-Edit: 14.9 |
Emerging Abilities
As pretraining scales up, BAGEL exhibits staged emergence of abilities: early multimodal understanding and generation, mid-stage basic image editing, and later complex intelligent editing, flexible visual manipulation, and world modeling. Studies show that combining VAE and ViT features significantly enhances intelligent editing, highlighting the importance of visual-semantic context for advanced multimodal reasoning.
Application Scenarios
- AI image generation and editing
- Multimodal dialogue and Q&A
- Video understanding and world modeling
- Cross-modal content creation and assistance
Open Source and License
BAGEL is open-sourced under the Apache 2.0 license. Model weights, code, and documentation are available via the links below. The model builds on Qwen2.5-7B-Instruct (language backbone), siglip-so400m-14-384-flash-attn2 (vision encoder), and the FLUX.1-schnell VAE (image generation).
Related Links
- BAGEL Official Website
- BAGEL Paper (arXiv)
- BAGEL GitHub Repository
- Hugging Face Model Page
- BAGEL Online Demo
Sources