OmniGen2 Released: Unified Image Understanding and Generation Model with Natural Language Instructions
The VectorSpaceLab team has officially released OmniGen2, a powerful multimodal image generation model. Unlike its predecessor OmniGen v1, OmniGen2 uses separate decoding pathways for the text and image modalities, with independent parameters and a decoupled image tokenizer, and this design yields significant performance improvements in image editing.
Core Features
OmniGen2 possesses four core capabilities, with particular excellence in image editing:
Natural Language Instruction-Guided Image Editing
The highlight of OmniGen2 is precise, localized image editing driven by natural language instructions. Users simply describe the desired modification, and the model accurately carries out a wide range of editing tasks (a code sketch follows the list below):
- Clothing modification: Such as “Change the dress to blue”
- Action adjustment: Such as “Raise the hand”, “Make him smile”
- Background processing: Such as “Change the background to classroom”
- Object addition: Such as “Add a fisherman hat to the woman’s head”
- Object replacement: Such as “Replace the sword with a hammer”
- Object removal: Such as “Remove the cat”
- Style conversion: Such as “Generate an anime-style figurine based on the original image”
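The sketch below shows how such an edit might be issued in Python. It assumes a diffusers-style pipeline interface; the import path, the OmniGen2Pipeline class name, and the prompt/input_images arguments follow the pattern shown in the project repository but are not confirmed here, so consult the GitHub README for the exact API.

```python
import torch
from PIL import Image
# Assumption: the project exposes a diffusers-style pipeline class under this path;
# check the GitHub README for the exact import and argument names.
from omnigen2.pipelines.omnigen2.pipeline_omnigen2 import OmniGen2Pipeline

# Load the released weights from Hugging Face (roughly 17 GB of VRAM in bf16).
pipe = OmniGen2Pipeline.from_pretrained(
    "OmniGen2/OmniGen2", torch_dtype=torch.bfloat16
).to("cuda")

# Natural-language editing: pass the source image plus an instruction.
source = Image.open("woman.png").convert("RGB")
result = pipe(
    prompt="Add a fisherman hat to the woman's head",
    input_images=[source],       # image(s) to edit or use as reference (assumed argument)
    num_inference_steps=50,
).images[0]                      # assumes a diffusers-style .images output
result.save("woman_with_hat.png")
```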
Text-to-Image Generation
The model can generate high-quality, aesthetically pleasing images based on textual descriptions, supporting various creative scenarios.
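Text-to-image generation reuses the same pipeline with no input image. Continuing the hedged sketch above (the width/height and step arguments remain assumptions):

```python
# Continuing the sketch above: omit input_images for pure text-to-image generation.
image = pipe(
    prompt="A cozy reading nook by a rainy window, warm lamplight, watercolor style",
    width=1024,
    height=1024,
    num_inference_steps=50,
).images[0]
image.save("reading_nook.png")
```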
In-Context Generation
OmniGen2 can process and flexibly combine diverse inputs, including people, reference objects, and scenes, into novel and coherent visual outputs.
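Continuing the same hedged sketch, in-context generation might look like the following: several reference images are passed together, and the instruction refers to them in order. The input_images argument is still an assumption.

```python
# Continuing the sketch above: combine multiple reference images in one call.
person = Image.open("person.png").convert("RGB")
dog = Image.open("dog.png").convert("RGB")
scene = pipe(
    prompt="The man from the first image walks the dog from the second image along a beach at sunset",
    input_images=[person, dog],   # references are combined into one new scene
    num_inference_steps=50,
).images[0]
scene.save("beach_walk.png")
```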
Visual Understanding
OmniGen2 inherits robust visual understanding from its Qwen2.5-VL foundation and can interpret and analyze image content.
Technical Advantages
OmniGen2 achieves state-of-the-art performance in image editing among open-source models, with the following advantages:
- More precise editing control: Fine-grained image modifications through natural language instructions
- High resource efficiency: Provides CPU offload options, supporting devices with limited VRAM
- Multi-language support: Accepts prompts in multiple languages, although English performs best
- Easy to use: Provides simple API interfaces and online demonstrations
System Requirements and Usage
OmniGen2 natively requires an NVIDIA RTX 3090 or an equivalent GPU with approximately 17 GB of VRAM. For devices with less VRAM, CPU offload can be enabled to run the model.
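As a rough illustration, enabling offload could look like the sketch below. It assumes the pipeline supports the standard diffusers offload hooks; this is an assumption, and the repository's own scripts should be checked for the supported options.

```python
# For GPUs with less than ~17 GB of VRAM, trade speed for memory by keeping
# idle sub-modules on the CPU. Assumption: the pipeline exposes the standard
# diffusers offload hooks.
pipe = OmniGen2Pipeline.from_pretrained(
    "OmniGen2/OmniGen2", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()        # moderate memory savings, moderate slowdown
# pipe.enable_sequential_cpu_offload() # maximum savings, slowest option
```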
The model supports multiple usage methods:
- Command-line tools
- Gradio web interface
- Jupyter notebooks
- Online demonstration platforms
Usage Recommendations
For optimal results, users are advised to:
- Use high-quality images: Provide clear images, preferably with resolution greater than 512×512 pixels
- Detailed instruction descriptions: Clearly describe what to modify and the expected results
- Use English prompts: The model performs best with English prompts
- Adjust parameter settings: Tune the text guidance strength and image guidance strength according to the task type (see the sketch after this list)
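A hedged example of tuning these two strengths, continuing the earlier sketch. The parameter names text_guidance_scale and image_guidance_scale are assumptions mirroring the settings described above, not confirmed API names.

```python
# Assumption: the pipeline exposes separate guidance strengths for the text
# instruction and the input image; exact names may differ in the released API.
edited = pipe(
    prompt="Change the dress to blue",
    input_images=[source],
    text_guidance_scale=5.0,    # higher = follow the instruction more strictly
    image_guidance_scale=2.0,   # higher = stay closer to the input image
    num_inference_steps=50,
).images[0]
```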
Technical Limitations
The current version has some limitations:
- The model may sometimes not fully follow instructions; generating multiple images for selection is recommended
- It cannot automatically determine the output image size and defaults to 1024×1024
- When processing multiple input images, the output size must be set manually to match the editing target (see the sketch after this list)
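A short continuation of the earlier sketch showing the manual size setting; the width and height arguments are assumed to exist as in the text-to-image example above.

```python
# The model does not infer the output resolution, so set it explicitly to match
# the image being edited (same assumed width/height arguments as above).
w, h = source.size
edited = pipe(
    prompt="Remove the cat",
    input_images=[source],
    width=w,
    height=h,
).images[0]
```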
Related Links
- Project Homepage: https://vectorspacelab.github.io/OmniGen2
- GitHub Repository: https://github.com/VectorSpaceLab/OmniGen2
- Model Download: https://huggingface.co/OmniGen2/OmniGen2
- Online Demo: https://huggingface.co/spaces/OmniGen2/OmniGen2
- Technical Paper: https://arxiv.org/abs/2506.18871
As an open-source project, OmniGen2 gives researchers and developers a powerful and efficient foundation for exploring controllable and personalized generative AI. The team states that it will also release training code and datasets to further support the community.