OmniGen2 Released: Unified Image Understanding and Generation Model with Natural Language Instructions
The VectorSpaceLab team has officially released OmniGen2, a powerful multimodal image generation model. Unlike its predecessor OmniGen v1, OmniGen2 uses separate decoding pathways for the text and image modalities, with independent parameters and a decoupled image tokenizer, and this design yields significant performance improvements in image editing.
Core Features
OmniGen2 possesses four core capabilities, with particular excellence in image editing:
Natural Language Instruction-Guided Image Editing
The highlight of OmniGen2 is precise, localized image editing driven by natural language instructions. Users simply describe the desired modification, and the model accurately carries out a wide range of editing tasks (a code sketch follows the list below):
- Clothing modification: Such as “Change the dress to blue”
- Action adjustment: Such as “Raise the hand”, “Make him smile”
- Background processing: Such as “Change the background to classroom”
- Object addition: Such as “Add a fisherman hat to the woman’s head”
- Object replacement: Such as “Replace the sword with a hammer”
- Object removal: Such as “Remove the cat”
- Style conversion: Such as “Generate an anime-style figurine based on the original image”
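The sketch below shows how such an edit might be issued in Python. It assumes a diffusers-style pipeline interface; the import path, the OmniGen2Pipeline class name, and the prompt/input_images arguments follow the pattern shown in the project repository but are not confirmed here, so consult the GitHub README for the exact API.

```python
import torch
from PIL import Image
# Assumption: the project exposes a diffusers-style pipeline class under this path;
# check the GitHub README for the exact import and argument names.
from omnigen2.pipelines.omnigen2.pipeline_omnigen2 import OmniGen2Pipeline

# Load the released weights from Hugging Face (roughly 17 GB of VRAM in bf16).
pipe = OmniGen2Pipeline.from_pretrained(
    "OmniGen2/OmniGen2", torch_dtype=torch.bfloat16
).to("cuda")

# Natural-language editing: pass the source image plus an instruction.
source = Image.open("woman.png").convert("RGB")
result = pipe(
    prompt="Add a fisherman hat to the woman's head",
    input_images=[source],       # image(s) to edit or use as reference (assumed argument)
    num_inference_steps=50,
).images[0]                      # assumes a diffusers-style .images output
result.save("woman_with_hat.png")
```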
Text-to-Image Generation
The model can generate high-quality, aesthetically pleasing images based on textual descriptions, supporting various creative scenarios.
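Text-to-image generation reuses the same pipeline with no input image. Continuing the hedged sketch above (the width/height and step arguments remain assumptions):

```python
# Continuing the sketch above: omit input_images for pure text-to-image generation.
image = pipe(
    prompt="A cozy reading nook by a rainy window, warm lamplight, watercolor style",
    width=1024,
    height=1024,
    num_inference_steps=50,
).images[0]
image.save("reading_nook.png")
```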
In-Context Generation
OmniGen2 can process and flexibly combine diverse inputs, including people, reference objects, and scenes, into novel and coherent visual outputs.
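Continuing the same hedged sketch, in-context generation might look like the following: several reference images are passed together, and the instruction refers to them in order. The input_images argument is still an assumption.

```python
# Continuing the sketch above: combine multiple reference images in one call.
person = Image.open("person.png").convert("RGB")
dog = Image.open("dog.png").convert("RGB")
scene = pipe(
    prompt="The man from the first image walks the dog from the second image along a beach at sunset",
    input_images=[person, dog],   # references are combined into one new scene
    num_inference_steps=50,
).images[0]
scene.save("beach_walk.png")
```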
Visual Understanding
OmniGen2 inherits robust visual understanding from its Qwen2.5-VL foundation and can interpret and analyze image content.
Technical Advantages
OmniGen2 achieves state-of-the-art performance in image editing among open-source models, with the following advantages:
- More precise editing control: Fine-grained image modifications through natural language instructions
- High resource efficiency: Provides CPU offload options, supporting devices with limited VRAM
- Multi-language support: Accepts prompts in multiple languages, although English performs best
- Easy to use: Provides simple API interfaces and online demonstrations
System Requirements and Usage
OmniGen2 natively requires an NVIDIA RTX 3090 or an equivalent GPU with approximately 17 GB of VRAM. For devices with less VRAM, CPU offload can be enabled to run the model.
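As a rough illustration, enabling offload could look like the sketch below. It assumes the pipeline supports the standard diffusers offload hooks; this is an assumption, and the repository's own scripts should be checked for the supported options.

```python
# For GPUs with less than ~17 GB of VRAM, trade speed for memory by keeping
# idle sub-modules on the CPU. Assumption: the pipeline exposes the standard
# diffusers offload hooks.
pipe = OmniGen2Pipeline.from_pretrained(
    "OmniGen2/OmniGen2", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()        # moderate memory savings, moderate slowdown
# pipe.enable_sequential_cpu_offload() # maximum savings, slowest option
```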
The model supports multiple usage methods:
- Command-line tools
- Gradio web interface
- Jupyter notebooks
- Online demonstration platforms
Usage Recommendations
For optimal results, users are advised to:
- Use high-quality images: Provide clear images, preferably with resolution greater than 512×512 pixels
- Detailed instruction descriptions: Clearly describe what to modify and the expected results
- Use English prompts: The model performs best with English prompts
- Adjust parameter settings: Tune the text guidance strength and image guidance strength according to the task type (see the sketch after this list)
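A hedged example of tuning these two strengths, continuing the earlier sketch. The parameter names text_guidance_scale and image_guidance_scale are assumptions mirroring the settings described above, not confirmed API names.

```python
# Assumption: the pipeline exposes separate guidance strengths for the text
# instruction and the input image; exact names may differ in the released API.
edited = pipe(
    prompt="Change the dress to blue",
    input_images=[source],
    text_guidance_scale=5.0,    # higher = follow the instruction more strictly
    image_guidance_scale=2.0,   # higher = stay closer to the input image
    num_inference_steps=50,
).images[0]
```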
Technical Limitations
The current version has some limitations:
- The model may sometimes not fully follow instructions; generating multiple images for selection is recommended
- It cannot automatically determine the output image size and defaults to 1024×1024
- When processing multiple input images, the output size must be set manually to match the editing target (see the sketch after this list)
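A short continuation of the earlier sketch showing the manual size setting; the width and height arguments are assumed to exist as in the text-to-image example above.

```python
# The model does not infer the output resolution, so set it explicitly to match
# the image being edited (same assumed width/height arguments as above).
w, h = source.size
edited = pipe(
    prompt="Remove the cat",
    input_images=[source],
    width=w,
    height=h,
).images[0]
```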
Related Links
- Project Homepage: https://vectorspacelab.github.io/OmniGen2
- GitHub Repository: https://github.com/VectorSpaceLab/OmniGen2
- Model Download: https://huggingface.co/OmniGen2/OmniGen2
- Online Demo: https://huggingface.co/spaces/OmniGen2/OmniGen2
- Technical Paper: https://arxiv.org/abs/2506.18871
As an open-source project, OmniGen2 gives researchers and developers a powerful and efficient foundation for exploring controllable and personalized generative AI. The team states that it will also release training code and datasets to further support the community.