Alibaba AIDC-AI Releases Ovis-Image, a 7B Text-to-Image Model Optimized for Text Rendering

On November 29, 2025, Alibaba’s AIDC-AI team released Ovis-Image, a 7B-parameter text-to-image model built upon Ovis-U1. The model is optimized specifically for high-quality text rendering and is designed to run efficiently on limited computational resources.
Model Features
Text Rendering at Compact Scale
Ovis-Image is listed at 2B+7B parameters (text encoder plus diffusion backbone), against 7B+20B for Qwen-Image. Despite the smaller size, it matches or beats larger models on text rendering tasks. On the CVTG-2K benchmark, Ovis-Image reaches an average word accuracy (WA) of 0.9200, compared with 0.8288 for Qwen-Image and 0.8569 for GPT4o.
High-Fidelity Output for Text-Heavy Scenarios
The model excels in scenarios requiring precise text-image alignment, including:
- Poster and banner design
- Logo and brand graphics
- UI mockups
- Infographics
Ovis-Image generates clear, readable text with correct spelling and semantic consistency across different fonts, sizes, and aspect ratios.
Deployment-Friendly
With its 7B-parameter size and streamlined architecture, Ovis-Image:
- Runs on a single high-end GPU
- Supports low-latency interactive use
- Suits production scenarios that need reliable text rendering without deploying a model with tens of billions of parameters
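As a rough back-of-envelope check on the single-GPU claim, the sketch below estimates the memory needed just to hold the weights. It assumes every parameter is stored in bfloat16 (2 bytes) and ignores the VAE, activations, and framework overhead, so real usage will be higher; the helper name `weight_gib` is illustrative.

```python
BYTES_PER_PARAM_BF16 = 2  # assumption: all weights held in bfloat16


def weight_gib(params_billions: float, bytes_per_param: int = BYTES_PER_PARAM_BF16) -> float:
    """Approximate memory (GiB) needed to store the weights alone."""
    return params_billions * 1e9 * bytes_per_param / 2**30


# Parameter counts as listed in this article
ovis_image = weight_gib(2 + 7)    # 2B+7B
qwen_image = weight_gib(7 + 20)   # 7B+20B

print(f"Ovis-Image weights: ~{ovis_image:.1f} GiB")  # ~16.8 GiB
print(f"Qwen-Image weights: ~{qwen_image:.1f} GiB")  # ~50.3 GiB
```

Under these assumptions the Ovis-Image weights fit within a single 24 GiB card before activations are counted, while a 7B+20B model does not.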
Performance
CVTG-2K Text Rendering Benchmark
| Model | Parameters | WA (avg)↑ | NED↑ | CLIPScore↑ |
|---|---|---|---|---|
| GPT4o | - | 0.8569 | 0.9478 | 0.7982 |
| Qwen-Image | 7B+20B | 0.8288 | 0.9116 | 0.8017 |
| TextCrafter | 11B+12B | 0.7370 | 0.8679 | 0.7868 |
| Ovis-Image | 2B+7B | 0.9200 | 0.9695 | 0.8368 |
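To make the WA and NED columns more concrete, here is a minimal sketch of how word accuracy and a normalized-edit-distance similarity could be computed from the target text and the text an OCR system reads back from a generated image. This is an illustration under simplifying assumptions (case-insensitive exact word matching, character-level Levenshtein distance, NED reported as 1 minus the normalized distance); the benchmark's official scripts may tokenize and normalize differently, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def ned_similarity(target: str, recognized: str) -> float:
    """1 - normalized edit distance, so 1.0 means an exact match."""
    longest = max(len(target), len(recognized))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(target, recognized) / longest


def word_accuracy(target_words: list[str], recognized_words: list[str]) -> float:
    """Fraction of target words found verbatim (ignoring case) in the OCR output."""
    if not target_words:
        return 1.0
    recognized = {w.lower() for w in recognized_words}
    hits = sum(1 for w in target_words if w.lower() in recognized)
    return hits / len(target_words)


# Example: the image was meant to read "GRAND OPENING SALE"
target = ["GRAND", "OPENING", "SALE"]
ocr = ["GRAND", "OPENNG", "SALE"]  # one dropped letter
print(word_accuracy(target, ocr))           # 2 of 3 words match exactly
print(ned_similarity("OPENING", "OPENNG"))  # one edit over seven characters
```

The example shows why the two metrics are reported together: a single dropped letter costs a full word under WA but only a small fraction under NED.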
LongText-Bench Long Text Rendering
| Model | Parameters | English | Chinese |
|---|---|---|---|
| GPT4o | - | 0.956 | 0.619 |
| Qwen-Image | 7B+20B | 0.943 | 0.946 |
| Ovis-Image | 2B+7B | 0.922 | 0.964 |
For Chinese long-text rendering, Ovis-Image posts the highest score in the table at 0.964, ahead of Qwen-Image’s 0.946. For English it scores 0.922, behind GPT4o (0.956) and Qwen-Image (0.943).
General Image Generation
Beyond text rendering, Ovis-Image remains competitive on general text-to-image benchmarks such as DPG-Bench, GenEval, and OneIG-EN:
- DPG-Bench Overall: 86.59 (Qwen-Image: 88.32)
- GenEval Overall: 0.84 (on par with GPT4o)
- OneIG-EN Overall: 0.530 (close to Qwen-Image’s 0.539)
Technical Background
Ovis-Image is built upon Ovis-U1 and incorporates design elements from FLUX. The model has been tested with Python 3.10, PyTorch 2.6.0, and Transformers 4.57.1.
The development team has also released a dedicated diffusers branch for easy adoption.
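Before running the model locally, it can help to compare the installed packages with the tested versions listed above. The following is a small sketch using only the standard library; the `TESTED` table simply restates the versions from this article, the helper names are illustrative, and a mismatch does not necessarily mean the model will fail to run.

```python
import sys
from importlib.metadata import PackageNotFoundError, version

# Versions this article reports the model was tested with
TESTED = {"torch": "2.6.0", "transformers": "4.57.1"}
TESTED_PYTHON = (3, 10)


def check_environment(tested: dict[str, str] = TESTED) -> dict[str, str]:
    """Return 'ok', 'missing', or 'differs (installed X)' for each package."""
    report = {}
    for name, wanted in tested.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        # Ignore local build tags such as '+cu124' when comparing
        base = installed.split("+")[0]
        report[name] = "ok" if base == wanted else f"differs (installed {installed})"
    return report


def python_matches(tested: tuple[int, int] = TESTED_PYTHON) -> bool:
    """True if the running interpreter has the tested major.minor version."""
    return sys.version_info[:2] == tested


for package, status in check_environment().items():
    print(f"{package}: {status}")
print("python 3.10:", python_matches())
```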
How to Try
Ovis-Image can be tried in two ways:
- Online Demo: Try the model directly in the Hugging Face Space
- Local Deployment: Run local inference via PyTorch or the Diffusers library
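For the local route, a minimal sketch using the generic Diffusers loader might look like the following. It assumes a Diffusers build that includes the Ovis-Image pipeline (for example the dedicated branch mentioned above), a CUDA GPU with enough memory for bfloat16 weights, and that the model repository from the Related Links loads as a standard pipeline. The `generate` helper, the step count, and the output path are illustrative choices, not values taken from the release.

```python
MODEL_ID = "AIDC-AI/Ovis-Image-7B"  # repository name from the Related Links


def generate(prompt: str, out_path: str = "ovis_image.png", steps: int = 50) -> str:
    """Load the pipeline, render one image for `prompt`, and save it to `out_path`."""
    # Heavy imports stay local so this module can be imported without a GPU stack
    import torch
    from diffusers import DiffusionPipeline

    # The generic loader picks the pipeline class declared by the repository;
    # this only works if the installed Diffusers build knows that class.
    pipe = DiffusionPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
    pipe.to("cuda")

    # steps=50 is an assumed default, not a documented recommendation
    image = pipe(prompt=prompt, num_inference_steps=steps).images[0]
    image.save(out_path)
    return out_path
```

A call such as `generate('A cafe poster with the headline "GRAND OPENING" in bold serif type')` would then write `ovis_image.png` to the working directory; putting the exact wording in quotes inside the prompt is a common convention for text-rendering models rather than a requirement stated in the release.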
Related Links
- Paper: https://arxiv.org/abs/2511.22982
- Model: https://huggingface.co/AIDC-AI/Ovis-Image-7B
- Online Demo: https://huggingface.co/spaces/AIDC-AI/Ovis-Image-7B
- GitHub: https://github.com/AIDC-AI/Ovis-Image