Alibaba AIDC-AI Releases Ovis-Image - 7B Text-to-Image Model Optimized for Text Rendering

On November 29, 2025, Alibaba’s AIDC-AI team released Ovis-Image, a 7B parameter text-to-image model built upon Ovis-U1. The model is specifically optimized for high-quality text rendering and can run efficiently under limited computational resources.

Model Features

Text Rendering at Compact Scale

Ovis-Image has a parameter size of 2B+7B. Compared with larger models such as Qwen-Image (7B+20B), it achieves comparable or better performance on text rendering tasks. On the CVTG-2K benchmark, Ovis-Image reached a text rendering accuracy (WA average) of 0.9200, ahead of Qwen-Image's 0.8288 and GPT4o's 0.8569.
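To put the size and accuracy comparison in concrete terms, the short sketch below works out the WA gaps and the combined parameter counts from the figures quoted above. Summing the two parts of a size such as "2B+7B" into a single total is an assumption made here for illustration; the article does not say how the two components relate.

```python
# Figures quoted in the article. Sizes are (first, second) in billions of
# parameters; adding them into one total is an assumption of this sketch.
models = {
    "Ovis-Image": {"params_b": (2, 7), "wa": 0.9200},
    "Qwen-Image": {"params_b": (7, 20), "wa": 0.8288},
    "GPT4o": {"params_b": None, "wa": 0.8569},  # size not listed
}


def wa_gap(name, baseline="Ovis-Image"):
    """Absolute WA difference: baseline score minus the named model's score."""
    return round(models[baseline]["wa"] - models[name]["wa"], 4)


def total_params_b(name):
    """Combined size in billions, or None when no size is listed."""
    parts = models[name]["params_b"]
    return sum(parts) if parts else None


for other in ("Qwen-Image", "GPT4o"):
    print(f"{other}: WA gap {wa_gap(other):+.4f}, total size {total_params_b(other)}B")
```

Under that assumption, Ovis-Image totals 9B parameters against 27B for Qwen-Image, while leading it by 0.0912 on WA.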

High-Fidelity Output for Text-Heavy Scenarios

The model excels in scenarios requiring precise text-image alignment, including:

  • Poster and banner design
  • Logo and brand graphics
  • UI mockups
  • Infographics

Ovis-Image generates clear, readable text with correct spelling and semantic consistency across different fonts, sizes, and aspect ratios.

Deployment-Friendly

With its 7B parameter size and streamlined architecture, Ovis-Image:

  • Runs on a single high-end GPU
  • Supports low-latency interactive use
  • Suits production scenarios requiring text rendering without deploying models with tens of billions of parameters

Performance

CVTG-2K Text Rendering Benchmark

Model         Parameters   WA (avg)   NED↑     CLIPScore↑
GPT4o         -            0.8569     0.9478   0.7982
Qwen-Image    7B+20B       0.8288     0.9116   0.8017
TextCrafter   11B+12B      0.7370     0.8679   0.7868
Ovis-Image    2B+7B        0.9200     0.9695   0.8368
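The article does not define the WA and NED columns. As a rough illustration only, the sketch below assumes WA is a word-level exact-match accuracy and NED is a normalized edit distance reported as a similarity (1.0 means a perfect match, which fits the "↑" marker). The function names are hypothetical and this is not the benchmark's official scoring code.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def ned_similarity(target, recognized):
    """1 - (edit distance / longer length); assumed reading of the NED column."""
    longest = max(len(target), len(recognized))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(target, recognized) / longest


def word_accuracy(targets, recognized):
    """Fraction of target words reproduced exactly; assumed reading of WA."""
    if not targets:
        return 1.0
    hits = sum(t == r for t, r in zip(targets, recognized))
    return hits / len(targets)


# Hypothetical example: the prompt asked for "SUMMER SALE", OCR read "SUMMER SAFE".
print(word_accuracy(["SUMMER", "SALE"], ["SUMMER", "SAFE"]))
print(ned_similarity("SALE", "SAFE"))
```

A character-level measure like this gives partial credit for a near miss such as "SAFE", whereas exact-match word accuracy counts it as a full error, which is one reason the two columns can differ.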

LongText-Bench Long Text Rendering

Model        Parameters   English   Chinese
GPT4o        -            0.956     0.619
Qwen-Image   7B+20B       0.943     0.946
Ovis-Image   2B+7B        0.922     0.964

For Chinese long text rendering, Ovis-Image scored 0.964, the highest of the models listed. Its English score of 0.922 trails GPT4o (0.956) and Qwen-Image (0.943).

General Image Generation

Beyond text rendering, Ovis-Image maintains strong performance on general text-to-image benchmarks such as DPG-Bench, GenEval, and OneIG-EN:

  • DPG-Bench Overall: 86.59 (Qwen-Image: 88.32)
  • GenEval Overall: 0.84 (on par with GPT4o)
  • OneIG-EN Overall: 0.530 (close to Qwen-Image’s 0.539)

Technical Background

Ovis-Image is built upon Ovis-U1 and incorporates design elements from FLUX. The model has been tested with Python 3.10, Torch 2.6.0, and Transformers 4.57.1.
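Before attempting a local run, it can help to check whether an environment matches the versions listed above. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, and a mismatch does not necessarily mean the model will fail, only that the combination differs from the one the article says was tested.

```python
import sys
from importlib import metadata

# Versions the article says the model was tested with.
TESTED_PYTHON = (3, 10)
TESTED_PACKAGES = {"torch": "2.6.0", "transformers": "4.57.1"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def environment_report():
    """Map each component to (matches_tested_version, found_version)."""
    found_python = "%d.%d" % tuple(sys.version_info[:2])
    report = {"python": (tuple(sys.version_info[:2]) == TESTED_PYTHON, found_python)}
    for package, wanted in TESTED_PACKAGES.items():
        found = installed_version(package)
        # Drop local build tags such as "+cu124" before comparing.
        matches = found is not None and found.split("+")[0] == wanted
        report[package] = (matches, found)
    return report


for name, (ok, found) in environment_report().items():
    print(f"{name}: found {found}, matches tested version: {ok}")
```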

The development team has also released a dedicated diffusers branch for easy adoption.

How to Try

Users can try Ovis-Image in two ways:

  • Online Demo: Try the model directly on Hugging Face Space
  • Local Deployment: Run local inference via PyTorch or the Diffusers library