THUDM Open-Sources CogView4 - Native Chinese-Supported DiT Text-to-Image Model

CogView4 Sample Outputs

THUDM has officially open-sourced CogView4, a text-to-image generation model that it presents as the first open-source Diffusion Transformer (DiT) model with native Chinese prompt support and the ability to render Chinese characters in images. The model scores 85.13 on DPG-Bench, ahead of both SDXL and DALL-E 3 in the reported comparison.

Key Features

Bilingual Generation

Enhanced GLM-4 text encoder supporting Chinese-English bilingual input
Trained on millions of Chinese-English image-text pairs
Achieves a 61.68% F1 score on Chinese character generation

Smart Text Processing

Dynamic text length support (up to 1024 tokens)
Reduces redundant computations by 50% compared to fixed-length solutions
Improves training efficiency by up to 30%
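
The article does not say how CogView4 implements dynamic text length, so the following is only a minimal sketch of the general idea: padding each batch to its longest prompt instead of to a fixed maximum. The 1024-token cap comes from the list above; the prompt lengths and function names are hypothetical, and the savings figure it prints depends entirely on those made-up lengths.

```python
# Toy comparison of fixed-length vs. dynamic-length prompt padding.
# MAX_TOKENS is taken from the article; everything else is illustrative.
MAX_TOKENS = 1024


def padded_tokens_fixed(lengths, max_tokens=MAX_TOKENS):
    """Token slots used when every prompt is padded to max_tokens."""
    return len(lengths) * max_tokens


def padded_tokens_dynamic(lengths, max_tokens=MAX_TOKENS):
    """Token slots used when prompts are padded to the longest in the batch."""
    longest = min(max(lengths), max_tokens)  # over-long prompts are truncated
    return len(lengths) * longest


def savings(lengths, max_tokens=MAX_TOKENS):
    """Fraction of token slots avoided by dynamic padding."""
    fixed = padded_tokens_fixed(lengths, max_tokens)
    dynamic = padded_tokens_dynamic(lengths, max_tokens)
    return 1 - dynamic / fixed


batch = [48, 120, 300, 400]  # hypothetical token counts for one batch
print("fixed slots:  ", padded_tokens_fixed(batch))
print("dynamic slots:", padded_tokens_dynamic(batch))
print(f"saved: {savings(batch):.1%}")
```

Short prompts benefit most: the closer the longest prompt in a batch is to the cap, the smaller the saving.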

Flexible Resolution

Supports output from 512px to 2048px
Mixed-resolution training for different scenarios
Optimized for social media aspect ratios (9:16, 1:1, 4:3)
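
To make the resolution range concrete, here is a small helper that turns an aspect ratio into a width and height inside the 512px to 2048px range. It is a sketch, not part of CogView4: the function names are hypothetical, and snapping each side to a multiple of 32 is an assumption about what the model accepts rather than something stated in this article.

```python
MIN_SIDE, MAX_SIDE = 512, 2048  # range stated in the article
MULTIPLE = 32                   # assumption: sides must be multiples of 32


def snap(value, multiple=MULTIPLE):
    """Round to the nearest multiple, then clamp to the supported range."""
    snapped = round(value / multiple) * multiple
    return max(MIN_SIDE, min(MAX_SIDE, snapped))


def size_for_ratio(ratio_w, ratio_h, long_side=1024):
    """Return (width, height) whose longer side is about long_side.

    Clamping can distort the ratio when the shorter side would fall
    below MIN_SIDE, so pick a long_side that leaves enough room.
    """
    if ratio_w >= ratio_h:
        width, height = long_side, long_side * ratio_h / ratio_w
    else:
        width, height = long_side * ratio_w / ratio_h, long_side
    return snap(width), snap(height)


for ratio in [(9, 16), (1, 1), (4, 3)]:
    print(ratio, "->", size_for_ratio(*ratio))
```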

Technical Advantages

Innovative “Relay Diffusion” framework:

Base Generation: Rapid low-resolution outline creation
Super-Resolution: Detail refinement through flow-matching
Dynamic Noise Scheduling: Optimizes speed-quality balance
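
For readers unfamiliar with flow-matching, the toy below shows the sampling idea in isolation: start from Gaussian noise and integrate a velocity field with Euler steps until t = 1. It is not CogView4's sampler. A real model predicts the velocity with a neural network conditioned on the prompt; here an oracle that already knows the target stands in for it, and all names are hypothetical.

```python
import random


def oracle_velocity(x, t, target):
    """Velocity of the straight path x_t = (1 - t) * noise + t * target.

    Stand-in for a learned network: it uses the known target so the
    example stays self-contained.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def euler_sample(target, steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (sample)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so no division by zero
        v = oracle_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


target = [0.2, -0.5, 1.0]  # stand-in for image latents
print(euler_sample(target, steps=10))
```

Because the path is a straight line, Euler integration lands on the target up to floating-point error. With a learned velocity field the path is only approximately straight, which is where the step count starts to trade speed against quality.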

Benchmark Performance:

DPG-Bench score 85.13 (vs SDXL 74.65 / DALL-E 3 83.50)
T2I-CompBench complex scene score 0.3869
114% improvement in Chinese character generation accuracy

Hardware Optimization

Multi-level optimization for different devices:

Basic Mode: Runs on RTX 3090 for 512x512 generation
Memory Optimization: Reduces VRAM usage to 13 GB via CPU offloading
4-bit Quantization: Accelerates inference with a compressed text encoder
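
As an illustration of the CPU offloading option, the sketch below enables the standard memory-saving switches that Hugging Face diffusers pipelines expose. It assumes the model is loaded through diffusers, which this article does not state; the helper name is hypothetical. Quantization is left out because the article does not say which toolchain is used for it.

```python
def apply_memory_savers(pipe):
    """Enable diffusers memory options on an already-loaded pipeline.

    Assumes `pipe` is a diffusers pipeline with a `vae` attribute.
    """
    pipe.enable_model_cpu_offload()  # keep idle sub-models on the CPU
    pipe.vae.enable_slicing()        # decode batch items one at a time
    pipe.vae.enable_tiling()         # decode large images tile by tile
    return pipe
```

These options trade speed for memory, so they are best left off when the GPU has room to spare.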

Usage

Available through HuggingFace Spaces for instant testing. Developers can also access the full codebase, which supports:

Mixed Chinese-English prompts
Custom output dimensions
Batch generation support
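
The three features above map onto a single pipeline call. The sketch below assumes the model is used through the Hugging Face diffusers CogView4Pipeline class and the THUDM/CogView4-6B checkpoint; neither is named in this article. The helper names, the guidance scale, and the step count are illustrative choices, not recommended settings.

```python
def build_call_kwargs(prompt, width=1024, height=1024, images_per_prompt=1):
    """Collect the arguments for one pipeline call."""
    return {
        "prompt": prompt,                            # may mix Chinese and English
        "width": width,                              # custom output dimensions
        "height": height,
        "num_images_per_prompt": images_per_prompt,  # batch generation
        "guidance_scale": 3.5,                       # illustrative value
        "num_inference_steps": 50,                   # illustrative value
    }


def generate(prompt, **options):
    """Load the pipeline and return a list of PIL images."""
    # Imported lazily so build_call_kwargs works without a GPU stack.
    import torch
    from diffusers import CogView4Pipeline

    pipe = CogView4Pipeline.from_pretrained(
        "THUDM/CogView4-6B",  # assumed checkpoint id
        torch_dtype=torch.bfloat16,
    )
    pipe.to("cuda")
    return pipe(**build_call_kwargs(prompt, **options)).images
```

For example, generate("一只猫 sitting on a 红色 sofa", width=576, height=1024, images_per_prompt=4) would request four portrait images from one mixed-language prompt, given a CUDA GPU with enough memory.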

Resources

THUDM plans to release ControlNet modules, ComfyUI workflow support, and fine-tuning toolkits within three months to enhance accessibility for non-technical users.