THUDM Open-Sources CogView4 - Native Chinese-Supported DiT Text-to-Image Model
THUDM has officially open-sourced the CogView4 multimodal generation model, the first Diffusion Transformer (DiT) model with native Chinese support and Chinese character generation capabilities. The model achieved a top score of 85.13 on the DPG-Bench benchmark, demonstrating exceptional image generation quality.
Key Features
Bilingual Generation
- Enhanced GLM-4 text encoder supporting Chinese-English bilingual input
- Trained on millions of Chinese-English image-text pairs
- Achieves a 61.68% F1 score for Chinese character generation accuracy
Smart Text Processing
- Dynamic text length support (up to 1024 tokens)
- Reduces redundant computations by 50% compared to fixed-length solutions
- Improves training efficiency by up to 30%
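The padding savings behind the dynamic-length scheme can be illustrated with a toy calculation (the caption lengths below are invented for illustration, not the model's actual batch statistics):

```python
# Illustrative calculation (not model internals): token cost of
# fixed-length padding vs. dynamic (pack-to-actual-length) captions.

def padded_cost(lengths, max_len=1024):
    # Fixed-length scheme: every caption is padded to max_len tokens.
    return max_len * len(lengths)

def dynamic_cost(lengths):
    # Dynamic scheme: each caption costs only its own token count.
    return sum(lengths)

# Hypothetical caption lengths for one training batch.
lengths = [180, 220, 310, 95, 640, 128, 410, 65]

fixed = padded_cost(lengths)    # 8 * 1024 = 8192 tokens
packed = dynamic_cost(lengths)  # 2048 tokens
savings = 1 - packed / fixed
print(f"tokens: fixed={fixed}, dynamic={packed}, saved={savings:.0%}")
```

Because most real captions are far shorter than 1024 tokens, skipping the padding removes a large share of the attention and MLP computation, which is where the reported redundancy reduction comes from.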
Flexible Resolution
- Supports output from 512px to 2048px
- Mixed-resolution training for different scenarios
- Optimized for social media aspect ratios (9:16, 1:1, 4:3)
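A hypothetical helper shows how output dimensions for the listed aspect ratios could be chosen within the stated 512-2048 px range. Snapping to multiples of 32 is an assumption here (a common DiT latent constraint), not a documented CogView4 requirement:

```python
# Hypothetical helper: pick width/height for a target aspect ratio,
# clamped to CogView4's stated 512-2048 px range. The multiple-of-32
# snapping is an assumption, not a confirmed CogView4 rule.

def snap(v, step=32):
    return max(512, min(2048, round(v / step) * step))

def dims_for(aspect_w, aspect_h, long_side=1024):
    # Scale the shorter side down from the chosen long side.
    if aspect_w >= aspect_h:
        w, h = long_side, long_side * aspect_h / aspect_w
    else:
        h, w = long_side, long_side * aspect_w / aspect_h
    return snap(w), snap(h)

print(dims_for(9, 16))  # portrait, e.g. 9:16 social posts -> (576, 1024)
print(dims_for(1, 1))   # square -> (1024, 1024)
print(dims_for(4, 3))   # landscape -> (1024, 768)
```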
Technical Advantages
Innovative “Relay Diffusion” framework:
- Base Generation: Rapid low-resolution outline creation
- Super-Resolution: Detail refinement through flow-matching
- Dynamic Noise Scheduling: Optimizes speed-quality balance
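The three-stage data flow above can be sketched in NumPy. This is a conceptual outline of the relay idea only; the `base_stage` and `relay_refine` functions are stand-ins, not CogView4's actual networks or noise schedules:

```python
# Conceptual sketch of the relay idea: a base stage produces a
# low-resolution image, which is upsampled and then refined at high
# resolution. All update rules below are placeholders.
import numpy as np

rng = np.random.default_rng(0)

def base_stage(h=64, w=64, steps=4):
    # Stand-in for low-resolution diffusion sampling.
    x = rng.normal(size=(h, w, 3))
    for _ in range(steps):
        x = x - 0.1 * x  # placeholder denoising update
    return x

def upsample(x, factor=4):
    # Nearest-neighbor upsample to the target resolution.
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def relay_refine(x, steps=2):
    # Stand-in for flow-matching refinement: start from the
    # upsampled image (not pure noise) and take correction steps.
    for _ in range(steps):
        x = x + 0.05 * rng.normal(size=x.shape)  # placeholder update
    return x

low = base_stage()             # low-resolution outline
high = relay_refine(upsample(low))
print(low.shape, high.shape)   # (64, 64, 3) (256, 256, 3)
```

The key design point is that the high-resolution stage starts from the upsampled base image rather than from pure noise, which is what lets it spend its steps on detail refinement instead of re-discovering the layout.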
Benchmark Performance:
- DPG-Bench score 85.13 (vs SDXL 74.65 / DALL-E 3 83.50)
- T2I-CompBench complex scene score 0.3869
- 114% improvement in Chinese character generation accuracy
Hardware Optimization
Multi-level optimization for different devices:
- Basic Mode: Runs 512×512 generation on an RTX 3090
- Memory Optimization: Reduces VRAM usage to 13GB via CPU offloading
- 4-bit Quantization: Accelerates inference with a compressed text encoder
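A back-of-envelope estimate shows why quantizing the text encoder to 4 bits matters for VRAM. The 9B parameter count below is an illustrative assumption for a GLM-4-class encoder, not a confirmed CogView4 figure:

```python
# Back-of-envelope VRAM estimate for quantizing a text encoder.
# The 9e9 parameter count is an illustrative assumption.

def weight_gib(params, bits):
    # bits per parameter -> GiB of weight storage.
    return params * bits / 8 / 2**30

params = 9e9
fp16 = weight_gib(params, 16)  # ~16.8 GiB
int4 = weight_gib(params, 4)   # ~4.2 GiB
print(f"fp16: {fp16:.1f} GiB, 4-bit: {int4:.1f} GiB, "
      f"saved: {fp16 - int4:.1f} GiB")
```

Weight storage alone shrinks by a factor of four; actual savings are smaller once activations, the DiT backbone, and the VAE are counted, which is why offloading and quantization are combined.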
Usage
Available through HuggingFace Spaces for instant testing; developers can also access the full codebase, whose inference pipeline supports:
- Mixed Chinese-English prompts
- Custom output dimensions
- Batch generation support
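For local use, inference might look like the following sketch. It assumes the `CogView4Pipeline` integration in recent versions of Hugging Face diffusers and the `THUDM/CogView4-6B` checkpoint, and requires a CUDA GPU, so treat the details as indicative rather than guaranteed:

```python
# Sketch of local inference (assumes the diffusers CogView4Pipeline
# integration and the THUDM/CogView4-6B checkpoint; needs a CUDA GPU).
import torch
from diffusers import CogView4Pipeline

pipe = CogView4Pipeline.from_pretrained(
    "THUDM/CogView4-6B", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # trades speed for lower peak VRAM

image = pipe(
    prompt="一只戴着红围巾的猫 sitting on a snowy rooftop",  # mixed zh/en prompt
    width=1024,
    height=1024,
).images[0]
image.save("cogview4_demo.png")
```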
Roadmap
THUDM plans to release ControlNet modules, ComfyUI workflow support, and fine-tuning toolkits within three months to enhance accessibility for non-technical users.