DomainShuttle: HKUST Open-Sources 14B Subject-Driven Text-to-Video on Wan2.2
HKUST C4G releases DomainShuttle, an Apache-2.0 open-domain subject-driven video generation model built on Wan2.2-T2V-A14B. It features Domain-MoT, Video-Reference DualRoPE, and Cross-Pair Consistent Loss for flexible in-domain fidelity and cross-domain style transfer.
On June 23, 2026, the C4G Lab at the Hong Kong University of Science and Technology (HKUST) released DomainShuttle, an open-domain subject-driven text-to-video generation method under the Apache 2.0 license. The model is built on Wan2.2-T2V-A14B and introduces a novel architecture for flexible subject personalization across both in-domain and cross-domain scenarios.
TL;DR: DomainShuttle lets you shuttle any subject across domains, either keeping it in its original style (in-domain) or transforming it into new styles, semantics, and environments (cross-domain), while preserving the subject's intrinsic identity.
What Makes DomainShuttle Different
Existing subject-driven video methods excel at in-domain fidelity but struggle with cross-domain editability — changing a character's style, posing it in a new environment, or applying semantic transformations while keeping identity intact. DomainShuttle is designed from the ground up to handle both.
The method introduces three technical contributions:
1. Domain-MoT (Mixture-of-Transformers)
Decouples video features and reference image features through separate transformer pathways. A domain-aware AdaLN (Adaptive Layer Normalization) module enables domain-specific modeling of reference images, letting the model distinguish between what is intrinsic to the subject and what belongs to the surrounding domain (style, lighting, background).
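The domain-aware AdaLN idea can be illustrated with a minimal sketch. It assumes the common AdaLN form (LayerNorm followed by a conditioned scale and shift) and uses plain Python to stay dependency-free. The domain names, the `DOMAIN_MODULATION` table, and the single scalar scale/shift per domain are hypothetical simplifications, not taken from the DomainShuttle code.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a feature vector to zero mean and (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

# Hypothetical per-domain modulation parameters (scale, shift).
# In a trained model these would typically be predicted per channel from a
# learned domain embedding; fixed scalars are used here for illustration only.
DOMAIN_MODULATION = {
    "in_domain": (0.0, 0.0),
    "cross_domain": (0.5, -0.1),
}

def domain_adaln(x, domain):
    """Apply LayerNorm, then a domain-specific scale and shift."""
    scale, shift = DOMAIN_MODULATION[domain]
    return [v * (1.0 + scale) + shift for v in layer_norm(x)]

# The same reference-image token is modulated differently per domain.
ref_token = [0.2, -1.3, 0.7, 2.1]
in_out = domain_adaln(ref_token, "in_domain")
cross_out = domain_adaln(ref_token, "cross_domain")
```

The point of the sketch is only that one set of reference features can be routed through different normalization parameters depending on the target domain; how DomainShuttle derives those parameters is described in the technical report.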
2. Video-Reference DualRoPE
Places reference image tokens and video generation tokens in separate RoPE (Rotary Position Embedding) spaces. This allows precise subject-level spatial modeling — the model treats the reference subject as an anchor and maps it into the video's coordinate system without positional confusion.
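One way to picture separate positional spaces is sketched below with standard 1D RoPE. This is an assumption-laden illustration: the article does not specify how DualRoPE separates the two spaces, so the disjoint index ranges and the `REF_OFFSET` value are hypothetical choices made only to show the mechanics.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `pos`."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)  # per-pair rotation frequency
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

NUM_VIDEO_TOKENS = 8
REF_OFFSET = 1000  # hypothetical: keeps reference positions out of the video range

def video_pos(i):
    """Position index of the i-th video token."""
    return i

def ref_pos(j):
    """Position index of the j-th reference token, in its own range."""
    return REF_OFFSET + j

q = [0.3, -0.5, 1.0, 0.2]   # query from a video token
k = [0.8, 0.1, -0.4, 0.6]   # key from a reference token

# RoPE attention scores depend only on the relative position, so shifting
# both tokens by the same amount leaves the score unchanged.
score_a = dot(rope(q, video_pos(3)), rope(k, ref_pos(0)))
score_b = dot(rope(q, video_pos(5)), rope(k, ref_pos(2)))
```

Because the two index ranges never overlap in this sketch, a reference token can never share a position with a video token, which is one plausible reading of avoiding positional confusion.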
3. Cross-Pair Consistent Loss
A novel training objective that extracts intrinsic subject features unaffected by irrelevant attributes (background, pose, lighting, camera angle). By enforcing consistency across different prompt-driven variations of the same subject, the model learns what defines the subject rather than the context around it.
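The consistency idea can be sketched as a loss that pulls together subject features extracted from two variations of the same subject. The exact formulation of the Cross-Pair Consistent Loss is not given here, so the cosine-distance form, the function names, and the toy feature vectors below are all assumptions for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    num = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return num / (norm_a * norm_b)

def consistency_loss(feat_a, feat_b):
    """Hypothetical consistency term: 1 - cosine similarity (0 when aligned)."""
    return 1.0 - cosine_similarity(feat_a, feat_b)

# Toy subject features from two prompt-driven variations of one subject
# (same direction, different magnitude) and from an unrelated subject.
variation_a = [0.9, 0.1, 0.4]
variation_b = [1.8, 0.2, 0.8]
other_subject = [-0.2, 0.9, 0.1]

loss_same = consistency_loss(variation_a, variation_b)
loss_other = consistency_loss(variation_a, other_subject)
```

Minimizing a term like `loss_same` during training would push the feature extractor to ignore whatever differs between the two variations (background, pose, lighting) and keep what they share.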
Architecture & Availability
DomainShuttle is a 14B-parameter model built on Wan2.2's T2V backbone. It runs 480p and 720p inference on standard GPUs.
| Resource | Link |
|---|---|
| GitHub | HKUST-C4G/DomainShuttle |
| HuggingFace Weights | CNcreator0331/DomainShuttle_weight |
| Technical Report | arXiv 2606.26058 |
| Project Page | cn-makers.github.io/DomainShuttle |
| License | Apache 2.0 |
Quick Start
conda create -n DomainShuttle python=3.10
conda activate DomainShuttle
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
bash build_env_conda.sh
# Download weights
hf download CNcreator0331/DomainShuttle_weight --local-dir ./models/Diffusion_Transformers/Wan2.2-DomainShuttle-A14B
hf download Wan-AI/Wan2.2-T2V-A14B --local-dir ./checkpoints/Wan2.2-T2V-A14B
# Inference
bash run_wan22_domainshuttle.sh
According to the benchmarks in the paper, DomainShuttle reports significant improvements in subject consistency metrics (CLIP, DINO, face similarity) over prior methods across diverse open-domain scenarios, including human-object interaction, multi-object generation, and multi-person generation.
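As a pre-flight check for the Quick Start, the following sketch verifies that the two `--local-dir` targets from the download commands exist and are not empty. The helper name `missing_weight_dirs` is hypothetical, and whether the inference script reads from exactly these paths is an assumption based on those commands.

```python
from pathlib import Path

# Directories taken from the `hf download --local-dir` commands in Quick Start.
EXPECTED_DIRS = [
    "models/Diffusion_Transformers/Wan2.2-DomainShuttle-A14B",
    "checkpoints/Wan2.2-T2V-A14B",
]

def missing_weight_dirs(repo_root="."):
    """Return the expected weight directories that are absent or empty."""
    root = Path(repo_root)
    missing = []
    for rel in EXPECTED_DIRS:
        directory = root / rel
        if not directory.is_dir() or not any(directory.iterdir()):
            missing.append(rel)
    return missing
```

Calling `missing_weight_dirs()` from the repository root returns an empty list once both downloads have completed.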