DomainShuttle: HKUST Open-Sources 14B Subject-Driven Text-to-Video on Wan2.2


HKUST C4G releases DomainShuttle, an Apache-2.0 open-domain subject-driven video generation model built on Wan2.2-T2V-A14B. It features Domain-MoT, Video-Reference DualRoPE, and Cross-Pair Consistent Loss for flexible in-domain fidelity and cross-domain style transfer.

On June 23, 2026, the C4G Lab at the Hong Kong University of Science and Technology (HKUST) released DomainShuttle, an open-domain subject-driven text-to-video generation method, under the Apache 2.0 license. The model is built on Wan2.2-T2V-A14B and introduces a novel architecture for flexible subject personalization across both in-domain and cross-domain scenarios.

TL;DR: DomainShuttle lets you shuttle any subject across domains, either keeping it in its original style (in-domain) or transforming it into new styles, semantics, and environments (cross-domain), while preserving the subject's intrinsic identity.

What Makes DomainShuttle Different

Existing subject-driven video methods excel at in-domain fidelity but struggle with cross-domain editability — changing a character's style, posing it in a new environment, or applying semantic transformations while keeping identity intact. DomainShuttle is designed from the ground up to handle both.

The method introduces three technical contributions:

1. Domain-MoT (Mixture-of-Transformers)

Decouples video features and reference image features through separate transformer pathways. A domain-aware AdaLN (Adaptive Layer Normalization) module enables domain-specific modeling of reference images, letting the model distinguish between what is intrinsic to the subject and what belongs to the surrounding domain (style, lighting, background).
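As a rough illustration of the AdaLN idea, here is a minimal sketch in plain Python. It assumes that "domain-aware" means the scale and shift applied after layer normalization are selected per domain. The article does not give the actual formulation, so the names DOMAIN_MODULATION and domain_adaln, the domain labels, and the numeric values are all hypothetical.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a feature vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

# Hypothetical per-domain (scale, shift) pairs. In a trained model these would
# be predicted by a learned layer; the labels and values here are made up.
DOMAIN_MODULATION = {
    "in_domain": (0.0, 0.0),      # identity modulation
    "cross_domain": (0.5, -0.1),  # rescale and shift the normalized features
}

def domain_adaln(x, domain):
    """AdaLN-style modulation: LayerNorm(x) * (1 + scale) + shift, chosen by domain."""
    scale, shift = DOMAIN_MODULATION[domain]
    return [v * (1.0 + scale) + shift for v in layer_norm(x)]

reference_token = [0.2, -1.3, 0.7, 2.1]  # toy reference-image feature
same = domain_adaln(reference_token, "in_domain")
shifted = domain_adaln(reference_token, "cross_domain")
print("in-domain:   ", [round(v, 3) for v in same])
print("cross-domain:", [round(v, 3) for v in shifted])
```

The same reference token yields different modulated features depending on the domain label, which is the general mechanism that lets a shared normalization layer behave differently per domain.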

2. Video-Reference DualRoPE

Places reference image tokens and video generation tokens in separate RoPE (Rotary Position Embedding) spaces. This allows precise subject-level spatial modeling — the model treats the reference subject as an anchor and maps it into the video's coordinate system without positional confusion.
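To make the idea of separate positional spaces more concrete, the following sketch applies standard 1D RoPE to toy vectors and assigns reference tokens a position range that never overlaps the video tokens. Modeling the two spaces as disjoint index ranges is an assumption made for illustration; the article does not describe how DualRoPE separates them, and REF_OFFSET, the token counts, and the toy vectors are hypothetical.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of an even-length vector by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)  # standard RoPE frequency schedule
        c, s = math.cos(theta), math.sin(theta)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Assumption: the two "RoPE spaces" are modeled as disjoint position ranges.
NUM_VIDEO_TOKENS = 8
NUM_REF_TOKENS = 4
REF_OFFSET = 1000  # hypothetical offset, chosen only to keep the ranges apart

video_positions = list(range(NUM_VIDEO_TOKENS))
reference_positions = [REF_OFFSET + j for j in range(NUM_REF_TOKENS)]

query = [0.3, -1.2, 0.8, 0.5]  # toy video-token query
key = [1.1, 0.4, -0.7, 0.9]    # toy reference-token key

# Attention logit between video token 3 and reference token 0.
score = dot(rope(query, video_positions[3]), rope(key, reference_positions[0]))
print(f"video[3] x reference[0] logit: {score:.4f}")
```

RoPE attention logits depend only on the difference between two positions, so in this sketch no reference token can be mistaken for a video token at the same location.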

3. Cross-Pair Consistent Loss

A novel training objective that extracts intrinsic subject features unaffected by irrelevant attributes (background, pose, lighting, camera angle). By enforcing consistency across different prompt-driven variations of the same subject, the model learns what makes the subject itself, not the context around it.
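The article does not give the exact form of this loss. As a stand-in, the sketch below uses a plain pairwise cosine-distance penalty: subject features extracted from different variations of the same subject are pushed toward the same direction. The function name pairwise_consistency_loss and the feature values are hypothetical, and this is not presented as the paper's formulation.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pairwise_consistency_loss(features):
    """Average (1 - cosine similarity) over every pair of subject feature vectors."""
    pairs = list(combinations(features, 2))
    return sum(1.0 - cosine_similarity(a, b) for a, b in pairs) / len(pairs)

# Made-up subject features for one subject under different prompt variations.
consistent = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [0.5, 1.0, 1.5]]  # same direction
drifting = [[1.0, 0.0], [0.0, 1.0]]                               # orthogonal

print("consistent features:", pairwise_consistency_loss(consistent))
print("drifting features:  ", pairwise_consistency_loss(drifting))
```

A loss of this shape is near zero when the extracted features agree regardless of background, pose, or lighting, and grows as the features drift apart between variations.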

Architecture & Availability

DomainShuttle is a 14B-parameter model built on Wan2.2's T2V backbone. It runs 480p and 720p inference on standard GPUs.

Resource              Link
GitHub                HKUST-C4G/DomainShuttle
HuggingFace Weights   CNcreator0331/DomainShuttle_weight
Technical Report      arXiv 2606.26058
Project Page          cn-makers.github.io/DomainShuttle
License               Apache 2.0

Quick Start

# Set up the environment (run from the root of the DomainShuttle repository)
conda create -n DomainShuttle python=3.10
conda activate DomainShuttle
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
bash build_env_conda.sh

# Download weights
hf download CNcreator0331/DomainShuttle_weight --local-dir ./models/Diffusion_Transformers/Wan2.2-DomainShuttle-A14B
hf download Wan-AI/Wan2.2-T2V-A14B --local-dir ./checkpoints/Wan2.2-T2V-A14B

# Inference
bash run_wan22_domainshuttle.sh
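
Before launching inference, it can help to confirm that both downloads landed where the commands above put them. The following is an optional pre-flight sketch, not part of the DomainShuttle repository: the helper name missing_weight_dirs is hypothetical, and the script only checks that the two --local-dir targets exist and are non-empty, without validating the files inside them.

```python
from pathlib import Path

# The two --local-dir targets used in the download commands.
EXPECTED_DIRS = [
    "models/Diffusion_Transformers/Wan2.2-DomainShuttle-A14B",
    "checkpoints/Wan2.2-T2V-A14B",
]

def missing_weight_dirs(repo_root, expected=EXPECTED_DIRS):
    """Return the expected weight directories that are absent or empty."""
    root = Path(repo_root)
    missing = []
    for rel in expected:
        path = root / rel
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    return missing

if __name__ == "__main__":
    problems = missing_weight_dirs(".")
    for rel in problems:
        print(f"missing or empty: {rel}")
    if not problems:
        print("both weight directories are present")
```

Run it from the directory where the download commands were executed, since both paths are relative.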

According to the benchmarks reported in the paper, DomainShuttle improves subject-consistency metrics (CLIP, DINO, and face similarity) over prior methods across diverse open-domain scenarios, including human-object interaction, multi-object generation, and multi-person generation.
