
ByteDance USO ComfyUI Workflow Guide: Image Style Transfer and Subject Identity Preservation in Image Generation

USO (Unified Style and Subject-Driven Generation) is a model developed by ByteDance’s UXO Team that unifies style-driven and subject-driven generation tasks. Built on the FLUX.1-dev architecture, it addresses the way traditional methods treat style-driven and subject-driven generation as opposing tasks: USO instead uses a unified framework whose core goal is the decoupling and recombination of content and style.


The model adopts a two-stage training method (sketched in code after this list):

  • Stage One: Align SigLIP embeddings through style alignment training to obtain a model with style capabilities
  • Stage Two: Decouple the conditional encoder and train on triplet data to achieve joint conditional generation
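To make the two stages concrete, here is a toy PyTorch sketch. Everything in it is a hypothetical stand-in (the modules, dimensions, and MSE loss are illustrative substitutes, not USO’s actual architecture or objective); it only shows how the conditioning signal changes between the stages.

```python
import torch
from torch import nn

# Toy stand-ins -- NOT USO's real components or sizes.
siglip = nn.Linear(64, 32)        # stands in for the frozen SigLIP encoder
for p in siglip.parameters():
    p.requires_grad_(False)       # SigLIP stays frozen in both stages
projector = nn.Linear(32, 16)     # maps style embeddings into the DiT space
content_enc = nn.Linear(64, 16)   # content/subject branch (stage two)
dit = nn.Linear(32, 16)           # stands in for the FLUX.1 DiT backbone

opt = torch.optim.Adam(
    [*projector.parameters(), *content_enc.parameters(), *dit.parameters()],
    lr=1e-3,
)

def train_step(cond: torch.Tensor, target: torch.Tensor) -> None:
    """One denoising-style update: predict the target from noise + condition.
    MSE here is a placeholder for the real diffusion training loss."""
    noisy = torch.randn(target.shape[0], 16)
    pred = dit(torch.cat([noisy, cond], dim=-1))
    loss = nn.functional.mse_loss(pred, target)
    opt.zero_grad(); loss.backward(); opt.step()

style_img, content_img = torch.randn(4, 64), torch.randn(4, 64)
target = torch.randn(4, 16)

# Stage one: style alignment -- condition only on projected SigLIP features.
train_step(projector(siglip(style_img)), target)

# Stage two: triplet training with a decoupled condition, so subject
# identity and style can later be recombined independently.
train_step(projector(siglip(style_img)) + content_enc(content_img), target)
```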

USO supports multiple generation modes:

  • Subject-Driven Generation: Maintains subject identity consistency, suitable for stylizing specific subjects such as people and objects
  • Style-Driven Generation: Achieves high-quality style transfer by applying the style of reference images to new content
  • Identity-Driven Generation: Performs stylization while maintaining identity characteristics, particularly suitable for portrait stylization
  • Joint Style-Subject Generation: Simultaneously controls subject and style to achieve complex creative expressions
  • Multi-Style Mixed Generation: Supports the fusion application of multiple styles


ByteDance USO ComfyUI Native Workflow


1. Workflow and input

Download the image below and drag it into ComfyUI to load the corresponding workflow.

[Workflow image]

Use the image below as an input image.

[Input image]

2. Models

Download the following models and place them in the directories shown below (a quick check script follows the directory tree):

📂 ComfyUI/
├── 📂 models/
│   ├── 📂 checkpoints/
│   │   └── flux1-dev-fp8.safetensors
│   ├── 📂 loras/
│   │   └── uso-flux1-dit-lora-v1.safetensors
│   ├── 📂 model_patches/
│   │   └── uso-flux1-projector-v1.safetensors
│   └── 📂 clip_visions/
│       └── sigclip_vision_patch14_384.safetensors
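After downloading, you can sanity-check the placement with a few lines of Python. The ComfyUI root path is an assumption here; point it at your own install:

```python
from pathlib import Path

# Adjust this to your actual ComfyUI install location.
COMFYUI_ROOT = Path("ComfyUI")

EXPECTED = [
    "models/checkpoints/flux1-dev-fp8.safetensors",
    "models/loras/uso-flux1-dit-lora-v1.safetensors",
    "models/model_patches/uso-flux1-projector-v1.safetensors",
    "models/clip_visions/sigclip_vision_patch14_384.safetensors",
]

for rel in EXPECTED:
    path = COMFYUI_ROOT / rel
    print(f"{'OK     ' if path.is_file() else 'MISSING'} {path}")
```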

3. Workflow instructions

[Workflow image]

  1. Load models:
    • 1.1 Ensure the Load Checkpoint node has flux1-dev-fp8.safetensors loaded
    • 1.2 Ensure the LoraLoaderModelOnly node has uso-flux1-dit-lora-v1.safetensors loaded
    • 1.3 Ensure the ModelPatchLoader node has uso-flux1-projector-v1.safetensors loaded
    • 1.4 Ensure the Load CLIP Vision node has sigclip_vision_patch14_384.safetensors loaded
  2. Content Reference:
    • 2.1 Click Upload to upload the input image we provided
    • 2.2 The ImageScaleToMaxDimension node scales your input image for content reference. A setting of 512px preserves more of the character’s features, but if you only use the character’s head as input, the output often has issues such as the character taking up too much of the frame; setting it to 1024px usually gives much better results (see the Pillow sketch after this list for what this scaling amounts to).
  3. This example uses only the content reference image input. To use the style reference image input as well, select the marked node group and press Ctrl+B to toggle it out of bypass.
  4. Write your prompt, or keep the default
  5. Set the image size if needed
  6. The EasyCache node accelerates inference, but at the cost of some quality and detail. Bypass it (Ctrl+B) if you don’t need it.
  7. Click the Run button, or use the shortcut Ctrl(Cmd) + Enter to run the workflow
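For reference, the scaling in step 2.2 amounts to resizing the longest side to the target value while keeping the aspect ratio. Below is a minimal Pillow sketch of that behavior (the file name is a placeholder, and this approximates the node rather than reproducing its exact resampling):

```python
from PIL import Image

def scale_to_max_dimension(img: Image.Image, max_dim: int) -> Image.Image:
    """Resize so the longest side equals max_dim, preserving aspect ratio."""
    scale = max_dim / max(img.width, img.height)
    return img.resize(
        (round(img.width * scale), round(img.height * scale)),
        Image.LANCZOS,
    )

# Compare the two settings discussed in step 2.2.
img = Image.open("input.png")  # placeholder path
for max_dim in (512, 1024):
    print(max_dim, scale_to_max_dimension(img, max_dim).size)
```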

4. Additional Notes

  1. Style reference only:

We also provide a style-reference-only variant of the same workflow.

[Workflow image]

The only difference is that the content reference nodes are replaced with an Empty Latent Image node.

  2. You can also bypass the whole Style Reference group and use the workflow as a text-to-image workflow, which means this workflow has 4 variations (summarized in code after this list):
  • Subject-Driven Generation: Only use content (subject) reference
  • Style-Driven Generation: Only use style reference
  • Joint Style-Subject Generation: Mixed content and style reference
  • Text-to-Image Generation: As a standard text-to-image workflow
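In code form, the four variations reduce to which reference groups stay active (this just restates the list above; the key names are illustrative):

```python
# True = group active, False = group bypassed with Ctrl+B.
VARIATIONS = {
    "Subject-Driven":      {"content_ref": True,  "style_ref": False},
    "Style-Driven":        {"content_ref": False, "style_ref": True},
    "Joint Style-Subject": {"content_ref": True,  "style_ref": True},
    "Text-to-Image":       {"content_ref": False, "style_ref": False},
}
```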