Insert Anything: Open-Source Framework for Seamless Image Insertion
Insert Anything is a new open-source image editing framework developed jointly by a research team (Wensong Song, Hong Jiang, Zongxing Yang, Ruijie Quan, Yi Yang) from Zhejiang University, Harvard University, and Nanyang Technological University. The framework seamlessly integrates objects from reference images into target scenes under user-specified control guidance.
This unified image insertion framework supports a range of practical applications, including artistic creation, realistic face replacement, movie scene composition, virtual clothing try-on, accessory customization, and digital prop replacement, demonstrating its versatility and effectiveness across diverse image editing tasks.
Key Features
- Unified Insertion Framework: A single model handles multiple insertion scenarios, with no need to train separate models for different tasks
- Multiple Control Methods: Supports mask-based and text-based editing guidance
- Identity Feature Preservation: Accurately captures identity features and fine details, while allowing diverse local adjustments in style, color, and texture
- Context Editing Mechanism: Treats reference images as contextual information, using two prompting strategies to harmoniously blend inserted elements with the target scene
- Low VRAM Version Support: Provides a Nunchaku-based version that runs in 10GB of VRAM, making it accessible to ordinary users
Application Showcases
Meme Creation
Meme creation is an important application scenario for Insert Anything.
Commercial Advertisement Design
Commercial advertisement design is another important application scenario for Insert Anything.
Pop Culture Creation
Pop culture creation showcases Insert Anything’s potential in creative content generation.
Technical Highlights
Insert Anything builds on the multimodal attention mechanism of the Diffusion Transformer (DiT) and supports both mask-based and text-based editing. Depending on the prompt type, the unified framework encodes the input images (combinations of reference image, source image, and mask) with a frozen VAE encoder to preserve high-frequency details, and extracts semantic guidance from image and text encoders. The resulting embeddings are combined and fed into learnable DiT transformer blocks for in-context learning, enabling precise and flexible image insertion driven by mask or text prompts.
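To make that data flow concrete, the sketch below mocks the idea in PyTorch: frozen encoders produce detail-preserving and semantic tokens, which are concatenated into a single context sequence and processed by learnable transformer blocks. All module names, shapes, and the token layout (as well as the omission of an explicit mask input) are simplifying assumptions for illustration, not the project's actual implementation.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Minimal self-attention block standing in for a learnable DiT block."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, tokens):
        h = self.norm1(tokens)
        tokens = tokens + self.attn(h, h, h, need_weights=False)[0]
        return tokens + self.mlp(self.norm2(tokens))

def insert_in_context(frozen_vae, semantic_encoder, dit_blocks,
                      source_tokens, reference_tokens, prompt_emb):
    """Combine detail-preserving latents with semantic guidance as one context."""
    with torch.no_grad():                          # the VAE stays frozen
        src_lat = frozen_vae(source_tokens)        # target-scene latents (high-frequency detail)
        ref_lat = frozen_vae(reference_tokens)     # reference-object latents
    sem = semantic_encoder(reference_tokens)       # image-side semantic guidance
    # One token sequence lets multimodal attention relate scene, object, and prompt.
    ctx = torch.cat([src_lat, ref_lat, sem, prompt_emb], dim=1)
    for block in dit_blocks:
        ctx = block(ctx)
    # Read back only the positions that correspond to the target scene.
    return ctx[:, : src_lat.shape[1]]

# Toy run with random tensors standing in for pre-tokenized images and text.
dim, B = 64, 1
frozen_vae = nn.Linear(dim, dim)                   # placeholder for the frozen VAE
semantic_encoder = nn.Linear(dim, dim)             # placeholder for the image encoder
dit_blocks = nn.ModuleList(DiTBlock(dim) for _ in range(4))
source = torch.randn(B, 256, dim)                  # target-scene tokens
reference = torch.randn(B, 64, dim)                # reference-object tokens
prompt = torch.randn(B, 16, dim)                   # text-prompt embeddings
edited = insert_in_context(frozen_vae, semantic_encoder, dit_blocks,
                           source, reference, prompt)
print(edited.shape)                                # torch.Size([1, 256, 64])
```

The point of concatenating everything into one sequence is that the attention layers can relate the reference object to the target scene in context, without any task-specific architecture.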
AnyInsertion Dataset
To train this unified framework, the research team created the AnyInsertion dataset, which contains approximately 120,000 prompt-image pairs covering various insertion tasks such as person, object, and clothing insertion. The dataset is divided into mask-based and text-based categories, each further subdivided into accessories, objects, and person subcategories.
The image pairs in the dataset are sourced from internet resources, human videos, and multi-view images, and cover a variety of insertion scenarios:
- Furniture and interior decoration
- Daily necessities
- Clothing and accessories
- Transportation vehicles
- People
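For readers who want to explore the data, here is a minimal, hedged sketch for fetching AnyInsertion from the Hugging Face Hub. The repository id matches the dataset link given in the next section; the exact file layout and split names are assumptions, so check the dataset card before relying on them.

```python
# Minimal sketch: download the AnyInsertion dataset snapshot locally.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="WensongSong/AnyInsertion",
                              repo_type="dataset")
print("AnyInsertion downloaded to:", local_dir)

# If the repo exposes loadable splits, the `datasets` library can read it
# directly; whether a configuration name is required depends on the dataset card.
# from datasets import load_dataset
# ds = load_dataset("WensongSong/AnyInsertion")
# print(ds)
```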
Open Source and Usage
The Insert Anything project has been open-sourced on GitHub, and anyone can freely download and use it:
- GitHub Repository: song-wensong/insert-anything
- Dataset: WensongSong/AnyInsertion
The project can be used in several ways:
- Command-line inference scripts
- Gradio interface
- ComfyUI integration nodes
Hardware Requirements
Insert Anything offers two versions:
- Standard Version: Requires 26GB or 40GB VRAM
- Lightweight Version: Optimized version based on Nunchaku, requires only 10GB VRAM
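Before picking a version, it can help to check how much VRAM the local GPU actually has. The snippet below is a small sketch that assumes a CUDA-capable PyTorch install and simply compares the reported memory against the figures above.

```python
# Rough VRAM check to decide between the standard and lightweight versions.
import torch

if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if total_gb >= 26:
        print(f"{total_gb:.0f} GB VRAM: the standard version should fit.")
    elif total_gb >= 10:
        print(f"{total_gb:.0f} GB VRAM: try the Nunchaku-based lightweight version.")
    else:
        print(f"{total_gb:.0f} GB VRAM: below the documented minimum.")
else:
    print("No CUDA device detected.")
```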
Future Plans
According to the official GitHub repository information, the team plans to:
- Release training code
- Release the AnyInsertion text prompt dataset on HuggingFace
The release of this open-source framework gives creative professionals, designers, and content creators a powerful tool for more flexible and precise image editing.