
ComfyUI WanFunControlToVideo Node

This node was added to support the Alibaba Wan Fun Control model for video generation; it was introduced after this commit.

  • Purpose: Prepare the conditioning information needed for video generation using the Wan 2.1 Fun Control model.

The WanFunControlToVideo node is a ComfyUI addition designed to support the Wan Fun Control models for video generation.

This node prepares the essential conditioning information and initializes the empty latent tensor that guides the subsequent video generation process with the Wan 2.1 Fun model. The node's name indicates its function: it accepts various inputs and converts them into a format suitable for controlling video generation within the WanFun framework.

The node's position in the ComfyUI node hierarchy indicates that it operates in the early stages of the video generation pipeline, focusing on manipulating conditioning signals before the actual sampling or decoding of video frames.

WanFunControlToVideo Node Detailed Analysis

Input Parameters

| Parameter Name | Required | Data Type | Description | Default Value |
| --- | --- | --- | --- | --- |
| positive | Yes | CONDITIONING | Standard ComfyUI positive conditioning data, typically from a "CLIP Text Encode" node. The positive prompt describes the content, subject matter, and artistic style the user envisions for the generated video. | N/A |
| negative | Yes | CONDITIONING | Standard ComfyUI negative conditioning data, typically from a "CLIP Text Encode" node. The negative prompt specifies elements, styles, or artifacts that the user wants to avoid in the generated video. | N/A |
| vae | Yes | VAE | A VAE (Variational Autoencoder) model compatible with the Wan 2.1 Fun model family, used for encoding and decoding image/video data. | N/A |
| width | Yes | INT | The desired width of output video frames in pixels (minimum 16, maximum nodes.MAX_RESOLUTION, step 16). | 832 |
| height | Yes | INT | The desired height of output video frames in pixels (minimum 16, maximum nodes.MAX_RESOLUTION, step 16). | 480 |
| length | Yes | INT | The total number of frames in the generated video (minimum 1, maximum nodes.MAX_RESOLUTION, step 4). | 81 |
| batch_size | Yes | INT | The number of videos generated in a single batch (minimum 1, maximum 4096). | 1 |
| clip_vision_output | No | CLIP_VISION_OUTPUT | (Optional) Visual features extracted by a CLIP vision model, allowing visual style and content guidance. | None |
| start_image | No | IMAGE | (Optional) An initial image that influences the beginning of the generated video. | None |
| control_video | No | IMAGE | (Optional) A preprocessed ControlNet-style reference video that guides the motion and structure of the generated video. | None |
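
For reference, the two optional IMAGE inputs (start_image and control_video) follow ComfyUI's usual IMAGE convention: a float tensor of shape [frames, height, width, channels] with values in the 0-1 range. The snippet below is a minimal sketch of converting a list of RGB frames (NumPy arrays) into that layout; the frames_to_comfy_image helper name is an illustrative assumption, not part of ComfyUI.

import numpy as np
import torch

def frames_to_comfy_image(frames):
    # Hypothetical helper: stack HxWx3 uint8 RGB frames into the
    # [N, H, W, C] float tensor layout ComfyUI uses for IMAGE inputs.
    stacked = np.stack(frames, axis=0).astype(np.float32) / 255.0  # scale to 0..1
    return torch.from_numpy(stacked)

# Example: 81 mid-gray frames at the node's default 832x480 resolution
dummy_frames = [np.full((480, 832, 3), 128, dtype=np.uint8) for _ in range(81)]
control_video = frames_to_comfy_image(dummy_frames)
print(control_video.shape)  # torch.Size([81, 480, 832, 3])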

Output Parameters

| Parameter Name | Data Type | Description |
| --- | --- | --- |
| positive | CONDITIONING | Enhanced positive conditioning data, including the encoded start_image and control_video (as concat_latent_image). |
| negative | CONDITIONING | Negative conditioning data enhanced in the same way, containing the same concat_latent_image. |
| latent | LATENT | A dictionary containing an empty latent tensor under the key "samples". |
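
To show where these outputs go, here is a hypothetical fragment of an API-format (JSON-style) workflow: positive and negative feed a sampler's conditioning inputs and latent feeds its latent_image input. The node IDs and sampler settings below are placeholder assumptions, with node "3" standing in for a WanFunControlToVideo node whose outputs are referenced as [node_id, output_index] (0 = positive, 1 = negative, 2 = latent).

workflow_fragment = {
    "4": {
        "class_type": "KSampler",
        "inputs": {
            "model": ["1", 0],         # assumed Wan 2.1 Fun diffusion model loader
            "positive": ["3", 0],      # positive CONDITIONING from WanFunControlToVideo
            "negative": ["3", 1],      # negative CONDITIONING from WanFunControlToVideo
            "latent_image": ["3", 2],  # empty LATENT from WanFunControlToVideo
            "seed": 0,
            "steps": 20,
            "cfg": 6.0,
            "sampler_name": "uni_pc",
            "scheduler": "simple",
            "denoise": 1.0,
        },
    },
}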

Node Example Workflow

Please visit Wan Fun Control Node Example Workflow to understand how ComfyUI natively supports Wan Fun Control models.

Node Source Code

Node source code, at commit 3661c833bcc41b788a7c9f0e7bc48524f8ee5f82:

# Imports required by this excerpt (as used in the surrounding ComfyUI module)
import torch

import comfy.latent_formats
import comfy.model_management
import comfy.utils
import node_helpers
import nodes


class WanFunControlToVideo:
    @classmethod
    def INPUT_TYPES(s):
        return {"required": {"positive": ("CONDITIONING", ),
                             "negative": ("CONDITIONING", ),
                             "vae": ("VAE", ),
                             "width": ("INT", {"default": 832, "min": 16, "max": nodes.MAX_RESOLUTION, "step": 16}),
                             "height": ("INT", {"default": 480, "min": 16, "max": nodes.MAX_RESOLUTION, "step": 16}),
                             "length": ("INT", {"default": 81, "min": 1, "max": nodes.MAX_RESOLUTION, "step": 4}),
                             "batch_size": ("INT", {"default": 1, "min": 1, "max": 4096}),
                },
                "optional": {"clip_vision_output": ("CLIP_VISION_OUTPUT", ),
                             "start_image": ("IMAGE", ),
                             "control_video": ("IMAGE", ),
                }}
 
    RETURN_TYPES = ("CONDITIONING", "CONDITIONING", "LATENT")
    RETURN_NAMES = ("positive", "negative", "latent")
    FUNCTION = "encode"
 
    CATEGORY = "conditioning/video_models"
 
    def encode(self, positive, negative, vae, width, height, length, batch_size, start_image=None, clip_vision_output=None, control_video=None):
        latent = torch.zeros([batch_size, 16, ((length - 1) // 4) + 1, height // 8, width // 8], device=comfy.model_management.intermediate_device())
        concat_latent = torch.zeros([batch_size, 16, ((length - 1) // 4) + 1, height // 8, width // 8], device=comfy.model_management.intermediate_device())
        concat_latent = comfy.latent_formats.Wan21().process_out(concat_latent)
        concat_latent = concat_latent.repeat(1, 2, 1, 1, 1)
 
        if start_image is not None:
            start_image = comfy.utils.common_upscale(start_image[:length].movedim(-1, 1), width, height, "bilinear", "center").movedim(1, -1)
            concat_latent_image = vae.encode(start_image[:, :, :, :3])
            concat_latent[:,16:,:concat_latent_image.shape[2]] = concat_latent_image[:,:,:concat_latent.shape[2]]
 
        if control_video is not None:
            control_video = comfy.utils.common_upscale(control_video[:length].movedim(-1, 1), width, height, "bilinear", "center").movedim(1, -1)
            concat_latent_image = vae.encode(control_video[:, :, :, :3])
            concat_latent[:,:16,:concat_latent_image.shape[2]] = concat_latent_image[:,:,:concat_latent.shape[2]]
 
        positive = node_helpers.conditioning_set_values(positive, {"concat_latent_image": concat_latent})
        negative = node_helpers.conditioning_set_values(negative, {"concat_latent_image": concat_latent})
 
        if clip_vision_output is not None:
            positive = node_helpers.conditioning_set_values(positive, {"clip_vision_output": clip_vision_output})
            negative = node_helpers.conditioning_set_values(negative, {"clip_vision_output": clip_vision_output})
 
        out_latent = {}
        out_latent["samples"] = latent
        return (positive, negative, out_latent)

Analysis of the encode Function

The encode function in the WanFunControlToVideo node converts the input parameters into the conditioning information and the initial latent tensor used by the subsequent video generation model.

The function first initializes an empty latent tensor named latent with the shape [batch_size, 16, ((length - 1) // 4) + 1, height // 8, width // 8]. This tensor is placed on comfy.model_management.intermediate_device(), the device ComfyUI uses for intermediate results. Next, a second tensor concat_latent is created with the same shape, passed through the Wan 2.1 latent format's process_out, and repeated along the channel dimension to 32 channels: the first 16 channels receive the encoded control_video and the last 16 the encoded start_image. This tensor stores the encoded information from the optional start_image and control_video inputs.
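
To make the shape concrete, the arithmetic below plugs the node's default settings (width 832, height 480, length 81, batch_size 1) into the same expression used in encode; nothing here is new beyond the defaults.

import torch

batch_size, width, height, length = 1, 832, 480, 81

# Same shape expression as in encode(): 16 latent channels, temporal dimension
# compressed roughly 4x, spatial dimensions compressed 8x.
shape = [batch_size, 16, ((length - 1) // 4) + 1, height // 8, width // 8]
print(shape)  # [1, 16, 21, 60, 104]

latent = torch.zeros(shape)
print(latent.shape)  # torch.Size([1, 16, 21, 60, 104])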

After the optional visual inputs are processed, concat_latent holds the encoded information from start_image and control_video (if provided). It is then attached to both the positive and negative conditioning under the "concat_latent_image" key using node_helpers.conditioning_set_values.
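
Conceptually, node_helpers.conditioning_set_values copies each (embedding, options) pair in a conditioning list and merges the given keys into the options dictionary. The sketch below is a simplified illustration of that behaviour, not the exact ComfyUI implementation.

def conditioning_set_values_sketch(conditioning, values):
    # Simplified illustration: conditioning is a list of [tensor, options_dict]
    # pairs; each options dict is copied and updated with the new keys
    # (here, "concat_latent_image" and later "clip_vision_output").
    out = []
    for embedding, options in conditioning:
        new_options = options.copy()
        new_options.update(values)
        out.append([embedding, new_options])
    return out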

Finally, the function checks if clip_vision_output is provided. If so, it is also added to both the positive and negative conditioning under the "clip_vision_output" key. This allows visual features extracted by the CLIP model to further refine the generation process.

WanFun Control is primarily used with the Wan 2.1 model family. The method is inspired by ControlNet, a powerful technique widely used in image generation to modulate output through various spatial and structural inputs. WanFun Control extends these principles to the temporal domain of video, enabling users to achieve a high degree of influence over the generated video content, going beyond the limitations of purely text-driven methods. It utilizes visual information extracted from input videos (such as depth maps, edge contours (Canny), or human poses (OpenPose)) to facilitate the creation of controlled videos.
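
As an illustration of that preprocessing step, the sketch below uses OpenCV's Canny edge detector to turn an input clip into an edge-map control video in ComfyUI's IMAGE layout ([frames, height, width, 3], values 0-1). The file path, thresholds, and helper name are assumptions; in practice this is usually done with dedicated preprocessor nodes inside ComfyUI.

import cv2
import numpy as np
import torch

def canny_control_video(path, width=832, height=480, max_frames=81):
    # Hypothetical sketch: read a clip, run Canny edge detection per frame,
    # and stack the results into a [frames, H, W, 3] float tensor in 0..1.
    cap = cv2.VideoCapture(path)
    frames = []
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (width, height))
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        edges = cv2.Canny(gray, 100, 200)                    # single-channel edge map
        edges_rgb = np.repeat(edges[:, :, None], 3, axis=2)  # replicate to 3 channels
        frames.append(edges_rgb.astype(np.float32) / 255.0)
    cap.release()
    return torch.from_numpy(np.stack(frames, axis=0))        # feed into control_video

control_video = canny_control_video("reference_clip.mp4")    # assumed local file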

The Wan 2.1 model family is the foundation for WanFun Control, offering different parameter variants including 1.3B and 14B models, providing users with options to balance computational resources with desired output quality and complexity.

The basic concept of WanFun Control is utilizing visual cues from a reference video to guide the AI's creative process. Users can provide a "control video" that embodies the desired motion or spatial arrangement, rather than solely relying on text prompts to determine the motion, structure, and style of the generated video. This allows for a more direct and intuitive way to create videos with specific characteristics. For example, a user might provide a video of a person walking, and the WanFun Control system will generate a new video of a different subject performing the same walking motion, while adhering to the text prompt for the subject's appearance and the overall scene. This structured approach to video generation, combining visual data with textual descriptions, leads to outputs with higher motion accuracy, improved stylization effects, and the ability to achieve more intentional visual transformations.