
ComfyUI Multimodal Prompt Generation Nodes: Connecting Visual Large Models and AIGC Workflows

Tags: ComfyUI, Qwen, multimodal, prompt engineering, vision-language models, AIGC, Wan2.2, image generation, video generation, GGUF
Published 2026-05-09 14:44 · Last activity 2026-05-09 14:53 · Estimated read: 6 min

Section 01

[Introduction] ComfyUI Multimodal Prompt Generation Nodes: Connecting Visual Large Models and AIGC Workflows

ComfyUI-MultiModal-Prompt-Nodes is a plugin designed specifically for ComfyUI that generates and optimizes image and video generation prompts via local Qwen VL series models or the Alibaba Cloud DashScope API. Its core advantage is optimization for Chinese-language contexts, providing an efficient prompt-engineering solution for domestic multimodal models such as Qwen-Image-Edit and Wan2.2 and lowering the barrier to AIGC creation.


Section 02

Project Background and Core Positioning

In the AIGC field, prompt engineering is key to generation quality, yet ordinary users find it difficult to write high-quality English prompts. As a ComfyUI custom node pack, this plugin uses vision-language models (VLMs) to convert simple text or reference images into professional prompts, with deep optimization for the Alibaba Cloud Qwen series and the Wan2.2 video model so they can play to their strengths with Chinese-language input.


Section 03

Core Features and Technical Innovations

  • Multimodal Input: Supports text→prompt, image→prompt, and multi-image input (up to 3 images);
  • Flexible Style System: Five built-in styles: raw, default, detailed, concise, creative;
  • Localized Models: Supports Qwen2.5-VL/Qwen3-VL/Qwen3.5 in GGUF format, running on CPU or GPU;
  • Cloud API: Integrates the Alibaba Cloud DashScope API and supports image token compression to reduce costs (see the sketch after this list).
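
As a rough illustration of the cloud path, here is a minimal sketch of calling a Qwen VL model through the DashScope Python SDK directly, outside ComfyUI; the model name, image path, and instruction text are placeholders, and the plugin's own node code may differ.

```python
# Minimal sketch (not the plugin's code): asking a Qwen VL model via the
# DashScope SDK to turn a reference image plus a short idea into a prompt.
# Assumptions: `pip install dashscope`; the model name, image path, and
# instruction text below are placeholders.
import dashscope
from dashscope import MultiModalConversation

dashscope.api_key = open("api_key.txt").read().strip()

messages = [{
    "role": "user",
    "content": [
        {"image": "file:///path/to/reference.jpg"},  # placeholder local image
        {"text": "Based on this image, write a detailed Chinese prompt "
                 "for an image generation model."},
    ],
}]

response = MultiModalConversation.call(model="qwen-vl-max", messages=messages)
# content is a list of blocks, e.g. [{"text": "..."}]
print(response.output.choices[0].message.content)
```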

Section 04

Deep Optimization for Domestic Models

  • Advantage of Chinese Prompts: Wan2.2 and Qwen-Image-Edit understand Chinese prompts better, so setting target_language to "zh" is recommended;
  • Dedicated Nodes: Vision LLM (general purpose), Qwen Image Edit Prompt Generator (fixes system prompt issues), Wan2.2 Video Prompt Generator (supports 2048-token long prompts); a hypothetical node skeleton follows this list.
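
For readers unfamiliar with ComfyUI's custom-node interface, the following hypothetical skeleton shows the general shape such a dedicated node takes; the class name, input set (including target_language), and the stand-in generation logic are illustrative, not the plugin's actual code.

```python
# Hypothetical skeleton of a ComfyUI prompt-generator node, for illustration
# only -- the class name, inputs, and stand-in logic are placeholders, not
# the plugin's actual implementation.
class Wan22VideoPromptGeneratorSketch:
    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "idea": ("STRING", {"multiline": True}),
                "style": (["raw", "default", "detailed", "concise", "creative"],),
                "target_language": (["zh", "en"], {"default": "zh"}),
            }
        }

    RETURN_TYPES = ("STRING",)
    FUNCTION = "generate"
    CATEGORY = "prompt"

    def generate(self, idea, style, target_language):
        # A real node would call a local GGUF model or the DashScope API here.
        prompt = f"[{style}/{target_language}] {idea}"  # stand-in for the VLM call
        return (prompt,)

NODE_CLASS_MAPPINGS = {"Wan22VideoPromptGeneratorSketch": Wan22VideoPromptGeneratorSketch}
```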

Section 05

Technical Implementation and Dependency Management

  • llama-cpp-python Version Compatibility (a loading sketch follows this list):
    • Official 0.3.16: supports Qwen2.5-VL, but not Qwen3-VL/Qwen3.5;
    • JamePeng branch 0.3.21+: supports Qwen2.5-VL and Qwen3-VL, but not Qwen3.5;
    • JamePeng branch 0.3.33+: supports all three model families; the JamePeng branch is recommended (requires building from source);
  • mmproj Automatic Detection: Supports automatic matching or manual selection of mmproj files;
  • Model Switching Stability: Since v1.0.6, GGUF handling is improved and mmproj is correctly reloaded when switching models.
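
To make the local path concrete, here is a minimal sketch of loading a Qwen2.5-VL GGUF together with its mmproj file in llama-cpp-python, assuming the Qwen25VLChatHandler that recent releases ship; all file paths are placeholders, and the plugin's internal loader (with its automatic mmproj detection) may differ.

```python
# Minimal sketch: loading a Qwen2.5-VL GGUF plus its mmproj projector with
# llama-cpp-python. Paths and the context size are placeholders.
import base64
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Qwen25VLChatHandler

chat_handler = Qwen25VLChatHandler(clip_model_path="mmproj-Qwen2.5-VL.gguf")
llm = Llama(
    model_path="Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,       # leave room for image tokens plus a long prompt
    n_gpu_layers=-1,  # offload all layers to GPU; use 0 for CPU-only
)

img_b64 = base64.b64encode(open("reference.jpg", "rb").read()).decode()
result = llm.create_chat_completion(messages=[{
    "role": "user",
    "content": [
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
        {"type": "text",
         "text": "Describe this image as a detailed generation prompt."},
    ],
}])
print(result["choices"][0]["message"]["content"])
```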

Section 06

Installation and Configuration Guide

  • Standard Installation: Clone the repository into the ComfyUI/custom_nodes directory and run pip install -r requirements.txt;
  • Model Organization: Place GGUF models in ComfyUI/models/LLM/ or ComfyUI/models/text_encoders/;
  • API Configuration: Create api_key.txt in the plugin directory and paste in your Alibaba Cloud DashScope API key (an example layout follows this list).
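
Assuming a default ComfyUI layout, the resulting directory structure looks roughly like this (model filenames are placeholders):

```
ComfyUI/
├── custom_nodes/
│   └── ComfyUI-MultiModal-Prompt-Nodes/
│       └── api_key.txt               # DashScope API key
└── models/
    ├── LLM/
    │   ├── Qwen2.5-VL-7B-Q4_K_M.gguf # placeholder model name
    │   └── mmproj-Qwen2.5-VL.gguf    # matching mmproj file
    └── text_encoders/                # alternative location (v1.0.10+)
```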

Section 07

Usage Scenarios and Best Practices

  • Application Scenarios: Image generation (optimizing Stable Diffusion/FLUX prompts), image editing (generating Qwen-Image-Edit instructions), video generation (Wan2.2 long-prompt support);
  • Recommended Configurations: Privacy first → local Qwen3-VL; quality first → Qwen-VL-Max in the cloud; cost control → enable save_tokens (see the sketch after this list); best results → target_language=zh.
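
How save_tokens compresses image tokens is plugin-internal, but vision-API image costs generally scale with resolution, so the effect is comparable to downscaling the reference image before upload. A minimal sketch of that idea with Pillow (the size cap is an arbitrary placeholder):

```python
# Sketch of the general idea behind image-token saving: fewer pixels in,
# fewer image tokens billed. The 1024px cap is an arbitrary placeholder;
# the plugin's actual save_tokens behavior may differ.
from PIL import Image

def shrink_for_upload(path, max_side=1024):
    img = Image.open(path)
    img.thumbnail((max_side, max_side))  # resizes in place, keeps aspect ratio
    out = path.rsplit(".", 1)[0] + "_small.jpg"
    img.convert("RGB").save(out, quality=85)
    return out
```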

Section 08

Limitations, Version Updates and Conclusion

  • Limitations: Qwen2.5-VL's instruction following is weaker; visual input quality depends on the local llama-cpp-python environment; upgrading to v1.0.10 requires reselecting models;
  • Version Updates: v1.0.10 extends the model search path to models/text_encoders; v1.0.9 fixes system prompt bugs; v1.0.8 adds image input support for llama-cpp-python 0.3.16; v1.0.6 improves model handling;
  • Conclusion: The plugin addresses the pain point of prompt authoring and offers significant value through its localized adaptation, reflecting the rise of the domestic model ecosystem and the trend toward prompt automation.