HOI-MLLM: Open-World Human-Object Interaction Detection Driven by Multimodal Large Language Models

The HOI-MLLM project innovatively combines multimodal large language models (MLLMs) with chain-of-thought (CoT) reasoning to achieve open-world human-object interaction (HOI) detection, breaking through the limitations of traditional closed-set approaches and opening up new paths for visual understanding.

Tags: Human-Object Interaction Detection · Multimodal Large Models · Chain-of-Thought Reasoning · Open World · Computer Vision · Visual Question Answering · MLLM
Published 2026-05-02 03:38 · Recent activity 2026-05-02 03:51 · Estimated read: 7 min

Section 01

HOI-MLLM Project Overview: Open-World Human-Object Interaction Detection Driven by Multimodal Large Language Models

HOI-MLLM combines multimodal large language models (MLLMs) with chain-of-thought (CoT) reasoning to achieve open-world human-object interaction (HOI) detection, moving beyond the limitations of traditional closed-set approaches. Through a generative paradigm and interpretable reasoning mechanisms, the project tackles the effectively unbounded variety of interaction types found in the real world.


Section 02

Background: Closed-Set Dilemma of Traditional HOI Detection

Human-Object Interaction (HOI) detection is a core task in computer vision that aims to identify interaction relationships between humans and objects in images. Traditional methods are trained on a predefined set of interaction categories and can therefore only recognize the types present in the training data. This closed-set setting runs into trouble in practice: the space of real-world interactions is effectively unbounded, with new actions, tools, and scenarios emerging constantly. Models locked to fixed categories break down in open scenarios, which makes moving beyond closed-set limitations a key research problem.


Section 03

Methodology: Core Innovations of HOI-MLLM - Generative Paradigm and MLLM Application

HOI-MLLM leverages the generalization capabilities of MLLMs to handle open-world HOI detection. Trained on massive image-text corpora, MLLMs bring broad knowledge of visual concepts and strong natural-language generation abilities. The project reformulates HOI detection as a visual question-answering (VQA) task: given an image, the model freely generates natural-language interaction descriptions rather than selecting from a fixed label set. Because the output space is open-ended text, this generative paradigm naturally supports open-world scenarios.
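To make the reformulation concrete, here is a minimal sketch of HOI detection posed as open-ended VQA. The prompt wording, the HOIDescription structure, and the mllm.generate interface are illustrative assumptions, not HOI-MLLM's published code.

```python
# Sketch: HOI detection as open-ended VQA. `mllm` stands in for any
# multimodal LLM exposing a hypothetical generate(image, prompt) call.
from dataclasses import dataclass

@dataclass
class HOIDescription:
    subject: str  # e.g. "person"
    verb: str     # free-form text, not drawn from a fixed label set
    target: str   # e.g. "apple"

HOI_PROMPT = (
    "Describe every interaction between a person and an object in this "
    "image as one '<subject> <verb> <object>' triplet per line."
)

def detect_interactions(mllm, image):
    # Free-form generation replaces closed-set classification: the model
    # is never restricted to verbs seen during training.
    raw = mllm.generate(image=image, prompt=HOI_PROMPT)  # hypothetical API
    triplets = []
    for line in raw.strip().splitlines():
        parts = line.split(maxsplit=2)  # naive parse; real output needs sturdier handling
        if len(parts) == 3:
            triplets.append(HOIDescription(*parts))
    return triplets
```

The key point is the return type: an unconstrained verb string rather than an index into a fixed category list.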


Section 04

Key Mechanism: Chain-of-Thought Reasoning Enhances Detection Accuracy and Interpretability

HOI-MLLM introduces a chain-of-thought (CoT) reasoning mechanism that analyzes scenes through explicit multi-step reasoning: first locate the humans and objects, then analyze their spatial relationships, and finally infer the interaction type. This step-by-step reasoning improves detection accuracy while also enhancing interpretability and robustness. For example, when the model predicts the interaction "a person cutting an apple", its reasoning chain can be traced: identify "person" and "apple" → notice their spatial proximity → factor in the presence of a knife → infer the "cutting" action. Such a transparent process is especially valuable in high-stakes scenarios.
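A sketch of what such a three-step CoT prompt might look like follows; the wording mirrors the steps described above but is an assumption, not the project's actual template.

```python
# Hypothetical CoT prompt mirroring the three reasoning steps in the text.
COT_PROMPT = """Analyze the human-object interactions in this image step by step.
Step 1: List every person and every object visible in the image.
Step 2: For each person-object pair, describe the spatial relationship
        (e.g. holding, standing near, far apart).
Step 3: Combining Steps 1-2 with contextual cues (tools, posture),
        state each interaction as a '<subject> <verb> <object>' triplet.
"""

def detect_with_cot(mllm, image):
    # Return the full reasoning trace, not just the final triplets,
    # so every prediction can be audited step by step.
    return mllm.generate(image=image, prompt=COT_PROMPT)  # hypothetical API
```

Keeping the whole trace, rather than only Step 3's answer, is what makes a prediction like "cutting" auditable after the fact.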


Section 05

Technical Architecture: Visual-Language Collaboration and Prompt Strategy Design

The technical architecture of HOI-MLLM follows the mainstream design of multimodal large models: a visual encoder (e.g., CLIP) converts the image into a sequence of visual features; these features, together with a text prompt, are fed into the large language model; and the language model autoregressively generates the interaction descriptions. A key challenge is designing effective prompt strategies that guide the model toward interaction-relevant visual cues and yield structured, accurate descriptions. The project explores a variety of prompt templates and fine-tuning strategies to balance general capabilities against task-specific HOI performance.
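The data flow can be sketched schematically in PyTorch. The module names, dimensions, and the assumption that the language model accepts pre-computed input embeddings are all illustrative; this is a sketch of the general encoder-projector-LLM pattern, not HOI-MLLM's actual implementation.

```python
import torch
import torch.nn as nn

class VisionLanguageHOIModel(nn.Module):
    """Schematic encoder -> projector -> causal LM pipeline (illustrative)."""

    def __init__(self, vision_encoder, language_model,
                 vision_dim=768, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder             # e.g. a frozen CLIP ViT
        self.projector = nn.Linear(vision_dim, llm_dim)  # patch features -> token space
        self.language_model = language_model             # any causal LM accepting inputs_embeds

    def forward(self, pixel_values, prompt_embeds):
        # 1) Encode the image into a sequence of patch features.
        patch_feats = self.vision_encoder(pixel_values)  # (B, N_patches, vision_dim)
        # 2) Project patch features into the LLM's token-embedding space.
        visual_tokens = self.projector(patch_feats)      # (B, N_patches, llm_dim)
        # 3) Prepend visual tokens to the embedded prompt and decode
        #    autoregressively to produce interaction descriptions.
        inputs = torch.cat([visual_tokens, prompt_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs)
```

Under this pattern, the prompt-strategy question the section raises reduces to what text is embedded into prompt_embeds alongside the projected visual tokens.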


Section 06

Application Prospects: Broad Deployment Scenarios from Academia to Industry

Open-world HOI detection has broad application prospects: in intelligent surveillance, it can identify abnormal interactions (e.g., holding weapons, falling) without predefining all anomalies; in human-computer interaction, it supports natural instruction understanding (e.g., "pass the book"); in content creation/social media, it automatically generates descriptions for images and videos to support recommendation and moderation; in robotics, it provides a foundation for grasp planning and collaborative operations.


Section 07

Challenges and Future Directions: Next Steps for Open-World HOI Detection

HOI-MLLM still faces challenges, including fine-grained interaction recognition (e.g., distinguishing "cutting" from "peeling") and complex scenes involving multiple people. Future directions include: integrating video temporal information to improve understanding of dynamic interactions, introducing 3D spatial reasoning to handle occlusion and depth relationships, developing efficient fine-tuning methods, and building large-scale open-world HOI datasets. The project represents an important step in the evolution of visual understanding toward generality and openness, and it merits continued attention.