Zing Forum

AssemLM: A Spatial Reasoning Multimodal Large Language Model for Robotic Assembly Tasks

AssemLM is a spatial-reasoning multimodal large language model designed specifically for robotic assembly tasks. By integrating assembly manuals, point cloud data, and text instructions, it infers and predicts key 6D assembly poses, and it achieves leading performance on the AssemBench benchmark of over 900,000 samples.

Tags: spatial reasoning · multimodal large language models · robotic assembly · 6D pose estimation · point cloud processing · embodied intelligence · vision-language models
Published 2026-04-13 11:11 · Recent activity 2026-04-13 11:19 · Estimated read: 7 min

Section 01

[Introduction] AssemLM: A Spatial Reasoning Multimodal Large Language Model for Robotic Assembly Tasks

AssemLM was proposed by the China Telecom Artificial Intelligence Research Institute in collaboration with Fudan University, Tianjin University, Northwestern Polytechnical University, and City University of Hong Kong. It is a spatial-reasoning multimodal large language model designed specifically for robotic assembly tasks. By integrating assembly manuals, point cloud data, and text instructions, it infers and predicts key 6D assembly poses, and it achieves leading performance on the AssemBench benchmark of over 900,000 samples, offering an effective technical path for applying embodied intelligence to industrial assembly.

Section 02

Research Background and Challenges

Spatial reasoning is one of the core foundational capabilities of embodied intelligence, and it is particularly critical for fine manipulation tasks such as robotic assembly. Although recent vision-language models (VLMs) have demonstrated preliminary spatial perception, they rely mainly on coarse-grained 2D perception and cannot reason precisely about 3D geometry. This limitation is especially evident in assembly tasks that demand high-precision operation: a robot must not only "see" the parts but also understand the 3D spatial relationships, orientations, and precise poses between them.

Existing multimodal large language models face three challenges in assembly tasks. First, 2D image representations struggle to capture fine-grained 3D geometric features. Second, comprehensive datasets and evaluation benchmarks dedicated to assembly tasks are lacking. Third, effectively bridging raw 3D perception and high-level reasoning remains an unsolved technical problem.

Section 03

Core Architecture of AssemLM

The core innovation of AssemLM is the integration of three key information sources (assembly manuals, point cloud data, and text instructions) to infer and predict 6D assembly poses. The model has two key components: a purpose-built point cloud encoder that processes 3D point cloud data directly to capture fine-grained geometric and rotational features, and a multimodal fusion module that combines the point cloud features with the language model's semantic understanding to support precise 3D spatial reasoning.
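The summary does not detail AssemLM's actual fusion mechanism, but the two components can be illustrated with a minimal sketch: a PointNet-style encoder (a shared per-point MLP followed by order-invariant max pooling) produces a global point cloud feature, and a learned projection maps that feature into the language model's embedding space as an extra "visual" token prepended to the text tokens. All names, dimensions, and random weights below are hypothetical placeholders, not the AssemLM implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def pointnet_encode(points, w1, w2):
    """PointNet-style encoder sketch: shared per-point MLP, then max pooling.

    points: (N, 3) point cloud; returns a single global feature vector.
    """
    h = np.maximum(points @ w1, 0.0)   # per-point layer with ReLU
    h = np.maximum(h @ w2, 0.0)        # second shared layer
    return h.max(axis=0)               # permutation-invariant max pool

def fuse(point_feat, text_embeds, w_proj):
    """Project the point cloud feature into the LM embedding space and
    prepend it to the text token embeddings as one extra token."""
    pc_token = point_feat @ w_proj     # (d_model,)
    return np.vstack([pc_token, text_embeds])

# Toy dimensions, illustrative only.
n_points, d_hidden, d_feat, d_model, n_tokens = 256, 64, 128, 32, 5
points = rng.normal(size=(n_points, 3))
w1 = rng.normal(size=(3, d_hidden)) * 0.1
w2 = rng.normal(size=(d_hidden, d_feat)) * 0.1
w_proj = rng.normal(size=(d_feat, d_model)) * 0.1
text_embeds = rng.normal(size=(n_tokens, d_model))

seq = fuse(pointnet_encode(points, w1, w2), text_embeds, w_proj)
print(seq.shape)  # (6, 32): one point-cloud token plus five text tokens
```

The max pooling makes the point cloud feature invariant to point order, which is why PointNet-style encoders are a common choice for raw point cloud input; whether AssemLM uses one global token or many is not stated in this summary.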

Section 04

AssemBench Benchmark Dataset

The research team built AssemBench, a large-scale dataset and evaluation benchmark containing over 900,000 multimodal samples, each with precise 6D pose annotations. It extends spatial reasoning evaluation from 2D perception to 3D geometric reasoning, filling a gap in embodied intelligence evaluation. The dataset covers scenarios of varying complexity, part types, and assembly sequences, simulating the conditions of real industrial environments.
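AssemBench's exact schema is not given in this summary. The sketch below shows one plausible layout of a single multimodal sample, with the 6D pose stored as a translation vector plus a unit quaternion; every field name here is an assumption for illustration, not the benchmark's real format.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class AssemblySample:
    """One hypothetical AssemBench-style sample (field names assumed)."""
    part_id: str                      # which part this step manipulates
    instruction: str                  # natural-language step instruction
    manual_page: str                  # path to the assembly-manual image
    point_cloud: str                  # path to the part's point cloud file
    translation: Tuple[float, float, float]           # target (x, y, z), metres
    rotation_wxyz: Tuple[float, float, float, float]  # target orientation quaternion

sample = AssemblySample(
    part_id="bracket_07",
    instruction="Insert the bracket into the left rail slot.",
    manual_page="manual/step_03.png",
    point_cloud="clouds/bracket_07.ply",
    translation=(0.12, -0.05, 0.30),
    rotation_wxyz=(1.0, 0.0, 0.0, 0.0),  # identity orientation
)
print(sample.part_id, len(sample.rotation_wxyz))  # bracket_07 4
```

A translation plus unit quaternion is one standard way to represent a 6D pose; rotation matrices or axis-angle vectors are equally common alternatives.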

Section 05

Experimental Results and Performance

AssemLM achieves state-of-the-art performance on the 6D pose reasoning tasks of the AssemBench benchmark, accurately predicting target part poses and understanding assembly spatial constraints and sequence dependencies. Validation on real robot platforms shows that the model supports fine-grained multi-step assembly execution and generalizes well, demonstrating its practical value on real-world problems.
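The summary does not state which error metrics the benchmark uses. Two standard ways to score a predicted 6D pose are translation error (Euclidean distance between positions) and rotation error (the geodesic angle between orientations). A minimal sketch, assuming orientations are given as quaternions in (w, x, y, z) order:

```python
import numpy as np

def translation_error(t_pred, t_gt):
    """Euclidean distance between predicted and ground-truth positions."""
    return float(np.linalg.norm(np.asarray(t_pred, float) - np.asarray(t_gt, float)))

def rotation_error_deg(q_pred, q_gt):
    """Geodesic angle in degrees between two quaternions (w, x, y, z)."""
    q_pred = np.asarray(q_pred, float) / np.linalg.norm(q_pred)
    q_gt = np.asarray(q_gt, float) / np.linalg.norm(q_gt)
    dot = abs(float(np.dot(q_pred, q_gt)))   # |dot| because q and -q are the same rotation
    return float(np.degrees(2.0 * np.arccos(min(dot, 1.0))))

# Identity orientation vs a 90-degree rotation about the z axis:
q_identity = (1.0, 0.0, 0.0, 0.0)
q_z90 = (np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4))
print(round(rotation_error_deg(q_identity, q_z90), 1))        # 90.0
print(round(translation_error((0, 0, 0), (0.03, 0.04, 0.0)), 6))  # 0.05
```

Taking the absolute value of the quaternion dot product before the arccos handles the double-cover property (q and -q encode the same rotation); clamping the dot product to 1.0 guards against floating-point values slightly above the arccos domain.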

Section 06

Technical Contributions and Application Prospects

The technical contributions of AssemLM span three aspects: architecturally, it combines a 3D perception module with a general-purpose language model; in terms of data, it provides the first large-scale assembly-oriented spatial reasoning benchmark; in terms of application, it verifies practicality through real-robot experiments. It opens a new path for intelligent manufacturing and can serve as a core component of next-generation intelligent robots, supporting complex and flexible automated assembly operations.

Section 07

Summary and Outlook

AssemLM represents important progress for multimodal large language models in embodied intelligence, addressing the problems of spatial reasoning and 3D geometry understanding. The research team has open-sourced the code and project page. Future work will explore more complex assembly scenarios, multi-robot collaborative assembly, and seamless integration with other stages of the manufacturing pipeline.