Zing Forum

HandVQA: Diagnose and Improve Fine-Grained Spatial Reasoning of Hand in Vision-Language Models

This article introduces the HandVQA project accepted by CVPR 2026, a large-scale 3D-annotated hand visual question answering (VQA) benchmark dataset containing over 1.6 million samples. It is designed to diagnose and improve the fine-grained reasoning capabilities of vision-language models (VLMs) in terms of hand joint angles, distances, and spatial positions.

Tags: vision-language models, hand recognition, VQA, 3D annotation, spatial reasoning, CVPR, multimodal, datasets
Published 2026-03-31 01:45 | Recent activity 2026-03-31 01:51 | Estimated read 6 min

Section 01

Introduction / Main Post: HandVQA: Diagnose and Improve Fine-Grained Spatial Reasoning of Hand in Vision-Language Models

This article introduces the HandVQA project accepted by CVPR 2026, a large-scale 3D-annotated hand visual question answering (VQA) benchmark dataset containing over 1.6 million samples. It is designed to diagnose and improve the fine-grained reasoning capabilities of vision-language models (VLMs) in terms of hand joint angles, distances, and spatial positions.

Section 02

Research Background and Motivation

Current vision-language models perform well on general visual understanding tasks but still struggle with fine-grained spatial relationships in hands. The hand is a highly articulated structure with 27 bones and many degrees of freedom, so its pose changes are complex and subtle. Existing VQA datasets focus mainly on object-level recognition and relationships and lack dedicated evaluation of joint-level spatial reasoning about hands. HandVQA fills this gap with a controlled question-answering dataset built from 3D hand joint annotations, giving researchers a tool for diagnosing how well VLMs understand hands.

Section 03

Data Sources

HandVQA is built on three well-known hand datasets:

  • FreiHAND: Contains 3D joint annotations of real hands
  • InterHand2.6M: Large-scale two-hand interaction dataset
  • FPHA: First-person hand action dataset

These datasets provide high-quality 3D hand joint position annotations, laying the foundation for generating geometrically accurate question-answer pairs.

Section 04

Question Generation Strategy

HandVQA converts 3D hand joints into geometry-based posture descriptors and controlled multiple-choice questions. Specifically, the project defines five types of spatial reasoning questions:

  1. Angle: Asks about the angle of a specific joint, e.g., "What is the bending angle of the first joint of the index finger?"
  2. Distance: Asks about the distance between two joint points
  3. Relative Position X: Left-right positional relationship along the X-axis
  4. Relative Position Y: Up-down positional relationship along the Y-axis
  5. Relative Position Z: Front-back positional relationship along the Z-axis
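
The five question types above all reduce to simple geometry on 3D joint coordinates. The following is a minimal sketch of that geometry, not the project's actual code: the function names, the choice of the interior angle between the two adjacent bones, and the sign convention for each axis are all assumptions made for illustration.

```python
import math

def _sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return tuple(x - y for x, y in zip(a, b))

def joint_angle_deg(parent, joint, child):
    """Interior angle at `joint`, in degrees, between the bones
    joint->parent and joint->child (180 means a straight finger).
    Hypothetical helper; a 'bending angle' could be 180 minus this."""
    u, v = _sub(parent, joint), _sub(child, joint)
    norm_u, norm_v = math.hypot(*u), math.hypot(*v)
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero-length bone: joints coincide")
    cos = sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)
    cos = max(-1.0, min(1.0, cos))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(cos))

def joint_distance(a, b):
    """Euclidean distance between two joints, in the units of the input."""
    return math.dist(a, b)

def relative_position(a, b, axis, tol=0.0):
    """Sign of (a - b) along one axis (0 = X, 1 = Y, 2 = Z):
    +1, -1, or 0 when the gap is within `tol`.
    Which sign means 'left', 'up', or 'front' depends on the
    coordinate convention of the source dataset (assumed, not given)."""
    d = a[axis] - b[axis]
    if abs(d) <= tol:
        return 0
    return 1 if d > 0 else -1
```

A tolerance such as `tol` is one way to avoid asking about joints that are nearly aligned on an axis, where the answer would be visually ambiguous; whether HandVQA filters such cases is not stated in this post.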

Section 05

Deterministic Supervision

Unlike VQA datasets that require manual annotation, all labels in HandVQA are computed directly from the geometry of the 3D hand joints. Because the supervision is deterministic, every label is consistent with the underlying 3D annotations by construction, which keeps question-level annotation noise out of model evaluation.
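
To illustrate what "deterministic" means here, the sketch below turns a computed joint angle into a multiple-choice item whose answer follows from the number alone. It is an illustration under stated assumptions: the bin edges, the question wording, and the output fields are hypothetical and are not taken from the HandVQA release.

```python
def angle_question(joint_name, angle_deg, bin_edges=(0, 60, 120, 180)):
    """Build a multiple-choice item from an angle in degrees.
    `bin_edges` are hypothetical; each bin is [low, high), and values
    outside the range fall into the first or last bin."""
    pairs = list(zip(bin_edges[:-1], bin_edges[1:]))
    options = [f"{low}-{high} degrees" for low, high in pairs]

    # Pick the first bin whose upper edge exceeds the angle;
    # default to the last bin so the top edge itself is included.
    index = len(pairs) - 1
    for i, (_, high) in enumerate(pairs):
        if angle_deg < high:
            index = i
            break

    letters = "ABCDEFGH"
    return {
        "question": f"What is the angle at the {joint_name}?",
        "options": {letters[i]: text for i, text in enumerate(options)},
        "answer": letters[index],
    }

item = angle_question("index finger PIP joint", 95.0)
```

The same input always yields the same item, so no human judgment enters the label. The only error sources left are the upstream 3D joint annotations themselves.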

Section 06

Dataset Statistics and Features

HandVQA contains over 1.6 million VQA samples, making it the largest hand-specific VQA dataset to date. Its main features include:

  • Scale: 1.6M+ VQA samples
  • Format: JSONL annotation files + compressed image archives
  • Supervision Type: Deterministic labels based on 3D joint geometry
  • Question Types: Angle, distance, relative positions along X/Y/Z axes

The scale of the dataset supports model training as well as evaluation, and the five question types cover the angular, metric, and per-axis positional aspects of hand spatial understanding.
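
Since the annotations ship as JSONL (one JSON object per line), a first sanity check is to count samples per question type. The sketch below assumes a field named `question_type` and invents three sample records for demonstration; the real schema and file names should be taken from the dataset release.

```python
import json
from collections import Counter

def count_by_type(lines, type_key="question_type"):
    """Count records per question type from an iterable of JSONL lines.
    `type_key` is an assumed field name, not confirmed by the post."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        record = json.loads(line)
        counts[record.get(type_key, "unknown")] += 1
    return counts

# Invented records, for illustration only.
sample_lines = [
    '{"image": "img_0001.jpg", "question_type": "angle", "answer": "B"}',
    '{"image": "img_0001.jpg", "question_type": "distance", "answer": "A"}',
    '{"image": "img_0002.jpg", "question_type": "angle", "answer": "C"}',
]
counts = count_by_type(sample_lines)
```

For a real file, an open file handle can be passed directly in place of `sample_lines`, because iterating over a text file yields one line at a time and avoids loading all 1.6M+ records into memory.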

Section 07

Research Findings and Insights

By evaluating models on the HandVQA benchmark, the research team identified several notable findings:

Section 08

Limitations of Existing VLMs

Even the most powerful vision-language models still perform poorly on fine-grained hand joint poses and precise geometric reasoning. This suggests that current mainstream architectures have structural weaknesses in handling fine-grained spatial relationships.