# HandVQA: Diagnose and Improve Fine-Grained Spatial Reasoning of Hand in Vision-Language Models

> This article introduces HandVQA, a project accepted to CVPR 2026. HandVQA is a large-scale, 3D-annotated hand visual question answering (VQA) benchmark with over 1.6 million samples, designed to diagnose and improve how well vision-language models (VLMs) reason about hand joint angles, distances, and spatial positions.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-03-30T17:45:16.000Z
- Last activity: 2026-03-30T17:51:33.102Z
- Popularity: 159.9
- Keywords: vision-language models, hand recognition, VQA, 3D annotation, spatial reasoning, CVPR, multimodal, dataset
- Page link: https://www.zingnex.cn/en/forum/thread/handvqa
- Canonical: https://www.zingnex.cn/forum/thread/handvqa
- Markdown source: floors_fallback

---

## Research Background and Motivation

Current vision-language models perform well on general visual understanding tasks, but they remain weak at fine-grained spatial reasoning about hands. The human hand is a highly articulated structure with 27 bones and many degrees of freedom, so its pose changes are complex and subtle. Existing VQA datasets focus mainly on object-level recognition and relationships and offer no dedicated evaluation of joint-level hand reasoning. HandVQA fills this gap with a controlled question-answering dataset built from 3D hand joint annotations, giving researchers a tool for diagnosing how well VLMs understand hands.

## Data Sources

HandVQA is built on three established hand datasets:
- **FreiHAND**: Real hand images with 3D joint annotations
- **InterHand2.6M**: Large-scale two-hand interaction dataset
- **FPHA**: First-person hand action dataset

These datasets provide high-quality 3D hand joint position annotations, laying the foundation for generating geometrically accurate question-answer pairs.

## Question Generation Strategy

HandVQA converts 3D hand joint coordinates into geometry-based pose descriptors and then into controlled multiple-choice questions. The project defines five types of spatial reasoning questions:

1. **Angle**: Asks about the angle of a specific joint, e.g., "What is the bending angle of the first joint of the index finger?"
2. **Distance**: Asks about the distance between two joint points
3. **Relative Position X**: Left-right positional relationship along the X-axis
4. **Relative Position Y**: Up-down positional relationship along the Y-axis
5. **Relative Position Z**: Front-back positional relationship along the Z-axis
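The five question types above reduce to elementary vector geometry on 3D joint coordinates. The following is a minimal sketch in Python, not the authors' implementation. The function names, the `tol` parameter, and the axis convention (index 0 = X, 1 = Y, 2 = Z, in whatever coordinate frame the annotations use) are assumptions made for illustration.

```python
import math
from typing import Sequence


def _sub(a: Sequence[float], b: Sequence[float]) -> tuple:
    """Component-wise difference a - b of two 3D points."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _norm(v: Sequence[float]) -> float:
    """Euclidean length of a 3D vector."""
    return math.sqrt(sum(c * c for c in v))


def joint_angle_deg(parent, joint, child) -> float:
    """Angle at `joint` between the bones joint->parent and joint->child.

    Returns degrees in [0, 180]; 180 means the three joints are collinear
    (a fully straight finger segment).
    """
    u = _sub(parent, joint)
    v = _sub(child, joint)
    denom = _norm(u) * _norm(v)
    if denom == 0.0:
        raise ValueError("degenerate bone: two joints coincide")
    cos_t = sum(a * b for a, b in zip(u, v)) / denom
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos_t = max(-1.0, min(1.0, cos_t))
    return math.degrees(math.acos(cos_t))


def joint_distance(a, b) -> float:
    """Euclidean distance between two joints, in the units of the input."""
    return _norm(_sub(a, b))


def relative_position(a, b, axis: int, tol: float = 0.0) -> int:
    """Sign of a[axis] - b[axis]: +1, -1, or 0 when within `tol`.

    Which sign means "left", "above", or "in front" depends on the
    coordinate frame of the annotations (an assumption left to the caller).
    """
    d = a[axis] - b[axis]
    if abs(d) <= tol:
        return 0
    return 1 if d > 0 else -1
```

For example, `joint_angle_deg((0, 0, 0), (1, 0, 0), (1, 1, 0))` describes a joint bent at a right angle, and `relative_position(a, b, axis=2)` compares two joints along the Z-axis.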

## Deterministic Supervision

Unlike VQA datasets that rely on manual annotation, HandVQA computes every label directly from the geometry of the 3D hand joints. Because the labels are derived deterministically, they are exactly consistent with the underlying 3D annotations and introduce no additional human labeling noise into model evaluation.
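To illustrate what deterministic supervision means in practice, here is a small sketch that maps a computed joint angle to a multiple-choice question. The bin boundaries, the answer wording, the question template, and the function names are all hypothetical and chosen only for illustration; the source does not specify HandVQA's actual thresholds or phrasing.

```python
# Hypothetical bins as (upper bound in degrees, label); not taken from HandVQA.
ANGLE_BINS = [
    (60.0, "strongly bent"),
    (120.0, "moderately bent"),
    (180.0, "nearly straight"),
]


def angle_to_label(angle_deg: float) -> str:
    """Map an angle in [0, 180] to the first bin whose upper bound covers it."""
    if not 0.0 <= angle_deg <= 180.0:
        raise ValueError("angle must lie in [0, 180] degrees")
    for upper, label in ANGLE_BINS:
        if angle_deg <= upper:
            return label
    raise AssertionError("unreachable: the last bin covers 180 degrees")


def make_angle_question(joint_name: str, angle_deg: float) -> dict:
    """Build a multiple-choice item whose answer follows purely from geometry.

    No randomness and no human judgment is involved, so the same joint
    coordinates always yield the same question and the same answer key.
    """
    labels = [label for _, label in ANGLE_BINS]
    letters = "ABC"
    answer = angle_to_label(angle_deg)
    return {
        "question": f"How bent is the {joint_name}?",
        "choices": dict(zip(letters, labels)),
        "answer": letters[labels.index(answer)],
    }
```

Calling `make_angle_question("index finger PIP joint", 170.0)` twice returns identical items, which is the property that removes annotation noise from evaluation.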

## Dataset Statistics and Features

HandVQA contains over 1.6 million VQA samples, making it the largest hand-specific VQA dataset to date. Its main features include:

- **Scale**: 1.6M+ VQA samples
- **Format**: JSONL annotation files plus compressed image archives
- **Supervision Type**: Deterministic labels based on 3D joint geometry
- **Question Types**: Angle, distance, relative positions along X/Y/Z axes

The dataset is large enough to support model training as well as evaluation, and its five question types cover angular, metric, and per-axis positional aspects of hand spatial understanding.
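Since the annotations ship as JSONL (one JSON object per line), a first sanity check is to count samples per question type. The sketch below assumes each record carries a `question_type` field; that field name and the sample records are hypothetical, so check the released files for the actual schema.

```python
import json
from collections import Counter
from typing import Iterable


def count_question_types(lines: Iterable[str], key: str = "question_type") -> Counter:
    """Count records per question type in an iterable of JSONL lines.

    `key` is a hypothetical field name; blank lines are skipped.
    Accepts any iterable of strings, including an open text file.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        counts[record[key]] += 1
    return counts


if __name__ == "__main__":
    # Made-up records standing in for a real annotation file.
    sample = [
        '{"image": "a.jpg", "question_type": "angle", "answer": "B"}',
        '{"image": "a.jpg", "question_type": "distance", "answer": "A"}',
        '{"image": "b.jpg", "question_type": "angle", "answer": "C"}',
    ]
    for qtype, n in sorted(count_question_types(sample).items()):
        print(qtype, n)
```

With a real file, passing the handle from `open(path, encoding="utf-8")` streams the records line by line, which keeps memory use flat even at 1.6M+ samples.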

## Research Findings and Insights

Benchmarking VLMs on HandVQA led the research team to the following findings.

## Limitations of Existing VLMs

Even the most capable vision-language models perform poorly on fine-grained hand joint poses and precise geometric reasoning. This suggests that current mainstream architectures have structural weaknesses in handling fine-grained spatial relationships.