Zing Forum

Reading

BFMD: The First Full-Court Dense Badminton Dataset—Enabling AI to Understand the Tactical Intent of Every Shot

The research team from Nagoya Institute of Technology released BFMD, the first full-court dense badminton dataset, which includes 19 complete matches, 20 hours of video, and detailed annotations for 16,751 shot events. They also proposed a multi-modal shot description generation framework based on VideoMAE.

Sports video understanding · Badminton dataset · Multi-modal learning · Video caption generation · Computer vision · Action recognition · VideoMAE · Tactical analysis · Deep learning · Dataset construction
Published 2026-03-26 23:09 · Recent activity 2026-03-28 07:55 · Estimated read 6 min

Section 01

BFMD Dataset & Multi-modal Framework: Enabling AI to Understand Badminton Tactics

The teams from Nagoya Institute of Technology and Nagoya University released BFMD, the first full-court dense badminton dataset, which includes 19 complete professional matches (12 singles, 7 doubles), 20.32 hours of video, and detailed annotations for 16,751 shot events. They also proposed a multi-modal shot description generation framework based on VideoMAE, aiming to enable AI to generate accurate and tactically insightful shot descriptions from videos, thus advancing the field of sports video understanding.


Section 02

Background: Limitations of Existing Badminton Datasets

Existing badminton datasets have two major limitations:

1. Temporal fragmentation: only short clips are included, so match continuity is lost, context goes missing, and tactical analysis is limited;
2. Single modality: most provide only RGB video, lacking key information such as shuttlecock trajectory, player pose, and on-court positions.

In contrast, tennis and table tennis already have more comprehensive datasets (e.g., 3DTennisDS, THETIS, OpenTTGames), so the badminton field urgently needs a fully structured dataset.


Section 03

BFMD Dataset: Scale & Annotation System

BFMD's data is sourced from 19 top-tier events on the official BWF YouTube channel: 19 matches (12 singles, 7 doubles), 20.32 hours of video, 1,687 rallies, and 16,751 shots. A three-level annotation system is adopted:

1. Match segments (rallies, replays, Hawk-Eye replays);
2. Rally events (shots, shuttlecock landings, net touches);
3. Dense rally annotations (shot type, shuttlecock trajectory, player bounding boxes, pose keypoints, shot descriptions).

The annotation process is a human-machine collaboration: GPT-4.1 assists in generating initial drafts, then three annotators with over five years of experience each review and revise them, with iterative feedback between the two stages.
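The three annotation levels described above can be pictured as nested records. A minimal sketch in Python dataclasses follows; the field names and value types are illustrative assumptions, since the post does not reproduce the dataset's actual schema:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical schema: field names are assumptions, not BFMD's real format.

@dataclass
class ShotAnnotation:
    """Level 3: dense per-shot annotation inside a rally."""
    shot_type: str                                  # e.g. "smash", "net shot", "lift"
    trajectory: List[Tuple[float, float]]           # shuttlecock (x, y) per frame
    player_boxes: List[Tuple[float, float, float, float]]  # bounding boxes
    pose_keypoints: List[List[Tuple[float, float]]]  # keypoints per player
    description: str                                # natural-language shot description

@dataclass
class RallyEvent:
    """Level 2: events within a rally."""
    kind: str    # "shot" | "landing" | "net_touch"
    frame: int   # frame index in the match video

@dataclass
class MatchSegment:
    """Level 1: match-level segmentation."""
    kind: str          # "rally" | "replay" | "hawkeye_replay"
    start_frame: int
    end_frame: int
    events: List[RallyEvent] = field(default_factory=list)
    shots: List[ShotAnnotation] = field(default_factory=list)
```

Nesting the levels this way mirrors the hierarchy in the post: a segment contains rally events, and dense shot annotations hang off the rally segments.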


Section 04

Multi-modal Shot Description Framework

The framework is built on VideoMAE, with a semantic feedback mechanism as its core innovation. It consists of four components:

1. VideoMAE visual encoder + Token refiner (enhances feature interaction);
2. Multi-modal fusion module (encodes and fuses player positions, pose keypoints, and shuttlecock trajectory);
3. Transformer description decoder (autoregressively generates text);
4. Semantic feedback module (predicts semantic attributes in parallel and feeds them back to enhance the representation).

Training is multi-task: a description generation loss (cross-entropy) plus a semantic feedback loss (multi-label binary cross-entropy) with a weight coefficient of 0.1.
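The multi-task objective above can be sketched in plain Python. The post gives only the loss structure and the 0.1 weight; the function names and the per-step probability inputs below are illustrative assumptions, not the paper's implementation:

```python
import math

def token_cross_entropy(probs, target_idx):
    # Negative log-likelihood of the ground-truth token at one decoding step.
    return -math.log(probs[target_idx])

def multilabel_bce(attr_probs, attr_labels):
    # Mean binary cross-entropy over predicted semantic attributes
    # (the semantic feedback head's multi-label loss).
    terms = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
             for p, y in zip(attr_probs, attr_labels)]
    return sum(terms) / len(terms)

def total_loss(step_probs, targets, attr_probs, attr_labels, lam=0.1):
    # Caption loss averaged over decoding steps, plus the semantic
    # feedback loss weighted by lam = 0.1, as reported in the post.
    caption_loss = sum(token_cross_entropy(p, t)
                       for p, t in zip(step_probs, targets)) / len(targets)
    return caption_loss + lam * multilabel_bce(attr_probs, attr_labels)
```

In a real training loop both terms would be computed from logits with a framework such as PyTorch (`CrossEntropyLoss` and `BCEWithLogitsLoss`); the sketch only makes the weighting between the two objectives concrete.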


Section 05

Experimental Results: Validating Multi-modal Value

Comparative experiments show that the framework outperforms traditional visual captioning models (e.g., SoccerNet-Caption), pre-trained VLMs (e.g., Vid2Seq), and zero-shot large VLMs (e.g., Qwen2.5-VL). Ablations indicate that both the Token refiner and the semantic feedback module improve performance; among the multi-modal inputs, shuttlecock trajectory brings the largest gain, and combining all modalities yields the best results. In qualitative analysis, the model correctly identifies smashes, net shots, and other strokes, but tends to confuse visually similar ones such as lifts and net shots.


Section 06

Limitations & Future Directions

Current limitations:

1. Only singles data is used; doubles matches are not yet processed;
2. Reliance on manually annotated shot events;
3. Fine-grained shot types are easily confused.

Future directions:

1. Expand to full-court video understanding;
2. Optimize for real-time applications;
3. Cross-sport transfer (tennis, table tennis);
4. Develop interactive analysis tools.