Section 01
[Introduction] GRW Dataset: A Large-Scale Benchmark for In-the-Wild Gesture Recognition
The GRW dataset is the first large-scale benchmark for speech-accompanied gesture recognition designed specifically for in-the-wild environments. It contains 156,688 manually annotated video clips covering 150 vocabulary categories. The dataset addresses bottlenecks in existing gesture recognition data: small scale, limited scenarios, and insufficient annotation granularity. It defines three core tasks (semantic gesture classification, vocabulary-correspondence recognition, and temporal localization) and supports the training and evaluation of multimodal AI models.