Zing Forum

Application of Vision-Language Models in Gait Screening: Zero-Shot and Multimodal In-Context Learning

The Vera Research team has open-sourced the research code and dataset for a study of vision-language models (VLMs) in gait classification screening. The study explores zero-shot learning and multimodal in-context learning (ICL) for detecting Parkinson's disease and knee osteoarthritis, and finds that multimodal ICL can substantially narrow the performance gap with dedicated video encoders.

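Tags: vision-language models, gait analysis, medical screening, Parkinson's disease, knee osteoarthritis, multimodal learning, in-context learning, zero-shot learning, V-JEPA, SigLIP
Published 2026-06-10 16:10 | Recent activity 2026-06-10 16:23 | Estimated read 8 min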

Section 01

Application of Vision-Language Models in Gait Screening: Guide to Zero-Shot and Multimodal In-Context Learning

The Vera Research team open-sourced the research code and dataset for evaluating vision-language models (VLMs) on gait classification screening, covering zero-shot learning and multimodal in-context learning (ICL) for detecting Parkinson's disease and knee osteoarthritis. Core conclusion: zero-shot VLMs perform poorly (best macro-averaged F1 of 0.360), but similarity-guided multimodal ICL reaches 0.771, close to the 0.791 of the dedicated V-JEPA 2 video-encoder baseline. The study offers insights for applying general-purpose AI models in specialized medical fields.

Section 02

Research Background and Motivation

Gait analysis is an important tool for early screening of neurodegenerative diseases (e.g., Parkinson's disease) and musculoskeletal diseases (e.g., knee osteoarthritis). However, traditional methods rely on professional assessment and expensive equipment, which limits large-scale use. In recent years, vision-language models (VLMs) have demonstrated strong zero-shot and multimodal capabilities. This study examines how well they perform in medical gait analysis and whether they could replace or assist traditional methods.

Section 03

Research Objectives and Dataset

Classification Tasks

The study focuses on a three-class gait classification task: normal gait, Parkinson's disease gait, and knee osteoarthritis gait.

Dataset

The study uses the public KOA-PD-NM dataset with a subject-exclusive split, so that no subject appears in both sets and identity leakage is prevented:

Dataset Split | Knee Osteoarthritis (KOA) | Normal    | Parkinson's Disease (PD) | Total
Support Set   | 8 people                  | 4 people  | 2 people                 | 14 people
Test Set      | 42 people                 | 26 people | 14 people                | 82 people

This ensures that the model faces unseen subjects during testing, which is closer to real-world scenarios.
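
A subject-exclusive split can be sketched as follows. This is a minimal illustration and not the project's actual code: the `(subject_id, label, clip_path)` tuple layout, the subject IDs, and the file names are all hypothetical.

```python
def subject_exclusive_split(clips, support_subjects):
    """Split clips so that no subject appears in both sets.

    clips: list of (subject_id, label, clip_path) tuples (hypothetical layout).
    support_subjects: set of subject IDs reserved for the support set.
    """
    support, test = [], []
    for clip in clips:
        subject_id = clip[0]
        # Every clip of a given subject goes to the same side of the split.
        (support if subject_id in support_subjects else test).append(clip)
    return support, test


def subjects_overlap(support, test):
    """Return the subject IDs present in both splits (should be empty)."""
    return {c[0] for c in support} & {c[0] for c in test}


# Hypothetical toy data: subject s01 contributes two clips.
clips = [
    ("s01", "KOA", "s01_a.mp4"),
    ("s01", "KOA", "s01_b.mp4"),
    ("s02", "PD", "s02_a.mp4"),
    ("s03", "NM", "s03_a.mp4"),
]
support, test = subject_exclusive_split(clips, support_subjects={"s01"})
leaked = subjects_overlap(support, test)
```

Splitting by subject rather than by clip matters because several clips of one person share body shape, clothing, and background, which a model could otherwise use as a shortcut.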

Section 04

Experimental Models and Methods

Evaluated Vision-Language Models

Model            | Type          | Scale             | Access Method
Gemma 4          | Open-source   | E2B / E4B / 31B   | Local execution
Qwen3-VL         | Open-source   | 8B / 32B          | Local execution
Gemini 2.5 Flash | Closed-source | -                 | API call

Baseline Comparison

V-JEPA 2 + kNN (self-supervised video encoder + k-nearest neighbor classifier)
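
The classifier half of this baseline can be illustrated with a small k-nearest-neighbor sketch. It assumes the video encoder has already produced one embedding vector per video; the cosine metric, `k=3`, and the toy 2-D vectors are assumptions for illustration, since the source does not state the baseline's settings.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_predict(query, labelled, k=3):
    """Majority vote over the k labelled embeddings most similar to the query.

    labelled: list of (label, vector) pairs.
    """
    ranked = sorted(
        labelled,
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    votes = Counter(label for label, _ in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy 2-D embeddings standing in for real video-encoder outputs.
labelled = [
    ("KOA", [1.0, 0.0]),
    ("KOA", [0.9, 0.1]),
    ("PD", [0.0, 1.0]),
    ("NM", [0.6, 0.6]),
]
prediction = knn_predict([1.0, 0.05], labelled, k=3)
```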

Four-Level Prompt Strategy

Level | Name                     | Description
L0    | Direct Classification    | Return the label directly
L1    | Classify After Description | First give a free description, then return the label
L2    | Structured Gait Analysis | Analyze six gait features, then return the label
L3    | Multimodal ICL           | Classify after using two similarity-guided support samples
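
To make the structural difference between the four levels concrete, here is a hypothetical prompt builder. The wording of every prompt is a placeholder and not the study's actual prompt text, and the six gait features are not named in this summary, so the caller passes them in through the `features` argument.

```python
LABELS = ["normal", "Parkinson's disease", "knee osteoarthritis"]


def build_prompt(level, labels=LABELS, features=(), support_labels=()):
    """Assemble an illustrative text prompt for one of the four levels.

    All wording here is placeholder text, not the study's real prompts.
    """
    options = ", ".join(labels)
    answer = f"Answer with exactly one label from: {options}."
    if level == "L0":
        # Direct classification: ask for the label only.
        return answer
    if level == "L1":
        # Free description first, then the label.
        return "Describe the gait in the video in free text. " + answer
    if level == "L2":
        # Structured analysis over caller-supplied gait features.
        listed = "; ".join(features)
        return f"Analyze these gait features in turn: {listed}. " + answer
    if level == "L3":
        # In-context examples: each support video is paired with its label.
        shots = " ".join(
            f"Example video {i} shows: {lab}."
            for i, lab in enumerate(support_labels, 1)
        )
        return f"{shots} Now classify the final video. " + answer
    raise ValueError(f"unknown prompt level: {level}")


l3_prompt = build_prompt("L3", support_labels=["knee osteoarthritis", "normal"])
```

In a real pipeline the L3 text would accompany the two support videos and the test video in a single multimodal request; only the text side is sketched here.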

Multimodal ICL Mechanism

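  1. SigLIP 2 extracts frame embeddings from the test and support videos
  2. Compute the cosine similarity between the test video and each support video
  3. Select the top-2 most similar support samples as context
  4. Pass the context and the test video to the VLM for classification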

Similarity guidance ensures the context is visually relevant to the test sample.
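
The retrieval steps above can be sketched in a few lines. This is an assumption-laden illustration: the toy 2-D vectors stand in for real SigLIP 2 frame embeddings, and mean pooling of frames into one video-level vector is an assumed aggregation that the source does not specify.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_pool(frame_embeddings):
    """Average per-frame embeddings into one video-level vector (assumed pooling)."""
    n = len(frame_embeddings)
    return [sum(column) / n for column in zip(*frame_embeddings)]


def select_support(test_vec, support, k=2):
    """Return the k support items most similar to the test vector.

    support: list of (video_id, label, vector) tuples (hypothetical layout).
    """
    ranked = sorted(
        support,
        key=lambda item: cosine_similarity(test_vec, item[2]),
        reverse=True,
    )
    return ranked[:k]


# Toy support set with one video-level vector per support video.
support = [
    ("sup_a", "KOA", [1.0, 0.0]),
    ("sup_b", "PD", [0.0, 1.0]),
    ("sup_c", "NM", [0.7, 0.7]),
]
test_vec = mean_pool([[1.0, 0.2], [0.8, 0.0]])  # two toy frame embeddings
top2 = select_support(test_vec, support, k=2)
```

The two selected items, with their labels, would then be placed in the VLM prompt ahead of the test video.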

Section 05

Key Research Findings

Finding 1: Zero-Shot VLMs Perform Poorly

The best zero-shot macro-averaged F1 score is only 0.360. VLMs struggle to identify gait abnormalities without domain examples, which underscores how much specialized knowledge the medical domain demands.

Finding 2: Multimodal ICL Significantly Improves Performance

The macro-averaged F1 score of multimodal ICL reaches 0.771, leaving a gap of only 0.020 to the V-JEPA 2 baseline (0.791). General-purpose VLMs can therefore approach the performance of dedicated models.
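
For readers unfamiliar with the metric, macro-averaged F1 is the unweighted mean of the per-class F1 scores, so each of the three classes counts equally regardless of how many subjects it has. The sketch below uses the standard definition; the toy labels are invented for illustration and are not study data.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no support and no
        # predictions scores 0.0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Toy example: one KOA clip is misclassified as NM.
y_true = ["KOA", "KOA", "NM", "PD"]
y_pred = ["KOA", "NM", "NM", "PD"]
score = macro_f1(y_true, y_pred, ["KOA", "NM", "PD"])
```

In this toy case the per-class scores are 2/3 (KOA), 2/3 (NM), and 1 (PD), giving a macro-averaged F1 of 7/9.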

Finding 3: Visual Examples Are the Dominant Factor

Visual support samples have the greatest impact on performance. Prompt structure and model scale have smaller effects, and those effects vary by model family.

Section 06

Research Significance and Application Prospects

Medical Screening Field

  • Reduce equipment threshold: Ordinary cameras can be used for analysis
  • Improve accessibility: Cloud API supports remote areas
  • Assist diagnosis: Enhance screening efficiency and consistency

Reference for Multimodal Learning

The study provides a methodology for medical image analysis and confirms the value of similarity-guided sample selection.

Deployment Recommendations

  1. Do not rely on pure zero-shot methods
  2. Establish a high-quality support sample library
  3. Optimize retrieval with visual similarity
  4. Prioritize open-source models (Gemma 4 / Qwen3-VL)

Research Significance

The study provides insights for applying general-purpose AI in specialized medical fields: domain examples, supplied through the prompt as in the L3 strategy, matter more than model scale.

Section 07

Limitations and Future Directions

Current Limitations

  • The dataset is small, so generalization ability still needs verification
  • Only three gait types are covered, while clinical scenarios are more complex
  • The impact of video length has not been explored in depth

Future Directions

  • Expand the dataset to more gait types and subjects
  • Explore the ability to quantify gait features (step length/step frequency)
  • Study the feasibility of real-time gait monitoring
  • Integrate wearable sensors and video data to improve accuracy