# Application of Vision-Language Models in Gait Screening: Zero-Shot and Multimodal In-Context Learning

> The Vera Research team has open-sourced the research code and dataset for a study of vision-language models (VLMs) in gait classification screening. The study applies zero-shot learning and multimodal in-context learning (ICL) to the detection of Parkinson's disease and knee osteoarthritis, and finds that multimodal ICL substantially narrows the performance gap with a dedicated video encoder.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-10T08:10:00.000Z
- Last activity: 2026-06-10T08:23:01.772Z
- Popularity: 154.8
- Keywords: vision-language models, gait analysis, medical screening, Parkinson's disease, knee osteoarthritis, multimodal learning, in-context learning, zero-shot learning, V-JEPA, SigLIP
- Page link: https://www.zingnex.cn/en/forum/thread/llm-github-vera-research-vlm-gait-screening
- Canonical: https://www.zingnex.cn/forum/thread/llm-github-vera-research-vlm-gait-screening

---

## Overview

The study evaluates vision-language models (VLMs) for gait classification screening, using zero-shot prompting and multimodal in-context learning (ICL) to detect Parkinson's disease and knee osteoarthritis. Core conclusion: zero-shot VLMs perform poorly (best macro-F1 of 0.360), but similarity-guided multimodal ICL reaches a macro-F1 of 0.771, close to the 0.791 achieved by a dedicated video encoder baseline. The result is a useful reference for applying general-purpose AI models in specialized medical fields.

## Research Background and Motivation

Gait analysis is an important tool for early screening of neurodegenerative diseases (e.g., Parkinson's disease) and musculoskeletal diseases (e.g., knee osteoarthritis). However, traditional methods rely on professional assessment and expensive equipment, limiting large-scale application. In recent years, VLMs have demonstrated strong zero-shot and multimodal capabilities. This study aims to explore their performance in medical gait analysis and the possibility of replacing or assisting traditional methods.

## Research Objectives and Dataset

### Classification Tasks
The study focuses on a three-class gait classification task: normal gait, Parkinson's disease (PD) gait, and knee osteoarthritis (KOA) gait.

### Dataset
The study uses the public KOA-PD-NM dataset with a subject-exclusive split: no subject appears in both the support set and the test set, which prevents identity leakage.

| Dataset Split | Knee Osteoarthritis (KOA) | Normal    | Parkinson's Disease (PD) | Total     |
|---------------|---------------------------|-----------|--------------------------|-----------|
| Support Set   | 8 people                  | 4 people  | 2 people                 | 14 people |
| Test Set      | 42 people                 | 26 people | 14 people                | 82 people |

This ensures that the model faces unseen subjects during testing, which is closer to real-world scenarios.
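A subject-exclusive split can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function name `subject_exclusive_split`, the `(subject_id, label, clip_path)` record layout, and the two-clips-per-subject toy data are all assumptions. Only the per-class subject counts come from the table above.

```python
import random
from collections import defaultdict


def subject_exclusive_split(clips, support_per_class, seed=0):
    """Split clips so that every subject lands in exactly one of the two sets.

    clips: list of (subject_id, label, clip_path) tuples (hypothetical layout).
    support_per_class: dict mapping label -> number of support subjects.
    """
    # Group subject IDs by class; assumes each subject has a single label.
    subjects_by_label = defaultdict(set)
    for subject_id, label, _clip_path in clips:
        subjects_by_label[label].add(subject_id)

    rng = random.Random(seed)
    support_subjects = set()
    for label in sorted(subjects_by_label):
        ordered = sorted(subjects_by_label[label])  # deterministic before shuffling
        rng.shuffle(ordered)
        support_subjects.update(ordered[: support_per_class.get(label, 0)])

    # Assign whole subjects, never individual clips, to one side of the split.
    support = [clip for clip in clips if clip[0] in support_subjects]
    test = [clip for clip in clips if clip[0] not in support_subjects]
    return support, test


# Toy data: subject counts follow the table; two clips per subject is made up.
counts = {"KOA": 50, "NM": 30, "PD": 16}
clips = [
    (f"{label}_{i:03d}", label, f"{label}_{i:03d}_clip{c}.mp4")
    for label, n in counts.items()
    for i in range(n)
    for c in range(2)
]
support, test = subject_exclusive_split(clips, {"KOA": 8, "NM": 4, "PD": 2})
support_ids = {clip[0] for clip in support}
test_ids = {clip[0] for clip in test}
print(len(support_ids), len(test_ids), support_ids.isdisjoint(test_ids))
# prints: 14 82 True
```

Splitting by subject rather than by clip is the key design choice: a clip-level random split could place two recordings of the same person on both sides and inflate test scores.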

## Experimental Models and Methods

### Evaluated Vision-Language Models
| Model | Type | Scale | Access Method |
|-------|------|-------|---------------|
| Gemma 4 | Open-source | E2B / E4B / 31B | Local execution |
| Qwen3-VL | Open-source | 8B / 32B | Local execution |
| Gemini 2.5 Flash | Closed-source | - | API call |

### Baseline Comparison
The baseline is V-JEPA 2 + kNN: a self-supervised video encoder (V-JEPA 2) whose embeddings are classified with a k-nearest-neighbor classifier.
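To make the baseline concrete, here is a minimal kNN sketch over precomputed embeddings. The article does not state the value of k or the distance metric, so `k=3` and Euclidean distance are assumptions, and the 2-D toy vectors merely stand in for real V-JEPA 2 video embeddings.

```python
import math
from collections import Counter


def knn_predict(query, support_embeddings, support_labels, k=3):
    """Majority vote among the k support embeddings closest to the query."""
    # Rank support indices by Euclidean distance to the query (nearest first).
    ranked = sorted(
        range(len(support_embeddings)),
        key=lambda i: math.dist(query, support_embeddings[i]),
    )
    votes = Counter(support_labels[i] for i in ranked[:k])
    # most_common keeps first-seen order among equal counts, so ties go to
    # the label whose nearest neighbor is closest.
    return votes.most_common(1)[0][0]


# Toy 2-D vectors standing in for real video embeddings.
support_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.7, 0.7]]
support_labels = ["PD", "PD", "KOA", "KOA", "NM"]
print(knn_predict([1.0, 0.05], support_embeddings, support_labels))  # PD
print(knn_predict([0.0, 1.0], support_embeddings, support_labels))   # KOA
```

The classifier has no trainable parameters, so any performance comes from the quality of the encoder's embeddings.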

### Four-Level Prompt Strategy
| Level | Name | Description |
|-------|------|-------------|
| L0 | Direct Classification | Return the label directly |
| L1 | Classify After Description | Give a free-text description first, then return the label |
| L2 | Structured Gait Analysis | Analyze six gait features, then return the label |
| L3 | Multimodal ICL | Classify with two similarity-guided support samples provided as context |
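The four levels can be pictured as a small prompt builder. All prompt wording below is hypothetical and not taken from the study. The article does not name the six gait features used at L2, so the sketch takes them as a caller-supplied argument instead of guessing them. For L3, the support videos themselves would be attached to the model request alongside this text.

```python
LABELS = ("normal", "Parkinson's disease", "knee osteoarthritis")


def build_prompt(level, gait_features=(), support_labels=()):
    """Return illustrative prompt text for prompt levels L0 to L3."""
    answer = "Answer with exactly one label from: " + ", ".join(LABELS) + "."
    if level == "L0":
        # Direct classification: ask for the label only.
        return f"Classify the gait in the video. {answer}"
    if level == "L1":
        # Free description first, then the label.
        return f"Describe the gait in the video in free text, then classify it. {answer}"
    if level == "L2":
        # Structured analysis over caller-supplied gait features.
        features = "; ".join(gait_features)
        return f"Analyze each of these gait features: {features}. Then classify the gait. {answer}"
    if level == "L3":
        # In-context learning: labeled support examples precede the query.
        examples = " ".join(
            f"Example video {i} shows a gait labeled '{label}'."
            for i, label in enumerate(support_labels, start=1)
        )
        return f"{examples} Now classify the gait in the final video. {answer}"
    raise ValueError(f"unknown prompt level: {level}")


print(build_prompt("L0"))
print(build_prompt("L3", support_labels=["Parkinson's disease", "normal"]))
```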

### Multimodal ICL Mechanism
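1. SigLIP 2 extracts frame embeddings from the test video and from each support video
2. Compute the cosine similarity between the test video and each support video
3. Select the top-2 most similar support samples as context
4. Pass the selected samples and the test video to the VLM for classification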

Similarity guidance ensures the context is visually relevant to the test sample.
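The selection steps above can be sketched as follows, assuming the frame embeddings have already been extracted. The article does not say how frame embeddings are aggregated into a per-video representation, so mean pooling is an assumption here. The function names and toy 2-D vectors are illustrative only.

```python
import math


def mean_pool(frame_embeddings):
    """Average frame embeddings into one video-level vector (assumed aggregation)."""
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(frame[d] for frame in frame_embeddings) / n for d in range(dim)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_support(test_frames, support_videos, top_k=2):
    """Return the top_k (score, video_id, label) entries most similar to the test video.

    support_videos: list of (video_id, label, frame_embeddings) tuples.
    """
    query = mean_pool(test_frames)
    scored = [
        (cosine_similarity(query, mean_pool(frames)), video_id, label)
        for video_id, label, frames in support_videos
    ]
    scored.sort(key=lambda item: item[0], reverse=True)  # most similar first
    return scored[:top_k]


# Toy 2-D "embeddings" standing in for SigLIP 2 frame features.
test_frames = [[1.0, 0.0], [0.8, 0.2]]
support_videos = [
    ("A", "PD", [[1.0, 0.0], [1.0, 0.0]]),
    ("B", "KOA", [[0.0, 1.0]]),
    ("C", "normal", [[1.0, 1.0]]),
]
selected = select_support(test_frames, support_videos)
print([(video_id, label) for _score, video_id, label in selected])
# prints: [('A', 'PD'), ('C', 'normal')]
```

The selected videos and their labels would then be placed in the VLM prompt ahead of the test video.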

## Key Research Findings

### Finding 1: Zero-Shot VLMs Perform Poorly
The best zero-shot macro-averaged F1 score is only 0.360. Without domain examples, VLMs struggle to identify gait abnormalities, which underscores how much specialized knowledge clinical gait assessment requires.
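For readers unfamiliar with the metric, macro-averaged F1 is the unweighted mean of the per-class F1 scores, so each class counts equally even though the test set is imbalanced (42 KOA, 26 normal, 14 PD subjects). A minimal sketch with made-up predictions (the labels and predictions below are not from the study):

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up predictions for six test samples.
y_true = ["KOA", "KOA", "NM", "NM", "PD", "PD"]
y_pred = ["KOA", "NM", "NM", "NM", "PD", "KOA"]
print(round(macro_f1(y_true, y_pred, ["KOA", "NM", "PD"]), 3))
# prints: 0.656  (per-class F1: KOA 0.5, NM 0.8, PD 0.667)
```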

### Finding 2: Multimodal ICL Significantly Improves Performance
With multimodal ICL, the macro-averaged F1 score reaches 0.771, leaving a gap of only 0.020 to the V-JEPA 2 + kNN baseline (0.791). Given suitable visual context, general-purpose VLMs can approach the performance of a dedicated video encoder.

### Finding 3: Visual Examples Are the Dominant Factor
Visual support samples have the greatest impact on performance. Other factors, such as prompt structure and model scale, have smaller effects that vary by model family.

## Research Significance and Application Prospects

### Medical Screening Field
- Lower the equipment barrier: ordinary cameras can be used for analysis
- Improve accessibility: cloud APIs can serve remote areas
- Assist diagnosis: improve screening efficiency and consistency

### Reference for Multimodal Learning
The study offers a methodology that can be reused in medical image analysis and confirms the value of similarity-guided sample selection.

### Deployment Recommendations
1. Do not rely on pure zero-shot methods
2. Establish a high-quality support sample library
3. Retrieve support samples by visual similarity
4. Prioritize open-source models (Gemma 4 / Qwen3-VL)

### Research Significance
The study offers a lesson for applying general-purpose AI in specialized medical fields: domain-specific visual examples matter more than prompt structure or model scale.

## Limitations and Future Directions

### Current Limitations
- The dataset is small, so generalization still needs to be verified
- Only three gait types are covered, while clinical scenarios are more complex
- The impact of video length is not explored in depth

### Future Directions
- Expand the dataset to more gait types and subjects
- Explore the ability to quantify gait features (step length/step frequency)
- Study the feasibility of real-time gait monitoring
- Integrate wearable sensors and video data to improve accuracy