Application of Vision-Language Models in Gait Screening: Zero-Shot and Multimodal Context Learning

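Tags: vision-language models, gait analysis, medical screening, Parkinson's disease, knee osteoarthritis, multimodal learning, in-context learning, zero-shot learning, V-JEPA, SigLIP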

Published 2026-06-10 16:10 | Recent activity 2026-06-10 16:23 | Estimated read 8 min

Section 01

Application of Vision-Language Models in Gait Screening: Guide to Zero-Shot and Multimodal Context Learning

The Vera Research team has open-sourced the research code and dataset for a study of vision-language models in gait classification screening, examining zero-shot learning and multimodal in-context learning (ICL) for the detection of Parkinson's disease and knee osteoarthritis. Core conclusion: zero-shot vision-language models perform poorly (best macro-averaged F1 of 0.360), but similarity-guided multimodal ICL reaches 0.771, close to the 0.791 of a dedicated video encoder baseline. The study offers practical insight into applying general-purpose AI models in specialized medical fields.

Section 02

Research Background and Motivation

Gait analysis is an important tool for early screening of neurodegenerative diseases (e.g., Parkinson's disease) and musculoskeletal diseases (e.g., knee osteoarthritis). However, traditional methods rely on professional assessment and expensive equipment, which limits large-scale use. In recent years, vision-language models (VLMs) have demonstrated strong zero-shot and multimodal capabilities. This study examines how well they perform in medical gait analysis and whether they could replace or assist traditional methods.

Section 03

Research Objectives and Dataset

Classification Tasks

The study addresses a three-class gait classification task: normal gait, Parkinson's disease gait, and knee osteoarthritis gait.

Dataset

The study uses the public KOA-PD-NM dataset with a subject-exclusive split, so that no subject appears in both sets and identity leakage is prevented:

Dataset Split	Knee Osteoarthritis (KOA)	Normal	Parkinson's Disease (PD)	Total
Support Set	8 people	4 people	2 people	14 people
Test Set	42 people	26 people	14 people	82 people

This ensures that the model faces unseen subjects during testing, which is closer to real-world scenarios.
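A subject-exclusive split can be sketched as follows. This is a minimal illustration, not the authors' code: the `(clip_id, subject_id, label)` tuple layout, the function name, the random selection of support subjects, and the two clips per subject in the demo are all assumptions. Only the per-class subject counts come from the table above.

```python
import random
from collections import defaultdict

def subject_exclusive_split(clips, support_per_class, seed=0):
    """Split clips so that every subject lands in exactly one set.

    clips: list of (clip_id, subject_id, label) tuples (hypothetical layout).
    support_per_class: number of support subjects to draw for each label.
    """
    subjects_by_label = defaultdict(set)
    for _, subject_id, label in clips:
        subjects_by_label[label].add(subject_id)

    rng = random.Random(seed)
    support_subjects = set()
    for label, subjects in subjects_by_label.items():
        ordered = sorted(subjects)  # sort first so the shuffle is reproducible
        rng.shuffle(ordered)
        support_subjects.update(ordered[: support_per_class[label]])

    # Assign whole subjects, never individual clips, to a set.
    support = [c for c in clips if c[1] in support_subjects]
    test = [c for c in clips if c[1] not in support_subjects]
    return support, test

# Synthetic subjects matching the table totals: 50 KOA, 30 normal, 16 PD.
clips = []
for label, n_subjects in {"KOA": 50, "NM": 30, "PD": 16}.items():
    for i in range(n_subjects):
        for walk in range(2):  # two clips per subject, purely illustrative
            clips.append((f"{label}_{i:02d}_w{walk}", f"{label}_{i:02d}", label))

support, test = subject_exclusive_split(clips, {"KOA": 8, "NM": 4, "PD": 2})
support_ids = {subject for _, subject, _ in support}
test_ids = {subject for _, subject, _ in test}
print(len(support_ids), len(test_ids), support_ids & test_ids)  # 14 82 set()
```

Splitting by subject rather than by clip is what keeps a model from scoring well simply by recognizing a person it has already seen.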

Section 04

Experimental Models and Methods

Evaluated Vision-Language Models

Model	Type	Scale	Access Method
Gemma 4	Open-source	E2B / E4B / 31B	Local execution
Qwen3-VL	Open-source	8B / 32B	Local execution
Gemini 2.5 Flash	Closed-source	-	API call

Baseline Comparison

V-JEPA 2 + kNN (self-supervised video encoder + k-nearest neighbor classifier)
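The classifier half of this baseline can be sketched in a few lines. The sketch assumes video embeddings have already been computed by the encoder and are available as plain lists of floats. The Euclidean metric, the value of `k`, and the function name are assumptions, since the source does not state them.

```python
import math
from collections import Counter

def knn_predict(query, support_embeddings, support_labels, k=3):
    """Majority vote among the k support embeddings closest to the query."""
    # Euclidean distance is an assumption; the source does not name the metric.
    nearest = sorted(
        range(len(support_embeddings)),
        key=lambda i: math.dist(query, support_embeddings[i]),
    )[:k]
    votes = Counter(support_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D vectors standing in for precomputed video embeddings.
support_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
support_labels = ["PD", "PD", "NM", "NM"]
print(knn_predict([1.0, 0.05], support_embeddings, support_labels))  # PD
```

Because kNN has no trainable parameters, the baseline's quality depends almost entirely on how well the encoder separates the classes in embedding space.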

Four-Level Prompt Strategy

Level	Name	Description
L0	Direct Classification	Return the label directly
L1	Classify After Description	Give a free-form description first, then return the label
L2	Structured Gait Analysis	Analyze six gait features, then return the label
L3	Multimodal ICL	Classify after seeing two support samples selected by visual similarity
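The four levels in the table can be pictured as one prompt builder with increasing amounts of scaffolding. All wording below is hypothetical and is not the prompt text used in the study. The six gait features are not listed in the source, so the sketch takes them as a caller-supplied argument, and `<video ...>` is only a placeholder for wherever a real client would attach the media.

```python
def build_prompt(level, labels, features=None, examples=None):
    """Return instruction text for one prompt level (illustrative wording only)."""
    answer = f"Answer with exactly one label from: {', '.join(labels)}."
    if level == "L0":
        # Direct classification: ask for the label and nothing else.
        return f"Classify the gait in the video. {answer}"
    if level == "L1":
        # Free description first, label second.
        return f"Describe the gait in the video in your own words. {answer}"
    if level == "L2":
        # Structured analysis over a caller-supplied feature list.
        listed = "; ".join(features)
        return f"Assess each of these gait features: {listed}. {answer}"
    if level == "L3":
        # In-context examples: (video_id, label) pairs chosen by retrieval.
        shots = "\n".join(
            f"Example {i}: <video {video_id}> label: {label}"
            for i, (video_id, label) in enumerate(examples, start=1)
        )
        return f"{shots}\nNow classify the new video. {answer}"
    raise ValueError(f"unknown prompt level: {level}")

labels = ["KOA", "NM", "PD"]
print(build_prompt("L3", labels, examples=[("s07", "PD"), ("s02", "KOA")]))
```

Only L3 changes what the model sees rather than how it is asked, since it is the one level that adds labeled videos to the context.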

Multimodal ICL Mechanism

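SigLIP 2 extracts frame embeddings from the test video and from each support video
Cosine similarity is computed between the test video's embeddings and those of each support video
The two most similar support samples (Top-2) are selected as context
The selected samples and the test video are passed to the VLM for classification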

Similarity guidance ensures the context is visually relevant to the test sample.
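The retrieval step of the mechanism above can be sketched as follows, assuming the frame embeddings have already been extracted. Mean pooling of frame vectors into one vector per video is an assumption, because the source does not say how frame embeddings are aggregated. The function names and the `(video_id, label, frame_embeddings)` layout are also hypothetical.

```python
import math

def mean_pool(frame_embeddings):
    """Average per-frame vectors into a single video-level vector."""
    n = len(frame_embeddings)
    return [sum(column) / n for column in zip(*frame_embeddings)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def select_support(test_frames, support_videos, top_k=2):
    """Rank support videos by similarity to the test video and keep the top_k.

    support_videos: list of (video_id, label, frame_embeddings) tuples.
    Returns (similarity, video_id, label) tuples, most similar first.
    """
    query = mean_pool(test_frames)
    scored = [
        (cosine_similarity(query, mean_pool(frames)), video_id, label)
        for video_id, label, frames in support_videos
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]

# Toy 2-D frame embeddings; real embeddings have far more dimensions.
support_videos = [
    ("s1", "KOA", [[1.0, 0.0], [1.0, 0.0]]),
    ("s2", "PD", [[0.0, 1.0], [0.0, 1.0]]),
    ("s3", "NM", [[1.0, 1.0], [1.0, 1.0]]),
]
selected = select_support([[1.0, 0.2], [1.0, 0.0]], support_videos)
print([video_id for _, video_id, _ in selected])  # ['s1', 's3']
```

Nothing in this sketch forces the two selected samples to come from different classes, so both could carry the same label. Whether the study constrains the selection is not stated in the source.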

Section 05

Key Research Findings

Finding 1: Zero-Shot VLMs Perform Poorly

The best zero-shot macro-averaged F1 score is only 0.360. VLMs struggle to identify gait abnormalities without domain examples, which underscores how much specialized knowledge medical gait assessment requires.
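For readers unfamiliar with the metric, macro-averaged F1 is the unweighted mean of the per-class F1 scores. The sketch below computes it from scratch. The toy labels are invented for illustration and are not results from the study.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy predictions: per-class F1 is 0.5 (KOA), 2/3 (NM), 0.0 (PD).
y_true = ["KOA", "KOA", "NM", "PD"]
y_pred = ["KOA", "NM", "NM", "KOA"]
print(round(macro_f1(y_true, y_pred, ["KOA", "NM", "PD"]), 3))  # 0.389
```

Because every class counts equally, the 14 Parkinson's disease test subjects weigh as much in the score as the 42 knee osteoarthritis subjects, so a model cannot score well by favoring the largest class.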

Finding 2: Multimodal ICL Significantly Improves Performance

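Multimodal ICL reaches a macro-averaged F1 score of 0.771, within 0.020 of the V-JEPA 2 + kNN baseline (0.791) and more than double the best zero-shot score of 0.360. Given similarity-selected visual examples, general-purpose VLMs can approach the performance of a dedicated video encoder.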

Finding 3: Visual Examples Are the Dominant Factor

Visual support samples have the greatest impact on performance. Prompt structure and model scale have smaller effects, and those effects vary by model family.

Section 06

Research Significance and Application Prospects

Medical Screening Field

Reduce equipment threshold: Ordinary cameras can be used for analysis
Improve accessibility: Cloud API supports remote areas
Assist diagnosis: Enhance screening efficiency and consistency

Reference for Multimodal Learning

The study provides a methodology applicable to medical image analysis and confirms the value of similarity-guided sample selection.

Deployment Recommendations

Do not rely on pure zero-shot methods
Establish a high-quality support sample library
Optimize retrieval with visual similarity
Prioritize open-source models (Gemma 4 / Qwen3-VL)

Research Significance

The study offers a lesson for applying general-purpose AI in specialized medical fields: domain examples, and the prompt strategy that delivers them, matter more than model scale.

Section 07

Limitations and Future Directions

Current Limitations

The dataset is small, so generalization ability still needs to be verified
Only three gait types are covered, while clinical scenarios are more complex
The impact of video length was not explored in depth

Future Directions

Expand the dataset to more gait types and subjects
Explore the ability to quantify gait features (step length/step frequency)
Study the feasibility of real-time gait monitoring
Integrate wearable sensors and video data to improve accuracy
