Zing Forum


ActRep-R1: Solving the Video Repetitive Action Counting Challenge with Multimodal LLM Reinforcement Learning

ActRep-R1 is a post-training framework that adapts multimodal large language models to the video repetitive action counting task through structured reasoning and reinforcement learning, addressing the counting-accuracy limitations of traditional methods in complex scenarios.

Tags: Multimodal LLMs · Reinforcement Learning · Video Understanding · Action Counting · GRPO · Qwen-VL · Computer Vision · Deep Learning
Published 2026/05/12 15:55 · Last activity 2026/05/12 15:59 · Estimated reading time 7 minutes

Section 01

ActRep-R1: Solving Video Repetitive Action Counting with Multimodal LLMs & RL (Overview)

ActRep-R1 is an innovative post-training framework that addresses the challenges of video repetitive action counting (RAC) by combining structured reasoning and reinforcement learning (RL) to adapt multimodal large language models (MLLMs) to the task. It aims to improve counting accuracy in complex scenarios where traditional methods fall short, building on models from the Qwen-VL series. This post breaks down its background, technical approach, performance, and applications.


Section 02

Background: Challenges in Repetitive Action Counting

RAC has wide applications (fitness, industrial quality inspection, medical rehab) but faces key issues:

  1. Poor time modeling: Hard to capture long-term temporal dependencies in videos.
  2. Limited generalization: Unstable performance across varying angles, lighting, or action variants.
  3. Lack of interpretability: No clear reasoning behind count results.

While MLLMs show strong visual understanding, applying them to precise counting tasks remains an open problem; ActRep-R1 targets this gap.

Section 03

Core Design of ActRep-R1 Framework

ActRep-R1 (by Yicheng Qiu et al.) is an open-source framework based on Qwen-VL series models (Qwen2-VL, Qwen2.5-VL, Qwen3-VL). Its core ideas:

  • Structured reasoning: Instead of direct number output, it first generates structured analysis of video content.
  • RL integration: Uses reinforcement learning to enhance counting accuracy.

Key capabilities: explicit reasoning, temporal awareness (understanding action cycles), and self-verification (improving accuracy via validation steps).

Section 04

Three-Stage Training Pipeline

ActRep-R1's training process has three critical stages:

  1. CoT Data Generation: Builds training data with detailed reasoning steps (how to observe video, analyze cycles, handle boundaries) instead of just video + number labels.
  2. Supervised Fine-Tuning (SFT): Uses CoT data to teach the model to follow reasoning formats (action recognition/location, cycle boundary detection, temporal consistency check, final count).
  3. Group Relative Policy Optimization (GRPO): Innovative RL step without separate reward models—uses group relative comparison to estimate advantages. It also uses Random Count Sampling (RCS) to address count distribution imbalance (more small counts in data) by stratified sampling.
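The group-relative advantage estimation at the heart of GRPO can be sketched in a few lines. This is a generic illustration, not ActRep-R1's training code: each sampled response's reward is normalized against the mean and standard deviation of its own group, so no separate value or reward model is needed.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: normalize each sampled response's reward
    against the mean/std of its own group (no learned value model)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers for one video, scored by the reward function:
rewards = [1.0, 0.5, 0.0, 0.5]
advs = group_relative_advantages(rewards)
# Above-average answers get positive advantage, below-average negative.
```

Responses that beat their group's average are reinforced and the rest are suppressed, which is what lets GRPO skip the critic network entirely.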

Section 05

Hybrid Reward Function Design

The reward function combines two parts:

  1. Count Accuracy Reward: Considers not only exact matches but also "Off-By-One" tolerance (e.g., 19/21 for true count 20 gets partial reward) to avoid sparse rewards.
  2. Format Compliance Reward: Ensures the model's outputs follow the predefined reasoning structure (analysis steps, validation steps) to maintain interpretability.

This design balances numerical precision with reasoning-chain integrity.
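A minimal sketch of such a hybrid reward follows; the weights, partial-credit value, and output template here are illustrative assumptions, not the paper's actual constants:

```python
import re

def count_reward(pred: int, true: int) -> float:
    """Count accuracy with off-by-one tolerance.

    The 0.5 partial credit is an illustrative assumption."""
    diff = abs(pred - true)
    if diff == 0:
        return 1.0
    if diff == 1:  # e.g. a prediction of 19 or 21 for a true count of 20
        return 0.5
    return 0.0

def format_reward(response: str) -> float:
    """1.0 if the response follows the assumed reasoning template."""
    pattern = r"<think>.+</think>\s*<answer>\d+</answer>"
    return 1.0 if re.search(pattern, response, re.DOTALL) else 0.0

def hybrid_reward(response: str, pred: int, true: int,
                  w_count: float = 0.8, w_format: float = 0.2) -> float:
    """Weighted sum of count accuracy and format compliance (weights assumed)."""
    return w_count * count_reward(pred, true) + w_format * format_reward(response)
```

The off-by-one tolerance keeps the reward signal dense (near-misses still learn something), while the format term prevents the policy from collapsing into bare numeric answers.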

Section 06

Engineering Implementation & Best Practices

Toolchain:

  • Training: Supports DeepSpeed ZeRO-2/3, CPU Offload, multi-GPU evaluation.
  • Model support: Built-in Qwen series, modular for new VL models.
  • Evaluation: Local checkpoint assessment, API comparison (OpenAI/Gemini), metrics like OBO accuracy, exact match, MAE, RMSE.
  • Data: CSV/JSONL support, flexible preprocessing, configurable max pixels.
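The evaluation metrics listed above are standard in the RAC literature and easy to reproduce. A minimal sketch (note: RAC papers typically report MAE normalized by the true count, which is what this version computes):

```python
import math

def rac_metrics(preds: list[int], trues: list[int]) -> dict[str, float]:
    """Standard RAC metrics: OBO accuracy (|error| <= 1), exact-match
    accuracy, count-normalized MAE, and RMSE."""
    n = len(preds)
    obo = sum(abs(p - t) <= 1 for p, t in zip(preds, trues)) / n
    exact = sum(p == t for p, t in zip(preds, trues)) / n
    # RAC papers usually normalize the absolute error by the true count
    mae = sum(abs(p - t) / t for p, t in zip(preds, trues)) / n
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, trues)) / n)
    return {"OBO": obo, "ExactMatch": exact, "MAE": mae, "RMSE": rmse}

m = rac_metrics([20, 5, 10], [21, 5, 14])
```

OBO treats an off-by-one prediction as correct, mirroring the tolerance built into the training reward.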

Best Practices:

  • Reduce GPU memory: Set --max_pixels (e.g., 262144, roughly a 672×384 frame) for high-resolution videos to cut the visual token count from ~20K to ~2.5K.
  • Training stability: Avoid in-training generation-based evaluation (causes DeepSpeed ZeRO-3 hang; evaluate post-training).
  • Batch calculation: Equivalent batch size = GPU count × per-device batch × gradient accumulation steps.
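The batch arithmetic from the last bullet, as a one-liner:

```python
def equivalent_batch_size(num_gpus: int, per_device_batch: int,
                          grad_accum_steps: int) -> int:
    """Effective global batch = GPUs × per-device batch × gradient accumulation."""
    return num_gpus * per_device_batch * grad_accum_steps

# e.g. 8 GPUs × per-device batch 2 × 4 accumulation steps
print(equivalent_batch_size(8, 2, 4))  # → 64
```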

Section 07

Application Prospects & Academic Value

ActRep-R1's value:

  • Practical use: Validated on RepCount dataset, open-source with docs—ready for fitness, industrial, medical applications.
  • Academic contribution: Explores how to apply MLLM reasoning to fine-grained visual tasks (like RAC), setting a "think first, output later" paradigm for other tasks (object counting, motion analysis, quality assessment). It retains MLLMs' generalization while achieving professional counting precision, making it a valuable resource for video understanding and RL in vision tasks.