
ActRep-R1: Solving the Challenge of Video Repetitive Action Counting with Multimodal Large Models and Reinforcement Learning

ActRep-R1 is a post-training framework that adapts multimodal large language models to the video repetitive action counting task via structured reasoning and reinforcement learning, addressing the counting accuracy issues of traditional methods in complex scenarios.

Tags: Multimodal LLMs · Reinforcement Learning · Video Understanding · Action Counting · GRPO · Qwen-VL · Computer Vision · Deep Learning
Published 2026-05-12 15:55 · Recent activity 2026-05-12 15:59 · Estimated read: 7 min

Section 01

ActRep-R1: Solving Video Repetitive Action Counting with Multimodal LLMs & RL (Introduction)

ActRep-R1 is an innovative post-training framework that addresses the challenges of video repetitive action counting (RAC) by combining structured reasoning and reinforcement learning (RL) to adapt multimodal large language models (MLLMs) to the task. It aims to improve counting accuracy in complex scenarios where traditional methods fall short, building on the Qwen-VL series of models. This post breaks down its background, technical approach, performance, and applications.


Section 02

Background: Challenges in Repetitive Action Counting

RAC has wide applications (fitness, industrial quality inspection, medical rehabilitation) but faces key issues:

  1. Poor time modeling: Hard to capture long-term temporal dependencies in videos.
  2. Limited generalization: Unstable performance across varying angles, lighting, or action variants.
  3. Lack of interpretability: No clear reasoning behind count results.

While MLLMs show strong visual understanding, applying them to precise counting tasks remains an open problem. ActRep-R1 targets this gap.

Section 03

Core Design of ActRep-R1 Framework

ActRep-R1 (by Yicheng Qiu et al.) is an open-source framework based on Qwen-VL series models (Qwen2-VL, Qwen2.5-VL, Qwen3-VL). Its core ideas:

  • Structured reasoning: Instead of direct number output, it first generates structured analysis of video content.
  • RL integration: Uses reinforcement learning to enhance counting accuracy.

Key capabilities: explicit reasoning, temporal awareness (understanding action cycles), and self-verification (improving accuracy via validation steps). A hypothetical example of the structured output is shown below.
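To make "think first, output later" concrete, here is one possible structured response. The <think>/<answer> tags and field names are our illustration, assumed for the example rather than confirmed by the paper:

```text
<think>
Action: push-up; subject centered, full body visible.
Cycle boundaries: lowest points at ~0.9 s, 2.3 s, 3.6 s, ..., 14.8 s.
Temporal consistency: intervals ~1.3-1.4 s; no partial cycle at the clip edges.
Self-verification: recounting the detected boundaries confirms the total.
</think>
<answer>12</answer>
```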

Section 04

Three-Stage Training Pipeline

ActRep-R1's training process has three critical stages:

  1. CoT Data Generation: Builds training data with detailed reasoning steps (how to observe the video, analyze cycles, and handle boundaries) instead of plain video + count labels.
  2. Supervised Fine-Tuning (SFT): Uses the CoT data to teach the model to follow the reasoning format (action recognition/localization, cycle boundary detection, temporal consistency check, final count).
  3. Group Relative Policy Optimization (GRPO): An RL stage that needs no separate reward model; advantages are estimated by comparing each sampled response against the others in its group. It also applies Random Count Sampling (RCS), a stratified sampling scheme, to counter count-distribution imbalance (small counts dominate the data). A sketch of the advantage computation follows this list.
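Below is a minimal sketch of the group-relative advantage that gives GRPO its name, following the standard formulation (each sampled response's reward is normalized by its group's mean and standard deviation); the function name and epsilon are our choices:

```python
import numpy as np

def grpo_advantages(group_rewards):
    """Group-relative advantage estimation: no value network or separate
    reward model; rewards are normalized within the group of responses
    sampled for the same video."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon guards against zero std

# Example: four sampled answers for one clip, scored by the hybrid reward
# (Section 05); the best answer receives the largest advantage.
print(grpo_advantages([1.0, 0.5, 0.5, 0.0]))
```

RCS then controls which clips enter these groups, stratifying draws by true count so that large-count videos are not underrepresented.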

Section 05

Hybrid Reward Function Design

The reward function combines two parts:

  1. Count Accuracy Reward: Rewards not only exact matches but also grants "off-by-one" tolerance (e.g., a prediction of 19 or 21 for a true count of 20 earns partial reward), avoiding overly sparse rewards.
  2. Format Compliance Reward: Ensures model outputs follow the predefined reasoning structure (analysis steps, validation links) to preserve interpretability.

This design balances numerical precision with reasoning-chain integrity; a minimal sketch of such a hybrid reward follows.
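The sketch below assumes illustrative partial-credit and weighting values (0.5 for off-by-one, an 0.8/0.2 mix) and the hypothetical <think>/<answer> tags from earlier; the paper's exact numbers and format may differ:

```python
def count_reward(pred: int, target: int) -> float:
    """Count-accuracy term with off-by-one tolerance."""
    if pred == target:
        return 1.0                 # exact match
    if abs(pred - target) == 1:
        return 0.5                 # e.g. 19 or 21 for a true count of 20
    return 0.0                     # otherwise no count reward

def format_reward(response: str) -> float:
    """Format-compliance term: the reasoning structure must be present."""
    has_structure = ("<think>" in response and "</think>" in response
                     and "<answer>" in response and "</answer>" in response)
    return 1.0 if has_structure else 0.0

def hybrid_reward(response: str, pred: int, target: int,
                  w_count: float = 0.8, w_fmt: float = 0.2) -> float:
    # Weights are illustrative assumptions, not the paper's values.
    return w_count * count_reward(pred, target) + w_fmt * format_reward(response)
```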

Section 06

Engineering Implementation & Best Practices

Toolchain:

  • Training: Supports DeepSpeed ZeRO-2/3, CPU Offload, multi-GPU evaluation.
  • Model support: Built-in Qwen series, modular for new VL models.
  • Evaluation: Local checkpoint assessment, API comparison (OpenAI/Gemini), and metrics such as OBO accuracy, exact match, MAE, and RMSE (sketched after this list).
  • Data: CSV/JSONL support, flexible preprocessing, configurable max pixels.
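The listed metrics can be computed as in the sketch below; note that some RAC papers report MAE normalized by the ground-truth count, while this sketch uses plain MAE:

```python
import numpy as np

def rac_metrics(preds, gts):
    """OBO accuracy (off-by-one, |pred - gt| <= 1), exact match,
    MAE, and RMSE over per-video count predictions."""
    p = np.asarray(preds, dtype=np.float64)
    g = np.asarray(gts, dtype=np.float64)
    err = np.abs(p - g)
    return {
        "obo":   float((err <= 1).mean()),
        "exact": float((err == 0).mean()),
        "mae":   float(err.mean()),
        "rmse":  float(np.sqrt(((p - g) ** 2).mean())),
    }
```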

Best Practices:

  • Reduce memory usage: Set --max_pixels for high-res videos (e.g., --max_pixels 262144, roughly 672×384) to cut the token count from ~20K to ~2.5K.
  • Training stability: Avoid in-training generation-based evaluation (it can hang under DeepSpeed ZeRO-3); evaluate after training instead.
  • Batch calculation: Equivalent batch size = GPU count × per-device batch × gradient accumulation steps (worked example below).
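A quick worked example of the batch arithmetic (the numbers are illustrative, not the repo's defaults):

```python
# 8 GPUs x per-device batch 2 x 4 gradient-accumulation steps
num_gpus, per_device_batch, grad_accum_steps = 8, 2, 4
effective_batch = num_gpus * per_device_batch * grad_accum_steps
print(effective_batch)  # 64
```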

Section 07

Application Prospects & Academic Value

ActRep-R1's value:

  • Practical use: Validated on the RepCount dataset and released open-source with documentation, ready for fitness, industrial, and medical applications.
  • Academic contribution: Explores how to apply MLLM reasoning to fine-grained visual tasks like RAC, establishing a "think first, output later" paradigm transferable to other tasks (object counting, motion analysis, quality assessment). It retains MLLMs' generalization while achieving specialist-level counting precision, making it a valuable resource for video understanding and for RL in vision tasks.