# ActRep-R1: Solving the Challenge of Video Repetitive Action Counting with Multimodal Large Models and Reinforcement Learning

> ActRep-R1 is a post-training framework that adapts multimodal large language models to the video repetitive action counting task via structured reasoning and reinforcement learning, addressing the counting accuracy issues of traditional methods in complex scenarios.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-05-12T07:55:11.000Z
- Last activity: 2026-05-12T07:59:20.770Z
- Popularity: 150.9
- Keywords: multimodal large models, reinforcement learning, video understanding, action counting, GRPO, Qwen-VL, computer vision, deep learning
- Page URL: https://www.zingnex.cn/en/forum/thread/actrep-r1
- Canonical: https://www.zingnex.cn/forum/thread/actrep-r1
- Markdown source: floors_fallback

---

## ActRep-R1: Solving Video Repetitive Action Counting with Multimodal LLMs & RL (Introduction)

ActRep-R1 is a post-training framework that tackles video repetitive action counting (RAC) by combining structured reasoning with reinforcement learning (RL) to adapt multimodal large language models (MLLMs) to the task. Building on the Qwen-VL model series, it aims to improve counting accuracy in complex scenarios where traditional methods fall short. This post breaks down its background, technical approach, performance, and applications.

## Background: Challenges in Repetitive Action Counting

RAC has wide applications (fitness, industrial quality inspection, medical rehabilitation) but faces three key issues:

1. Weak temporal modeling: difficulty capturing long-range temporal dependencies in video.
2. Limited generalization: unstable performance across varying camera angles, lighting conditions, or action variants.
3. Lack of interpretability: no explicit reasoning behind the count results.

While MLLMs show strong visual understanding, applying them to precise counting tasks remains an open problem; ActRep-R1 targets this gap.

## Core Design of ActRep-R1 Framework

ActRep-R1 (by Yicheng Qiu et al.) is an open-source framework built on the Qwen-VL series (Qwen2-VL, Qwen2.5-VL, Qwen3-VL). Its core ideas:

- **Structured reasoning**: instead of emitting a number directly, the model first generates a structured analysis of the video content.
- **RL integration**: reinforcement learning is used to sharpen counting accuracy.

Key capabilities: explicit reasoning, temporal awareness (understanding action cycles), and self-verification (improving accuracy through validation steps).

## Three-Stage Training Pipeline

ActRep-R1's training process has three critical stages:
1. **CoT Data Generation**: Builds training data with detailed reasoning steps (how to observe video, analyze cycles, handle boundaries) instead of just video + number labels.
2. **Supervised Fine-Tuning (SFT)**: Uses CoT data to teach the model to follow reasoning formats (action recognition/location, cycle boundary detection, temporal consistency check, final count).
3. **Group Relative Policy Optimization (GRPO)**: an RL stage that needs no separate reward model; advantages are estimated by comparing sampled outputs within a group. Random Count Sampling (RCS) additionally uses stratified sampling to counter count-distribution imbalance (small counts dominate the training data).
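The GRPO step above can be sketched in a few lines: sample a group of completions for the same video, score each with the reward function, and normalize rewards within the group to obtain advantages, with no learned critic or reward model involved. This is a generic GRPO sketch, not the ActRep-R1 implementation:

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Advantage of each sampled completion relative to its group:
    (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]
```

RCS then complements this by drawing training clips from ground-truth count bins, so rarer high-count examples are not drowned out by the abundant low-count ones.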

## Hybrid Reward Function Design

The reward function combines two parts:
1. **Count Accuracy Reward**: rewards not only exact matches but also near-misses within an "off-by-one" tolerance (e.g., predicting 19 or 21 against a true count of 20 earns partial reward), avoiding sparse reward signals.
2. **Format Compliance Reward**: Ensures the model outputs follow predefined reasoning structures (analysis steps, validation links) to maintain interpretability.
This design balances numerical precision and reasoning chain integrity.
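A minimal sketch of such a hybrid reward, assuming an R1-style `<think>`/`<answer>` output format and illustrative weights (neither is specified in the post):

```python
import re

def count_reward(pred: int, target: int) -> float:
    """Full credit for an exact count, partial credit off by one."""
    if pred == target:
        return 1.0
    if abs(pred - target) == 1:  # off-by-one tolerance softens sparse rewards
        return 0.5
    return 0.0

def format_reward(text: str) -> float:
    """Credit for following the required reasoning structure (assumed tags)."""
    ok = re.fullmatch(r"(?s)<think>.*?</think>\s*<answer>\d+</answer>", text.strip())
    return 1.0 if ok else 0.0

def hybrid_reward(text: str, target: int, w_count: float = 0.8, w_format: float = 0.2) -> float:
    m = re.search(r"<answer>(\d+)</answer>", text)
    acc = count_reward(int(m.group(1)), target) if m else 0.0
    return w_count * acc + w_format * format_reward(text)
```

Weighting accuracy above format keeps the optimization pressure on the count itself while still penalizing outputs that drop the reasoning trace.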

## Engineering Implementation & Best Practices

**Toolchain**:
- Training: Supports DeepSpeed ZeRO-2/3, CPU Offload, multi-GPU evaluation.
- Model support: Built-in Qwen series, modular for new VL models.
- Evaluation: Local checkpoint assessment, API comparison (OpenAI/Gemini), metrics like OBO accuracy, exact match, MAE, RMSE.
- Data: CSV/JSONL support, flexible preprocessing, configurable max pixels.
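The evaluation metrics listed above can be computed as follows; note that MAE here is the plain absolute count error, whereas some RAC papers normalize it by the ground-truth count:

```python
import math

def rac_metrics(preds: list[int], targets: list[int]) -> dict[str, float]:
    """Exact match, off-by-one (OBO) accuracy, MAE, and RMSE over counts."""
    n = len(preds)
    errs = [p - t for p, t in zip(preds, targets)]
    return {
        "exact": sum(e == 0 for e in errs) / n,      # exact-match rate
        "obo": sum(abs(e) <= 1 for e in errs) / n,   # off-by-one accuracy
        "mae": sum(abs(e) for e in errs) / n,
        "rmse": math.sqrt(sum(e * e for e in errs) / n),
    }
```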

**Best Practices**:
- Reduce memory usage: set `--max_pixels` (e.g., 262144, roughly 672×384) for high-resolution videos to cut the visual token count from ~20K to ~2.5K.
- Training stability: avoid generation-based evaluation during training (it can hang DeepSpeed ZeRO-3); evaluate after training instead.
- Batch calculation: Equivalent batch size = GPU count × per-device batch × gradient accumulation steps.
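The batch arithmetic from the last bullet, as code:

```python
def effective_batch_size(num_gpus: int, per_device_batch: int, grad_accum_steps: int) -> int:
    """Global batch size seen by the optimizer per update step."""
    return num_gpus * per_device_batch * grad_accum_steps
```

For example, 8 GPUs with a per-device batch of 2 and 4 accumulation steps give a global batch of 64.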

## Application Prospects & Academic Value

ActRep-R1's value:
- **Practical use**: Validated on the RepCount dataset; open-source with documentation, ready for fitness, industrial, and medical applications.
- **Academic contribution**: Explores how to apply MLLM reasoning to fine-grained visual tasks (like RAC), setting a "think first, output later" paradigm for other tasks (object counting, motion analysis, quality assessment).
It retains MLLMs' generalization while achieving professional counting precision, making it a valuable resource for video understanding and RL in vision tasks.
