
ActRep-R1: Solving the Challenge of Video Repetitive Action Counting with Multimodal Large Models and Reinforcement Learning

ActRep-R1 is a post-training framework that adapts multimodal large language models to the video repetitive action counting task via structured reasoning and reinforcement learning, addressing the counting accuracy issues of traditional methods in complex scenarios.

Tags: Multimodal LLMs · Reinforcement Learning · Video Understanding · Action Counting · GRPO · Qwen-VL · Computer Vision · Deep Learning
Published 2026-05-12 15:55 · Recent activity 2026-05-12 15:59 · Estimated read: 7 min

Section 01

ActRep-R1: Solving Video Repetitive Action Counting with Multimodal LLMs & RL (Introduction)

ActRep-R1 is an innovative post-training framework that addresses the challenges of video repetitive action counting (RAC) by combining structured reasoning and reinforcement learning (RL) to adapt multimodal large language models (MLLMs) to the task. It aims to improve counting accuracy in complex scenarios where traditional methods fall short, building on the Qwen-VL series of models. This post breaks down its background, technical approach, performance, and applications.


Section 02

Background: Challenges in Repetitive Action Counting

RAC has wide applications (fitness, industrial quality inspection, medical rehabilitation) but faces key issues:

  1. Poor time modeling: Hard to capture long-term temporal dependencies in videos.
  2. Limited generalization: Unstable performance across varying angles, lighting, or action variants.
  3. Lack of interpretability: No clear reasoning behind count results.

While MLLMs show strong visual understanding, applying them to precise counting tasks remains an open problem. ActRep-R1 targets this gap.

Section 03

Core Design of ActRep-R1 Framework

ActRep-R1 (by Yicheng Qiu et al.) is an open-source framework based on Qwen-VL series models (Qwen2-VL, Qwen2.5-VL, Qwen3-VL). Its core ideas:

  • Structured reasoning: Instead of direct number output, it first generates structured analysis of video content.
  • RL integration: Uses reinforcement learning to enhance counting accuracy.

Key capabilities: explicit reasoning, temporal awareness (understanding action cycles), and self-verification (improving accuracy via validation steps). A hypothetical example of the structured output is shown below.
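To make "think first, output later" concrete, here is one possible structured response. The <think>/<answer> tags and field names are our illustration, assumed for the example rather than confirmed by the paper:

```text
<think>
Action: push-up; subject centered, full body visible.
Cycle boundaries: lowest points at ~0.9 s, 2.3 s, 3.6 s, ..., 14.8 s.
Temporal consistency: intervals ~1.3-1.4 s; no partial cycle at the clip edges.
Self-verification: recounting the detected boundaries confirms the total.
</think>
<answer>12</answer>
```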

Section 04

Three-Stage Training Pipeline

ActRep-R1's training process has three critical stages:

  1. CoT Data Generation: Builds training data with detailed reasoning steps (how to observe the video, analyze cycles, and handle boundaries) instead of plain video + count labels.
  2. Supervised Fine-Tuning (SFT): Uses the CoT data to teach the model to follow the reasoning format (action recognition/localization, cycle boundary detection, temporal consistency check, final count).
  3. Group Relative Policy Optimization (GRPO): An RL stage that needs no separate reward model; advantages are estimated by comparing each sampled response against the others in its group. It also applies Random Count Sampling (RCS), a stratified sampling scheme, to counter count-distribution imbalance (small counts dominate the data). A sketch of the advantage computation follows this list.
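Below is a minimal sketch of the group-relative advantage that gives GRPO its name, following the standard formulation (each sampled response's reward is normalized by its group's mean and standard deviation); the function name and epsilon are our choices:

```python
import numpy as np

def grpo_advantages(group_rewards):
    """Group-relative advantage estimation: no value network or separate
    reward model; rewards are normalized within the group of responses
    sampled for the same video."""
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon guards against zero std

# Example: four sampled answers for one clip, scored by the hybrid reward
# (Section 05); the best answer receives the largest advantage.
print(grpo_advantages([1.0, 0.5, 0.5, 0.0]))
```

RCS then controls which clips enter these groups, stratifying draws by true count so that large-count videos are not underrepresented.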

Section 05

Hybrid Reward Function Design

The reward function combines two parts:

  1. Count Accuracy Reward: Rewards not only exact matches but also grants "off-by-one" tolerance (e.g., a prediction of 19 or 21 for a true count of 20 earns partial reward), avoiding overly sparse rewards.
  2. Format Compliance Reward: Ensures model outputs follow the predefined reasoning structure (analysis steps, validation links) to preserve interpretability.

This design balances numerical precision with reasoning-chain integrity; a minimal sketch of such a hybrid reward follows.
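The sketch below assumes illustrative partial-credit and weighting values (0.5 for off-by-one, an 0.8/0.2 mix) and the hypothetical <think>/<answer> tags from earlier; the paper's exact numbers and format may differ:

```python
def count_reward(pred: int, target: int) -> float:
    """Count-accuracy term with off-by-one tolerance."""
    if pred == target:
        return 1.0                 # exact match
    if abs(pred - target) == 1:
        return 0.5                 # e.g. 19 or 21 for a true count of 20
    return 0.0                     # otherwise no count reward

def format_reward(response: str) -> float:
    """Format-compliance term: the reasoning structure must be present."""
    has_structure = ("<think>" in response and "</think>" in response
                     and "<answer>" in response and "</answer>" in response)
    return 1.0 if has_structure else 0.0

def hybrid_reward(response: str, pred: int, target: int,
                  w_count: float = 0.8, w_fmt: float = 0.2) -> float:
    # Weights are illustrative assumptions, not the paper's values.
    return w_count * count_reward(pred, target) + w_fmt * format_reward(response)
```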

Section 06

Engineering Implementation & Best Practices

Toolchain:

  • Training: Supports DeepSpeed ZeRO-2/3, CPU Offload, multi-GPU evaluation.
  • Model support: Built-in Qwen series, modular for new VL models.
  • Evaluation: Local checkpoint assessment, API comparison (OpenAI/Gemini), and metrics such as OBO accuracy, exact match, MAE, and RMSE (sketched after this list).
  • Data: CSV/JSONL support, flexible preprocessing, configurable max pixels.
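The listed metrics can be computed as in the sketch below; note that some RAC papers report MAE normalized by the ground-truth count, while this sketch uses plain MAE:

```python
import numpy as np

def rac_metrics(preds, gts):
    """OBO accuracy (off-by-one, |pred - gt| <= 1), exact match,
    MAE, and RMSE over per-video count predictions."""
    p = np.asarray(preds, dtype=np.float64)
    g = np.asarray(gts, dtype=np.float64)
    err = np.abs(p - g)
    return {
        "obo":   float((err <= 1).mean()),
        "exact": float((err == 0).mean()),
        "mae":   float(err.mean()),
        "rmse":  float(np.sqrt(((p - g) ** 2).mean())),
    }
```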

Best Practices:

  • Reduce memory usage: Set --max_pixels for high-res videos (e.g., --max_pixels 262144, roughly 672×384) to cut the token count from ~20K to ~2.5K.
  • Training stability: Avoid in-training generation-based evaluation (it can hang under DeepSpeed ZeRO-3); evaluate after training instead.
  • Batch calculation: Equivalent batch size = GPU count × per-device batch × gradient accumulation steps (worked example below).
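A quick worked example of the batch arithmetic (the numbers are illustrative, not the repo's defaults):

```python
# 8 GPUs x per-device batch 2 x 4 gradient-accumulation steps
num_gpus, per_device_batch, grad_accum_steps = 8, 2, 4
effective_batch = num_gpus * per_device_batch * grad_accum_steps
print(effective_batch)  # 64
```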

Section 07

Application Prospects & Academic Value

ActRep-R1's value:

  • Practical use: Validated on the RepCount dataset and released open-source with documentation, ready for fitness, industrial, and medical applications.
  • Academic contribution: Explores how to apply MLLM reasoning to fine-grained visual tasks like RAC, establishing a "think first, output later" paradigm transferable to other tasks (object counting, motion analysis, quality assessment). It retains MLLMs' generalization while achieving specialist-level counting precision, making it a valuable resource for video understanding and for RL in vision tasks.