# sympy-rlvr: Using Symbolic Verifiers to Replace Reward Models, Enabling Small Language Models to Master Mathematical Reasoning

> A fully custom-implemented GRPO training framework that uses SymPy as a symbolic verifier to provide verifiable rewards. It effectively enhances the mathematical reasoning capabilities of small language models without the need to train reward models or LLM judges.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-03-29T07:14:16.000Z
- Last activity: 2026-03-29T07:25:30.484Z
- Popularity: 161.8
- Keywords: reinforcement learning, mathematical reasoning, GRPO, SymPy, symbolic verification, small language models, Qwen, reward models, curriculum learning
- Page link: https://www.zingnex.cn/en/forum/thread/sympy-rlvr
- Canonical: https://www.zingnex.cn/forum/thread/sympy-rlvr
- Markdown source: floors_fallback

---

## Overview

The sympy-rlvr project replaces a learned reward model or LLM judge with a SymPy symbolic verifier that produces verifiable rewards. Combined with a GRPO training loop written from scratch, it improves the mathematical reasoning of small language models such as Qwen2.5-0.5B and Qwen2.5-1.5B. The main advantages are that no expensive reward model has to be trained, the reward signal is reliable, and the hardware bar is low: training can run on consumer-grade GPUs.

## Project Background and Core Issues

Mainstream methods for improving an LLM's mathematical ability share three pain points: reward models are expensive to train (they depend on manual annotation or an LLM-as-judge process), noisy reward signals make RL training unstable, and the methods typically need models with billions of parameters to work well. The core insight of sympy-rlvr is that mathematical answers are objectively right or wrong and can be checked with a symbolic computation library, so no neural network has to guess whether an answer is correct.

## Core Components of the Technical Architecture: Symbolic Verification and Problem Synthesis

### Symbolic Verifier (sympy_resolver.py)
This is the core component of the system. Grok-4 acts as a tool-using model that derives the correct answer by calling SymPy functions (solve, integrate, etc.); only SymPy's symbolic results are trusted.
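
To make the idea of trusting only symbolic results concrete, here is a minimal sketch of a SymPy-based equivalence check. The function name `answers_match` and the sample problems are illustrative assumptions, not code from `sympy_resolver.py`; only the SymPy calls themselves (`sympify`, `simplify`, `solve`, `integrate`) are real library functions.

```python
import sympy as sp


def answers_match(candidate: str, reference: str) -> bool:
    """Return True when two answer strings are symbolically equal.

    Hypothetical helper: the project's real comparison logic is not shown
    in the source, so this is only one plausible way to do it.
    """
    try:
        diff = sp.simplify(sp.sympify(candidate) - sp.sympify(reference))
    except (sp.SympifyError, TypeError):
        return False  # an unparsable answer never counts as correct
    return diff == 0


x = sp.symbols("x")

# Reference answers are computed by SymPy rather than guessed by a model.
roots = sp.solve(sp.Eq(x**2 - 5 * x + 6, 0), x)  # roots of a quadratic
area = sp.integrate(2 * x, (x, 0, 3))            # a definite integral
```

Note that `sympify` evaluates its input as an expression, so a real pipeline would likely restrict or sanitize model output before parsing it.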

### Problem Synthesis Pipeline (q_synthesis.py)
Generates mathematical problems of varying difficulty asynchronously and in parallel. A double-verification mechanism protects answer accuracy: difficult problems are solved twice independently, and only those whose results agree are kept.
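
The double-verification step can be sketched with `asyncio`. Everything named here (`verify_twice`, `filter_problems`, `stub_solver`) is a hypothetical stand-in, since the source does not show the internals of `q_synthesis.py`. For brevity the sketch solves every problem twice and compares answers as plain strings, whereas the project applies double verification to difficult problems and could compare results symbolically.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# A solver takes a problem statement and returns an answer string.
Solver = Callable[[str], Awaitable[str]]


async def verify_twice(problem: str, solve: Solver) -> Optional[str]:
    """Solve a problem in two independent runs; keep it only if they agree."""
    first, second = await asyncio.gather(solve(problem), solve(problem))
    return first if first == second else None


async def filter_problems(problems: list[str], solve: Solver) -> dict[str, str]:
    """Verify all problems concurrently and drop the inconsistent ones."""
    answers = await asyncio.gather(*(verify_twice(p, solve) for p in problems))
    return {p: a for p, a in zip(problems, answers) if a is not None}


async def stub_solver(problem: str) -> str:
    """Hypothetical placeholder for the real SymPy-backed solver."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return str(len(problem))
```

A call such as `asyncio.run(filter_problems(problems, stub_solver))` returns a mapping from each retained problem to its agreed answer.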

## Multi-Dimensional Reward Function and GRPO Training Loop

### Multi-Signal Reward Function
The dense reward combines eight signals (weights in parentheses):
- Correctness (0.5): SymPy symbolic matching; partial rewards are given via exponential decay when the answer is close to the correct one
- Consistency (0.1): Whether the final answer agrees with the result of the last reasoning step
- Reasoning Depth (0.1): Number of operators and digits in the reasoning
- Numerical Grounding (0.1): Proportion of the problem's digits that appear in the reasoning
- Format Compliance (0.08): Whether the XML tags (response, reasoning, final_answer) are used correctly
- Parsability (0.07): Whether the final answer is in numerical form
- Moderate Length (0.03): Whether the reasoning length falls within 50-400 words
- Repetition Penalty (0.02): Penalties for repeated n-grams to prevent cyclic output
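
The weights above sum to 1.0, so the total reward can be read as a weighted average of per-signal scores. The sketch below shows one way to combine them. The weights come from the list; the function names, the relative-error form of the exponential decay, the `scale` parameter, and the convention that every signal (including the repetition term) is scored in [0, 1] are assumptions made for illustration.

```python
import math

# Weights as listed above; they sum to 1.0.
WEIGHTS = {
    "correctness": 0.50,
    "consistency": 0.10,
    "reasoning_depth": 0.10,
    "numerical_grounding": 0.10,
    "format_compliance": 0.08,
    "parsability": 0.07,
    "moderate_length": 0.03,
    "repetition_penalty": 0.02,
}


def correctness_score(predicted: float, target: float, scale: float = 1.0) -> float:
    """1.0 for an exact match, decaying exponentially with the relative error.

    The decay form and `scale` are assumed; the source only says that
    near-correct answers earn partial reward via exponential decay.
    """
    rel_err = abs(predicted - target) / max(abs(target), 1e-8)
    return math.exp(-rel_err / scale)


def total_reward(scores: dict[str, float]) -> float:
    """Weighted sum of per-signal scores; missing signals count as 0.0."""
    return sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)
```

Because correctness carries half the weight, a response that is well formatted but wrong can earn at most 0.5 under this scheme.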

### GRPO Training Loop
Implemented entirely by hand:
- Rollout phase: Sample G answers per problem with the model in eval mode
- Advantage calculation: Group-relative normalization, A = (r - mean(r)) / (std(r) + ε)
- Loss function: PPO clipped-ratio loss plus a KL divergence penalty (to keep the policy from drifting away from the SFT model)
- LoRA fine-tuning: Freeze the base model and only train adapter parameters
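
The arithmetic of the loop can be illustrated without a deep learning framework. The following sketch uses plain Python floats in place of PyTorch tensors and one log-probability per sampled answer rather than per token. The values of `clip_eps` and `kl_coef`, the use of the population standard deviation, and the particular KL estimator are all assumptions; the source does not specify them.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative normalization: (r - mean(r)) / (std(r) + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_objective(logp_new: float, logp_old: float,
                      advantage: float, clip_eps: float = 0.2) -> float:
    """PPO clipped surrogate for one sample (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def grpo_loss(logp_new: list[float], logp_old: list[float],
              logp_ref: list[float], rewards: list[float],
              clip_eps: float = 0.2, kl_coef: float = 0.04) -> float:
    """Mean of (-surrogate + kl_coef * KL) over one group of G samples."""
    advantages = group_advantages(rewards)
    terms = []
    for new, old, ref, adv in zip(logp_new, logp_old, logp_ref, advantages):
        surrogate = clipped_objective(new, old, adv, clip_eps)
        # Non-negative estimator of KL(policy || reference): r - log(r) - 1
        log_ratio = ref - new
        kl = math.exp(log_ratio) - log_ratio - 1.0
        terms.append(-(surrogate - kl_coef * kl))
    return mean(terms)
```

When all G rewards in a group are equal, every advantage is zero and only the KL term contributes, which is why groups with mixed outcomes carry the learning signal.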

## Progressive Curriculum Learning Strategy

Training follows a three-stage curriculum that mirrors how people learn: Stage 1 (Easy) → Stage 2 (Medium) → Stage 3 (Hard). Each stage runs for 3 epochs with G=8 and a learning rate of 1e-5, and loads the LoRA adapter from the previous stage so that knowledge accumulates across stages.
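
As a rough sketch, the three stages could be described by a small configuration object in which each stage points at the adapter saved by the one before it. The hyperparameters (3 epochs, G=8, learning rate 1e-5) come from the text; the class name, field names, directory layout, and the choice to start Stage 1 without a prior stage adapter are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StageConfig:
    name: str
    difficulty: str
    epochs: int = 3
    group_size: int = 8            # G: answers sampled per problem
    learning_rate: float = 1e-5
    init_adapter: Optional[str] = None  # adapter path to load before training


def build_curriculum(output_root: str = "adapters") -> list[StageConfig]:
    """Chain the stages so each one starts from the previous stage's adapter."""
    stages: list[StageConfig] = []
    previous: Optional[str] = None  # assumed: no earlier stage adapter for Stage 1
    for index, difficulty in enumerate(["easy", "medium", "hard"], start=1):
        stage = StageConfig(name=f"stage{index}", difficulty=difficulty,
                            init_adapter=previous)
        stages.append(stage)
        previous = f"{output_root}/{stage.name}"  # hypothetical save location
    return stages
```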

## Data Selection: Why Not Use GSM8K?

Reasons for choosing self-synthesized training data over GSM8K:
1. Controllable difficulty: A difficulty knob tunes problems to each training stage
2. Diversity: Random topic injection keeps problem types varied
3. Symbolic verification: Every answer is verified with SymPy, which eliminates annotation errors

## Tech Stack and Hardware Requirements

- Base model: Qwen2.5-0.5B / 1.5B
- Verifier: Grok-4 via xai_sdk
- Training framework: PyTorch + Accelerate (no TRL used)
- Experiment tracking: MLflow (parameters, metrics, artifacts, etc.)
- Hardware: NVIDIA RTX-series GPUs (RTX 2000 Ada for SFT, RunPod for GRPO). Consumer-grade GPUs are sufficient for training.

## Implications for LLM Training Paradigms and Summary

### Implications for LLM Training Paradigms
- In domains with checkable answers (math, code, logic), symbolic verifiers are more reliable and efficient than neural reward models
- Small LLMs (0.5B-1.5B parameters) can perform well when trained with suitable methods, which opens up options for resource-constrained settings

### Summary and Outlook
sympy-rlvr is a carefully designed training framework for mathematical reasoning that strengthens small LLMs through symbolic verification rewards, a multi-dimensional reward function, and progressive curriculum learning. The open-source implementation, including the custom GRPO loop, gives researchers a reference, and its methodology (find a verifiable signal for the domain, combine multiple reward dimensions, train progressively) can carry over to other scenarios.
