Zing Forum


sympy-rlvr: Using Symbolic Verifiers to Replace Reward Models, Enabling Small Language Models to Master Mathematical Reasoning

A fully custom-implemented GRPO training framework that uses SymPy as a symbolic verifier to provide verifiable rewards. It effectively enhances the mathematical reasoning capabilities of small language models without the need to train reward models or LLM judges.

Tags: Reinforcement Learning · Mathematical Reasoning · GRPO · SymPy · Symbolic Verification · Small Language Models · Qwen · Reward Models · Curriculum Learning
Published 2026-03-29 15:14 · Recent activity 2026-03-29 15:25 · Estimated read: 8 min

Section 01

sympy-rlvr: Using Symbolic Verifiers to Replace Reward Models to Enhance Small LLMs' Mathematical Reasoning Capabilities

The sympy-rlvr project proposes an innovative approach: using SymPy symbolic verifiers to provide verifiable rewards, replacing traditional reward models or LLM judges. With a fully custom-implemented GRPO training framework, it effectively enhances the mathematical reasoning capabilities of small language models (e.g., Qwen2.5-0.5B/1.5B). Key advantages include no need for expensive reward model training, reliable reward signals, and a low training resource threshold (training can be completed on consumer-grade GPUs).


Section 02

Project Background and Core Issues

Current mainstream methods for improving LLMs' mathematical capabilities share several pain points: high training costs for reward models (relying on manual annotation or LLM-as-judge pipelines), noisy reward signals leading to unstable RL training, and the need for models with billions of parameters to be effective. The core insight of sympy-rlvr: mathematical answers have objective correctness and can be verified via symbolic computation libraries, eliminating the need for a neural network to guess whether an answer is correct.


Section 03

Core Components of the Technical Architecture: Symbolic Verification and Problem Synthesis

Symbolic Verifier (sympy_resolver.py)

The core component of the system. It uses Grok-4 as a tool-using model, derives correct answers via SymPy functions (solve, integrate, etc.), and only trusts SymPy's symbolic computation results.
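The heart of this idea is that two mathematically equivalent answers should be recognized as equal even when written differently. A minimal sketch of such a check, using SymPy directly (the function name `answers_match` is illustrative, not the project's actual API in `sympy_resolver.py`):

```python
import sympy as sp

def answers_match(candidate: str, reference: str) -> bool:
    """Return True if two expressions are symbolically equivalent."""
    try:
        a = sp.sympify(candidate)
        b = sp.sympify(reference)
    except (sp.SympifyError, TypeError):
        return False
    # simplify(a - b) == 0 catches equivalent forms,
    # e.g. "2*(x + 1)" vs "2*x + 2" or "1/2" vs "0.5"
    return sp.simplify(a - b) == 0
```

Because the check is a deterministic symbolic computation rather than a learned judgment, the reward signal it produces is exact, which is precisely why no neural reward model is needed.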

Problem Synthesis Pipeline (q_synthesis.py)

Generates mathematical problems of varying difficulty asynchronously and in parallel. Ensures answer accuracy through a double verification mechanism (difficult problems are solved independently twice; only those with consistent results are retained).
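The double-verification filter can be sketched as follows; in the real pipeline the two answers would come from two independent tool-using model passes, and `keep_problem`/`consistent` are illustrative names, not the actual `q_synthesis.py` API:

```python
import sympy as sp

def consistent(ans1: str, ans2: str) -> bool:
    """True if two independently derived answers agree symbolically."""
    return sp.simplify(sp.sympify(ans1) - sp.sympify(ans2)) == 0

def keep_problem(problem: dict) -> bool:
    # Difficult problems are solved twice by independent passes;
    # only problems whose two answers agree are retained.
    if problem["difficulty"] != "hard":
        return True
    return consistent(problem["answer_pass1"], problem["answer_pass2"])
```

Note that the agreement check reuses symbolic equivalence, so e.g. `sqrt(8)` and `2*sqrt(2)` count as consistent.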


Section 04

Multi-Dimensional Reward Function and GRPO Training Loop

Multi-Signal Reward Function

Design of 8-dimensional dense rewards:

  • Correctness (weight: 0.5): SymPy symbolic matching; partial rewards are given via exponential decay when answers are close to the correct one
  • Consistency (0.1): Whether the final answer aligns with the result of the last step in reasoning
  • Reasoning Depth (0.1): Number of operators and digits in reasoning
  • Numerical Grounding (0.1): Proportion of problem digits that appear in reasoning
  • Format Compliance (0.08): Whether XML tags (response, reasoning, final_answer) are used correctly
  • Parsability (0.07): Whether the final answer is in numerical form
  • Moderate Length (0.03): Reasoning word count is within the range of 50-400 words
  • Repetition Penalty (0.02): Penalties for repeated n-grams to prevent cyclic output
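The eight signals above combine into a single scalar via a weighted sum, with correctness carrying half the weight. A minimal sketch using the weights quoted above (the signal-scoring functions themselves are illustrative; only the exponential-decay idea for near-correct answers is taken from the article):

```python
import math

# Weights as listed above; they sum to 1.0
WEIGHTS = {
    "correctness": 0.50,
    "consistency": 0.10,
    "reasoning_depth": 0.10,
    "numerical_grounding": 0.10,
    "format": 0.08,
    "parsability": 0.07,
    "length": 0.03,
    "repetition": 0.02,
}

def correctness(pred: float, target: float, scale: float = 1.0) -> float:
    # Exact match -> 1.0; near misses get partial credit via exponential decay
    return math.exp(-abs(pred - target) / scale)

def total_reward(signals: dict) -> float:
    """Weighted sum of per-signal scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[k] * signals.get(k, 0.0) for k in WEIGHTS)
```

Dense, multi-dimensional rewards like this give the policy gradient useful signal even on rollouts whose final answer is wrong, which stabilizes early training.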

GRPO Training Loop

Fully hand-implemented, with four phases:

  • Rollout phase: Sample G answers per problem (eval mode)
  • Advantage calculation: Group relative normalization ((r-mean(r))/(std(r)+ε))
  • Loss function: PPO clipped ratio loss + KL divergence penalty (to prevent the policy from deviating from the SFT model)
  • LoRA fine-tuning: Freeze the base model and only train adapter parameters
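The advantage normalization and clipped loss above can be sketched in a dependency-free form (the real project operates on PyTorch tensors; the `clip` and `beta` values here are illustrative defaults, not the project's actual hyperparameters):

```python
import math

def group_advantages(rewards: list, eps: float = 1e-8) -> list:
    """Group-relative normalization: (r - mean(r)) / (std(r) + eps)."""
    g = len(rewards)
    mean = sum(rewards) / g
    # Sample standard deviation over the G rollouts of one problem
    var = sum((r - mean) ** 2 for r in rewards) / (g - 1)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

def grpo_step_loss(ratio: float, adv: float, kl: float,
                   clip: float = 0.2, beta: float = 0.04) -> float:
    """PPO clipped objective plus a KL penalty toward the SFT reference."""
    clipped = max(min(ratio, 1.0 + clip), 1.0 - clip)
    # Negate because we minimize the loss but maximize the objective
    return -min(ratio * adv, clipped * adv) + beta * kl
```

The group-relative advantage is what lets GRPO drop the learned value function entirely: each rollout is scored against its siblings for the same problem rather than against a critic's baseline.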

Section 05

Progressive Curriculum Learning Strategy

Adopts a three-stage progressive training strategy (mimicking the human learning process): Stage 1 (Easy) → Stage 2 (Medium) → Stage 3 (Hard). Each stage: 3 epochs, G=8, learning rate = 1e-5. Each stage loads the LoRA adapter from the previous stage to achieve continuous knowledge accumulation.
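The adapter-chaining loop can be sketched as below; `train_grpo` is a stub standing in for one full GRPO stage (the real project does the training described in Section 04 here), and the stage hyperparameters are the ones quoted above:

```python
def train_grpo(dataset: str, epochs: int, group_size: int,
               lr: float, init_adapter):
    # Stub for one curriculum stage; returns an adapter tag so the
    # next stage can resume from it (continuous knowledge accumulation).
    return f"lora-after-{dataset}"

STAGES = ["easy", "medium", "hard"]  # Stage 1 -> Stage 2 -> Stage 3

adapter = None  # Stage 1 starts from the bare SFT model
for difficulty in STAGES:
    adapter = train_grpo(dataset=difficulty, epochs=3,
                         group_size=8, lr=1e-5, init_adapter=adapter)
```

Chaining the LoRA adapter rather than retraining from scratch is what makes the curriculum cumulative: skills learned on easy problems survive into the harder stages.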


Section 06

Data Selection: Why Not Use GSM8K?

Reasons for choosing self-synthesized training data over GSM8K:

  1. Controllable difficulty: Adjust the problem difficulty knob to adapt to different training stages
  2. Diversity guarantee: Random topic injection ensures rich problem types
  3. Symbolic verification: All answers are verified via SymPy, eliminating annotation errors

Section 07

Tech Stack and Hardware Requirements


  • Base model: Qwen2.5-0.5B/1.5B
  • Verifier: Grok-4 via xai_sdk
  • Training framework: PyTorch + Accelerate (no TRL used)
  • Experiment tracking: MLflow (parameters, metrics, artifacts, etc.)
  • Hardware: NVIDIA RTX series (RTX 2000 Ada for SFT, RunPod instances for GRPO); consumer-grade GPUs are sufficient for training

Section 08

Implications for LLM Training Paradigms and Summary

Implications for LLM Training Paradigms

  • In specific domains (math, code, logic), symbolic verifiers are more reliable and efficient than neural network reward models
  • Small LLMs (0.5B-1.5B parameters) can achieve excellent performance with appropriate training methods, opening up possibilities for resource-constrained scenarios

Summary and Outlook

sympy-rlvr is a carefully designed mathematical reasoning training framework that enhances small LLM capabilities through symbolic verification rewards, multi-dimensional reward functions, and progressive curriculum learning. The open-source implementation (including the fully custom-written GRPO loop) provides a reference for researchers, and its methodology (finding domain-verifiable signals, multi-dimensional rewards, progressive training) can be extended to more scenarios.