
math-qa-llm: A Math Problem Solving Pipeline Based on Qwen3-4B-Thinking

A large language model reasoning system for math competition scenarios, supporting both free-form answers and multiple-choice questions. It adopts an adaptive two-stage reasoning strategy and a self-consistency voting mechanism to achieve efficient and accurate math problem solving on public datasets.

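Tags: math-qa-llm, Qwen3-4B-Thinking, mathematical reasoning, large language model, adaptive reasoning, self-consistency, QLoRA, GRPO, reinforcement learning, multi-stage reasoning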

Published 2026-05-26 05:36 | Recent activity 2026-05-26 05:49 | Estimated read 9 min


Section 01

math-qa-llm Project Introduction: A Math Problem Solving Pipeline Based on Qwen3-4B-Thinking

math-qa-llm is a large language model reasoning system for math competition scenarios, supporting both free-form answers and multiple-choice questions. It uses an adaptive two-stage reasoning strategy and a self-consistency voting mechanism to achieve efficient and accurate math problem solving on public datasets. The project is maintained by sardorsob and hosted on GitHub (https://github.com/sardorsob/math-qa-llm); the repository was last updated on 2026-05-25T21:36:35Z.

Section 02

Project Background and Motivation

Math problem solving is a core benchmark for evaluating LLM reasoning capabilities, requiring precise symbolic computation, multi-step logical deduction, and strict answer formatting. Traditional end-to-end fine-tuning struggles to capture the complexity of mathematical reasoning, especially for competition-level problems that demand deep thinking, self-verification, and error correction. math-qa-llm was built for the CSE151B course competition task. It implements a complete workflow from data loading to result submission and introduces an adaptive multi-stage reasoning strategy to improve reasoning quality.

Section 03

Core Architecture and Technology Selection

Base Model Selection

Qwen/Qwen3-4B-Thinking-2507 is chosen as the base model. This is a small LLM optimized for reasoning; its 4B parameters balance reasoning ability against runtime efficiency on a single mid-range GPU (e.g., an A30 with 24GB VRAM). The Thinking variant enhances long-chain reasoning capabilities (displaying intermediate processes via ... tags).

Environment Adaptation Strategy

vLLM Path: Suitable for CUDA 13+ environments, supporting high-throughput batch inference and enabling self-consistency voting with N=8
Transformers Path: For older environments such as CUDA 12.8, using HuggingFace Transformers' model.generate() for chunked batch generation

The dual-path design keeps the pipeline running stably across environments ranging from local workstations to cloud A100 instances.
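
The backend choice described above can be sketched as a small dispatch function. This is a minimal illustration, not the project's actual code: the function name `select_backend` and the `STAGE2_SAMPLES` table are hypothetical, and only the CUDA thresholds and sample counts come from the text.

```python
# Hypothetical helper: pick an inference backend from the CUDA version string.
# The "CUDA 13+ -> vLLM, otherwise Transformers" rule is taken from the text;
# everything else (names, structure) is an assumption for illustration.

def select_backend(cuda_version: str) -> str:
    """Return "vllm" for CUDA 13 or newer, otherwise "transformers"."""
    major = int(cuda_version.split(".")[0])
    return "vllm" if major >= 13 else "transformers"


# Stage-2 self-consistency sample counts quoted in the article.
STAGE2_SAMPLES = {"vllm": 8, "transformers": 3}

if __name__ == "__main__":
    backend = select_backend("12.8")
    print(backend, STAGE2_SAMPLES[backend])
```

Keeping the decision in one function means the rest of the pipeline only needs to know which backend name it received, not why.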

Section 04

Adaptive Two-Stage Reasoning Mechanism

Stage 1: Fast Initial Screening

Configuration parameters: Thinking budget (1024 tokens for Transformers path / 4096 tokens for vLLM path), maximum output length (4096 tokens for Transformers / 6144 tokens for vLLM), sampling temperature 0.6, sampling count N=1. The goal is to generate an initial answer quickly and to use uncertainty signals to flag difficult problems for Stage 2.

Stage 2: Deep Retry and Self-Consistency

Enhanced configuration for uncertain problems: Thinking budget (4096 tokens for Transformers / 8192 tokens for vLLM), maximum output length (5120 tokens for Transformers / 6144 tokens for vLLM), sampling temperature 0.65, repetition penalty 1.05, sampling count N=3 (Transformers) / N=8 (vLLM). Majority voting then selects the most frequent answer among the samples to improve accuracy.
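
The two stages and the vote can be sketched as follows, using the Transformers-path values quoted above. This is a rough sketch under stated assumptions: `StageConfig`, `majority_vote`, `solve`, and the `generate` callback are hypothetical names, and the article does not say which uncertainty signal triggers Stage 2, so the sketch simply treats "no answer could be extracted" as that signal.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class StageConfig:
    thinking_budget: int
    max_new_tokens: int
    temperature: float
    n_samples: int
    repetition_penalty: float = 1.0


# Transformers-path values quoted in the article.
STAGE1 = StageConfig(1024, 4096, 0.6, 1)
STAGE2 = StageConfig(4096, 5120, 0.65, 3, repetition_penalty=1.05)

# Hypothetical generator signature: returns one extracted answer per sample,
# or None where no answer could be extracted.
Generate = Callable[[str, StageConfig], List[Optional[str]]]


def majority_vote(answers: List[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-empty answer, or None if there is none."""
    valid = [a for a in answers if a]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


def solve(question: str, generate: Generate) -> Optional[str]:
    """Run Stage 1; fall back to Stage 2 voting when Stage 1 is uncertain."""
    first = generate(question, STAGE1)
    # Assumption: a missing answer is the uncertainty signal.
    if first and first[0]:
        return first[0]
    return majority_vote(generate(question, STAGE2))
```

Because easy problems exit after a single short sample, the larger Stage 2 budget is spent only where Stage 1 failed to produce a usable answer.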

Chunked Batch Processing and Checkpoints

The Transformers path uses CHUNK_SIZE=6 to balance memory and efficiency; a fine-grained checkpoint mechanism writes results to checkpoint.jsonl, supporting resume from interruptions.
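
A chunked loop with a resumable JSONL checkpoint might look like the sketch below. Only `CHUNK_SIZE=6` and the file name `checkpoint.jsonl` come from the text; the record layout (`id`, `answer`), the function names, and the `solve_batch` callback are assumptions made for illustration.

```python
import json
import os
from typing import Callable, Dict, List, Set

CHUNK_SIZE = 6  # value quoted in the article


def load_done_ids(path: str) -> Set:
    """Collect the ids already written to the checkpoint file, if it exists."""
    done = set()
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    done.add(json.loads(line)["id"])
    return done


def run_chunked(
    problems: List[Dict],
    solve_batch: Callable[[List[Dict]], List[str]],
    path: str = "checkpoint.jsonl",
) -> int:
    """Solve unfinished problems chunk by chunk; return how many were processed."""
    done = load_done_ids(path)
    pending = [p for p in problems if p["id"] not in done]
    with open(path, "a", encoding="utf-8") as f:
        for start in range(0, len(pending), CHUNK_SIZE):
            chunk = pending[start:start + CHUNK_SIZE]
            answers = solve_batch(chunk)
            for problem, answer in zip(chunk, answers):
                f.write(json.dumps({"id": problem["id"], "answer": answer}) + "\n")
            f.flush()  # persist each chunk before starting the next one
    return len(pending)
```

Appending one JSON object per line means an interrupted run loses at most the chunk in flight, and a restart skips every id that is already on disk.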

Section 05

Training Optimization: QLoRA and GRPO Reinforcement Learning

QLoRA Supervised Fine-Tuning

Quantization config: 4-bit NF4 quantization + double quantization
Sequence length: 4096 tokens (A30) / 8192 tokens (A100)
Learning rate: 5e-5
Training data: 15,000 high-quality questions from the NuminaMath dataset
Training epochs: 2
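
The settings listed above can be collected into one place, as in this sketch. The helper `qlora_settings` and its dictionary layout are hypothetical; the three quantization keys are named after the corresponding keyword arguments of `transformers.BitsAndBytesConfig`, so the inner dictionary could plausibly be passed to it as keyword arguments. The project's real training script may be organized quite differently.

```python
from typing import Any, Dict


def qlora_settings(gpu: str) -> Dict[str, Any]:
    """Return the QLoRA hyperparameters quoted in the article for a given GPU."""
    if gpu not in ("A30", "A100"):
        raise ValueError(f"no settings listed for GPU {gpu!r}")
    return {
        # Keys mirror transformers.BitsAndBytesConfig keyword arguments.
        "quantization": {
            "load_in_4bit": True,
            "bnb_4bit_quant_type": "nf4",
            "bnb_4bit_use_double_quant": True,
        },
        "max_seq_length": 4096 if gpu == "A30" else 8192,
        "learning_rate": 5e-5,
        "num_train_epochs": 2,
        "num_train_examples": 15000,  # NuminaMath subset size from the article
    }
```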

GRPO Reinforcement Learning Optimization

Group size G: 4 (A30) / 8 (A100)
Maximum generation length: 2048 tokens
Learning rate: 5e-7
KL divergence coefficient Beta: 0.1

GRPO does not require a separate value network: each sampled answer is scored relative to the other answers in its group. This lowers the memory and compute cost of training and steers the model toward more reliable reasoning strategies.
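
To illustrate why no value network is needed, the sketch below computes group-relative advantages in the way GRPO is commonly described: each reward is normalized by the mean and standard deviation of its own group. The function name and the `eps` term are assumptions, and variants such as Dr. GRPO drop the standard-deviation division, so this is one possible formulation rather than the project's exact one.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its own group (mean and std of the group)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # G=4 sampled answers to one question, reward 1.0 if correct else 0.0.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When all G answers receive the same reward, every advantage is zero and that question contributes no gradient signal, which is one reason the group size matters.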

Section 06

Answer Extraction and Scoring Mechanism

Supports two answer formats:

Free-form answers: The final answer is wrapped in \boxed{...}; multiple answer slots ([ANS]) are supported
Multiple-choice answers: The option letter (A/B/C/D/E) is output directly

The scoring module judger.py applies a numerical tolerance (to absorb floating-point error) and guards against division by zero, so that mathematically equivalent answers are recognized as correct.
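
The extraction and tolerance check can be sketched as follows. The contents of judger.py are not shown in the article, so the function names, the tolerance values, and the limited set of accepted number formats (plain decimals and simple a/b fractions) are all assumptions chosen for illustration.

```python
import math
import re
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, handling nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    out = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out).strip()
        out.append(ch)
    return None  # braces never balanced


def to_number(s: str) -> Optional[float]:
    """Parse a decimal or a simple a/b fraction; return None if not numeric."""
    s = s.strip().replace(",", "")
    m = re.fullmatch(r"(-?\d+(?:\.\d+)?)\s*/\s*(-?\d+(?:\.\d+)?)", s)
    try:
        if m:
            denominator = float(m.group(2))
            if denominator == 0:
                return None  # division-by-zero guard
            return float(m.group(1)) / denominator
        return float(s)
    except ValueError:
        return None


def is_equivalent(pred: str, gold: str, rel_tol: float = 1e-6) -> bool:
    """Compare numerically within a tolerance, else fall back to exact text."""
    a, b = to_number(pred), to_number(gold)
    if a is not None and b is not None:
        return math.isclose(a, b, rel_tol=rel_tol, abs_tol=1e-9)
    return pred.strip() == gold.strip()
```

The text fallback covers multiple-choice letters, while the numeric path lets answers such as 1/2 and 0.5 match.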

Section 07

Performance Expectations and Experimental Results

Expected accuracy under different configurations:

Configuration Stage	Expected Accuracy
Baseline (Bug Fixes Only)	47-52%
+ QLoRA Fine-Tuning	≥42% (Baseline Retained)
+ GRPO Reinforcement Learning (G=8)	60-75%
Best Case (POLARIS Equivalent Configuration)	Up to 79%

These figures are expectations rather than measured results. They are based on recently published work such as DAPO, Dr. GRPO, and POLARIS-4B, which demonstrates the potential of small reasoning models in specific domains.

Section 08

Technical Highlights and Project Insights

Technical Highlights

Adaptive compute allocation: Reasoning resources are adjusted dynamically based on problem difficulty
Self-consistency mechanism: Majority voting improves reliability for complex problems
Engineering robustness: Fine-grained checkpoints, environment adaptation, and dual-path backend ensure stable operation
Progressive optimization: Layered progression from baseline fixes to supervised fine-tuning and reinforcement learning

Insights

math-qa-llm suggests that small models with 4B parameters can perform well on complex mathematical reasoning tasks when combined with well-designed reasoning strategies, adaptive compute allocation, and reinforcement learning optimization. The 'small model + strong strategy' paradigm may become an important direction for domain-specific applications.