Zing Forum

math-qa-llm: A Math Problem Solving Pipeline Based on Qwen3-4B-Thinking

A large language model reasoning system for math competition scenarios, supporting both free-form answers and multiple-choice questions. It adopts an adaptive two-stage reasoning strategy and a self-consistency voting mechanism to achieve efficient and accurate math problem solving on public datasets.

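Tags: math-qa-llm, Qwen3-4B-Thinking, mathematical reasoning, large language model, adaptive reasoning, self-consistency, QLoRA, GRPO, reinforcement learning, multi-stage reasoning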
Published 2026-05-26 05:36 | Recent activity 2026-05-26 05:49 | Estimated read 9 min

Section 01

math-qa-llm Project Introduction: A Math Problem Solving Pipeline Based on Qwen3-4B-Thinking

math-qa-llm is a large language model reasoning system for math competition scenarios. It handles both free-form answers and multiple-choice questions, and combines an adaptive two-stage reasoning strategy with self-consistency voting to solve math problems from public datasets efficiently and accurately. The project is maintained by sardorsob on GitHub (https://github.com/sardorsob/math-qa-llm) and was last updated on 2026-05-25T21:36:35Z.


Section 02

Project Background and Motivation

Math problem solving is a core benchmark for evaluating LLM reasoning capabilities: it requires precise symbolic computation, multi-step logical deduction, and strict answer formatting. Traditional end-to-end fine-tuning struggles to capture the complexity of mathematical reasoning, especially for competition-level problems that demand deep thinking, self-verification, and error correction. math-qa-llm was built for the CSE151B course competition task. It implements a complete workflow from data loading to result submission and introduces an adaptive multi-stage reasoning strategy to improve reasoning quality.


Section 03

Core Architecture and Technology Selection

Base Model Selection

Qwen/Qwen3-4B-Thinking-2507 is chosen as the base model. It is a small LLM optimized for reasoning; at 4B parameters it balances reasoning ability against runtime efficiency on modest single-GPU hardware (e.g., an NVIDIA A30 with 24 GB of VRAM). The Thinking variant strengthens long-chain reasoning and shows its intermediate steps inside <think>...</think> tags.

Environment Adaptation Strategy

  • vLLM Path: For CUDA 13+ environments; supports high-throughput batch inference and enables self-consistency voting with N=8
  • Transformers Path: For older environments such as CUDA 12.8; uses HuggingFace Transformers' model.generate() for chunked batch generation

The dual-path design keeps the pipeline running stably across environments, from local workstations to cloud A100 instances.
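
The backend choice described above could be sketched as follows. This is a minimal illustration, not the project's code: the helper name `choose_backend` is hypothetical, and the CUDA version string is assumed to be supplied by the caller (for example from `torch.version.cuda`).

```python
def choose_backend(cuda_version: str) -> str:
    """Return "vllm" for CUDA 13 or newer, otherwise "transformers".

    Hypothetical helper: the version string (e.g. "12.8") is assumed to be
    provided by the caller; only the major version is inspected.
    """
    major = int(cuda_version.split(".")[0])
    return "vllm" if major >= 13 else "transformers"


if __name__ == "__main__":
    for version in ("12.8", "13.0"):
        print(version, "->", choose_backend(version))
```

Keeping the decision in one small function means the rest of the pipeline only needs to know which backend name it received.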

Section 04

Adaptive Two-Stage Reasoning Mechanism

Stage 1: Fast Initial Screening

Configuration parameters:

  • Thinking budget: 1024 tokens (Transformers path) / 4096 tokens (vLLM path)
  • Maximum output length: 4096 tokens (Transformers) / 6144 tokens (vLLM)
  • Sampling temperature: 0.6
  • Sampling count: N=1

The goal is to generate an initial answer quickly and to use uncertainty signals to flag difficult problems for a second pass.
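
A minimal sketch of the Stage 1 settings as a small config object, together with one possible uncertainty check. The numeric values come from the text; the names `StageConfig`, `STAGE1`, and `is_uncertain` are hypothetical, and the two signals used here (no `\boxed{}` answer, or output cut off at the length limit) are assumptions, since the text does not say which signals the project uses.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    thinking_budget: int   # max tokens allowed for the reasoning trace
    max_new_tokens: int    # max total output length
    temperature: float
    n_samples: int


# Stage 1 values quoted in the text, keyed by backend.
STAGE1 = {
    "transformers": StageConfig(1024, 4096, 0.6, 1),
    "vllm": StageConfig(4096, 6144, 0.6, 1),
}


def is_uncertain(text: str, n_generated: int, cfg: StageConfig) -> bool:
    """Assumed uncertainty signal: missing boxed answer or truncated output."""
    has_answer = re.search(r"\\boxed\{", text) is not None
    truncated = n_generated >= cfg.max_new_tokens
    return (not has_answer) or truncated
```

Problems for which `is_uncertain` returns True would then be queued for the deeper second stage.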

Stage 2: Deep Retry and Self-Consistency

Enhanced configuration for uncertain problems:

  • Thinking budget: 4096 tokens (Transformers) / 8192 tokens (vLLM)
  • Maximum output length: 5120 tokens (Transformers) / 6144 tokens (vLLM)
  • Sampling temperature: 0.65
  • Repetition penalty: 1.05
  • Sampling count: N=3 (Transformers) / N=8 (vLLM)

A majority vote over the sampled answers selects the most frequent one to improve accuracy.
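
The majority-voting step can be illustrated with a short sketch. The function name `majority_vote` is hypothetical, and answers are compared as plain strings after trimming whitespace; the project may normalize answers differently before voting.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-empty answer, or None if there is none.

    Ties are resolved in favor of the answer that was seen first, which is
    the ordering Counter.most_common uses for equal counts.
    """
    valid = [a.strip() for a in answers if a and a.strip()]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


if __name__ == "__main__":
    samples = ["12", "7", "12", None, "12 "]
    print(majority_vote(samples))
```

Note that string comparison treats "0.5" and "1/2" as different votes, so a real pipeline would likely canonicalize answers before counting them.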

Chunked Batch Processing and Checkpoints

The Transformers path processes problems in chunks of CHUNK_SIZE=6 to balance memory use and throughput. A fine-grained checkpoint mechanism writes results to checkpoint.jsonl so that an interrupted run can resume where it stopped.
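
A minimal sketch of chunked processing with a JSONL checkpoint, assuming each problem is a dict with an `id` field. The record layout and the names `load_done_ids`, `run_with_checkpoint`, and `solve_chunk` are hypothetical; only CHUNK_SIZE=6 and the checkpoint.jsonl file name come from the text.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List, Set

CHUNK_SIZE = 6  # value quoted in the text


def load_done_ids(path: Path) -> Set:
    """Collect the ids of problems already written to the checkpoint file."""
    if not path.exists():
        return set()
    with path.open(encoding="utf-8") as f:
        return {json.loads(line)["id"] for line in f if line.strip()}


def run_with_checkpoint(
    problems: List[Dict],
    solve_chunk: Callable[[List[Dict]], List[str]],
    path: Path,
) -> None:
    """Solve unsolved problems chunk by chunk, appending results as JSONL."""
    done = load_done_ids(path)
    todo = [p for p in problems if p["id"] not in done]
    for start in range(0, len(todo), CHUNK_SIZE):
        chunk = todo[start:start + CHUNK_SIZE]
        answers = solve_chunk(chunk)
        # Append after every chunk so a crash loses at most one chunk of work.
        with path.open("a", encoding="utf-8") as f:
            for problem, answer in zip(chunk, answers):
                f.write(json.dumps({"id": problem["id"], "answer": answer}) + "\n")
```

Calling `run_with_checkpoint` again with the same path skips every problem whose id is already in the file, which is what makes resuming cheap.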


Section 05

Training Optimization: QLoRA and GRPO Reinforcement Learning

QLoRA Supervised Fine-Tuning

  • Quantization config: 4-bit NF4 quantization + double quantization
  • Sequence length: 4096 tokens (A30) / 8192 tokens (A100)
  • Learning rate: 5e-5
  • Training data: 15,000 high-quality questions from the NuminaMath dataset
  • Training epochs: 2
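
The settings above can be gathered into one plain configuration object, sketched below. The function name `qlora_settings` and the dictionary layout are hypothetical. The three keys under `quantization` mirror keyword arguments of `transformers.BitsAndBytesConfig`; LoRA-specific values such as rank and target modules are not given in the text and are deliberately left out.

```python
def qlora_settings(gpu: str) -> dict:
    """Return the QLoRA fine-tuning settings listed in the text for a GPU type.

    `gpu` must be "A30" or "A100"; any other value raises KeyError.
    """
    max_seq_len = {"A30": 4096, "A100": 8192}[gpu]
    return {
        "quantization": {
            "load_in_4bit": True,               # 4-bit weights
            "bnb_4bit_quant_type": "nf4",       # NF4 quantization
            "bnb_4bit_use_double_quant": True,  # double quantization
        },
        "max_seq_length": max_seq_len,
        "learning_rate": 5e-5,
        "num_train_epochs": 2,
        "num_train_examples": 15_000,  # NuminaMath subset size from the text
    }


if __name__ == "__main__":
    print(qlora_settings("A30"))
```

With the libraries installed, the `quantization` entry could be passed as `BitsAndBytesConfig(**settings["quantization"])`.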

GRPO Reinforcement Learning Optimization

  • Group size G: 4 (A30) / 8 (A100)
  • Maximum generation length: 2048 tokens
  • Learning rate: 5e-7
  • KL divergence coefficient Beta: 0.1

GRPO scores each sampled response relative to the other responses in its group, so it does not require an additional value network. This makes training more stable and efficient and guides the model toward reliable reasoning strategies.
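
To show why no value network is needed, here is a minimal sketch of the group-relative advantage as GRPO is commonly described: each reward is centered on the group mean and scaled by the group standard deviation. The function name is hypothetical, and whether the project applies the standard-deviation scaling is an assumption, since variants differ on that point.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize the rewards of one group of G sampled responses.

    The group mean acts as the baseline, so no learned critic is required.
    `eps` guards against division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # G=4 responses to one problem: 1.0 for a correct answer, 0.0 otherwise.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every response in a group earns the same reward, all advantages are zero and that group contributes no policy gradient.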

Section 06

Answer Extraction and Scoring Mechanism

Supports two answer formats:

  1. Free-form answers: The final answer is wrapped in \boxed{...}, with support for multiple answer slots ([ANS])
  2. Multiple-choice answers: The option letter (A/B/C/D/E) is output directly

The scoring module judger.py implements numerical tolerance checks (to handle floating-point precision) and division-by-zero protection, so that mathematically equivalent answers are recognized as correct.
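
The extraction and tolerance ideas can be sketched as below. This is not the contents of judger.py: the function names, the fraction handling, and the tolerance values are all assumptions chosen for illustration.

```python
import math
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, handling nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    out = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out).strip()
        out.append(ch)
    return None  # braces never closed


def parse_number(s: str) -> Optional[float]:
    """Parse a decimal or a simple fraction "a/b"; None if not numeric."""
    s = s.strip()
    try:
        if "/" in s:
            num, den = s.split("/", 1)
            den_val = float(den)
            if den_val == 0:
                return None  # division-by-zero protection
            return float(num) / den_val
        return float(s)
    except ValueError:
        return None


def answers_match(pred: str, gold: str, rel_tol: float = 1e-6) -> bool:
    """Compare numerically within a tolerance, else fall back to exact text."""
    a, b = parse_number(pred), parse_number(gold)
    if a is not None and b is not None:
        return math.isclose(a, b, rel_tol=rel_tol, abs_tol=1e-9)
    return pred.strip() == gold.strip()
```

The exact-text fallback also covers multiple-choice letters, since "B" only matches "B".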

Section 07

Performance Expectations and Experimental Results

Expected accuracy under different configurations:

| Configuration Stage | Expected Accuracy |
|---|---|
| Baseline (Bug Fixes Only) | 47-52% |
| + QLoRA Fine-Tuning | ≥42% (Baseline Retained) |
| + GRPO Reinforcement Learning (G=8) | 60-75% |
| Best Case (POLARIS Equivalent Configuration) | Up to 79% |

These estimates are based on recent research results such as DAPO, Dr. GRPO, and POLARIS-4B, which demonstrate the potential of small reasoning models in specific domains.

Section 08

Technical Highlights and Project Insights

Technical Highlights

  1. Adaptive compute allocation: Reasoning resources are adjusted dynamically based on problem difficulty
  2. Self-consistency mechanism: Majority voting improves reliability on complex problems
  3. Engineering robustness: Fine-grained checkpoints, environment adaptation, and a dual-path backend ensure stable operation
  4. Progressive optimization: The pipeline advances in layers, from baseline fixes to supervised fine-tuning to reinforcement learning

Insights

math-qa-llm makes the case that small 4B-parameter models can perform strongly on complex mathematical reasoning tasks when combined with well-designed reasoning strategies, adaptive compute allocation, and reinforcement learning optimization. The 'small model + strong strategy' paradigm may become an important direction for domain-specific applications.