# math-qa-llm: A Math Problem Solving Pipeline Based on Qwen3-4B-Thinking

> A large language model reasoning system for math competition scenarios, supporting both free-form answers and multiple-choice questions. It adopts an adaptive two-stage reasoning strategy and a self-consistency voting mechanism to achieve efficient and accurate math problem solving on public datasets.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-05-25T21:36:35.000Z
- Last activity: 2026-05-25T21:49:54.501Z
- Popularity: 173.8
- Keywords: math-qa-llm, Qwen3-4B-Thinking, mathematical reasoning, large language models, adaptive reasoning, self-consistency, QLoRA, GRPO, reinforcement learning, multi-stage reasoning, majority voting, CSE 151B, math problem solving, inference optimization, model fine-tuning
- Page link: https://www.zingnex.cn/en/forum/thread/math-qa-llm-qwen3-4b-thinking-1e15ae6c
- Canonical: https://www.zingnex.cn/forum/thread/math-qa-llm-qwen3-4b-thinking-1e15ae6c
- Markdown source: floors_fallback

---

## math-qa-llm Project Introduction: A Math Problem Solving Pipeline Based on Qwen3-4B-Thinking

math-qa-llm solves competition-style math problems with a large language model, handling both free-form answers and multiple-choice questions. Its pipeline combines an adaptive two-stage reasoning strategy with self-consistency voting. The project is maintained by sardorsob on GitHub (https://github.com/sardorsob/math-qa-llm), and the repository was last updated on 2026-05-25T21:36:35Z.

## Project Background and Motivation

Math problem solving is a core benchmark for evaluating LLM reasoning: it requires precise symbolic computation, multi-step logical deduction, and strict answer formatting. Plain end-to-end fine-tuning struggles to capture this complexity, especially on competition-level problems that demand deep thinking, self-verification, and error correction. math-qa-llm was built for the CSE 151B course competition. It implements the complete workflow from data loading to result submission and adds an adaptive multi-stage reasoning strategy to improve answer quality.

## Core Architecture and Technology Selection

### Base Model Selection
Qwen/Qwen3-4B-Thinking-2507 is chosen as the base model. It is a small, reasoning-optimized LLM whose 4B parameters balance reasoning ability against runtime cost on a single 24GB GPU such as the A30. The Thinking variant strengthens long chain-of-thought reasoning and exposes its intermediate steps inside `<think>...</think>` tags.
### Environment Adaptation Strategy
- vLLM path: for CUDA 13+ environments. It supports high-throughput batch inference and makes self-consistency voting with N=8 practical.
- Transformers path: for older environments such as CUDA 12.8. It uses HuggingFace Transformers' `model.generate()` to generate in chunked batches.

The dual-path design keeps the pipeline running across environments, from local workstations to cloud A100 instances.
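
A minimal sketch of how the backend choice could be made at startup. The source describes the choice in terms of CUDA version; this example uses "is the `vllm` package importable" as a stand-in, which is an assumption, and the `pick_backend` name is hypothetical. The Stage 2 sample counts come from the configuration described in this post.

```python
import importlib.util
from typing import Callable

# Stage 2 sample counts per backend, as stated in the post.
STAGE2_SAMPLES = {"vllm": 8, "transformers": 3}


def pick_backend(find_spec: Callable = importlib.util.find_spec) -> str:
    """Return "vllm" if the vllm package can be found, else "transformers".

    Assumption: package availability is used as a proxy for a compatible
    CUDA environment. find_spec is injectable so the choice can be tested.
    """
    return "vllm" if find_spec("vllm") is not None else "transformers"


if __name__ == "__main__":
    backend = pick_backend()
    print(backend, STAGE2_SAMPLES[backend])
```

Checking with `find_spec` avoids importing the heavy package just to discover that it is missing.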

## Adaptive Two-Stage Reasoning Mechanism

### Stage 1: Fast Initial Screening
Configuration parameters: thinking budget (1024 tokens on the Transformers path / 4096 tokens on the vLLM path), maximum output length (4096 tokens for Transformers / 6144 tokens for vLLM), sampling temperature 0.6, sampling count N=1. The goal is to produce an initial answer quickly and to use uncertainty signals to flag difficult problems for Stage 2.
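
The Stage 1 settings above can be captured in a small config object, sketched below. The post does not say which uncertainty signals are used, so the two checks in `needs_retry` (no extractable answer, or generation hitting the output limit) are assumptions for illustration; `StageConfig` and `needs_retry` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StageConfig:
    thinking_budget: int   # tokens allowed for the reasoning trace
    max_new_tokens: int    # maximum output length
    temperature: float
    n_samples: int


# Stage 1 values as given in the post.
STAGE1_TRANSFORMERS = StageConfig(1024, 4096, 0.6, 1)
STAGE1_VLLM = StageConfig(4096, 6144, 0.6, 1)


def needs_retry(answer: Optional[str], tokens_generated: int,
                cfg: StageConfig) -> bool:
    """Flag a problem for Stage 2 (assumed signals, not the project's own)."""
    if answer is None or not answer.strip():
        return True  # nothing could be extracted from the output
    # Hitting the output limit suggests the reasoning was cut off.
    return tokens_generated >= cfg.max_new_tokens
```
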
### Stage 2: Deep Retry and Self-Consistency
Enhanced configuration for uncertain problems: thinking budget (4096 tokens for Transformers / 8192 tokens for vLLM), maximum output length (5120 tokens for Transformers / 6144 tokens for vLLM), sampling temperature 0.65, repetition penalty 1.05, sampling count N=3 (Transformers) / N=8 (vLLM). A majority vote over the sampled answers selects the most frequent one to improve accuracy.
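
Majority voting over the N sampled answers can be sketched as follows. This is an illustrative version, not the project's code: answers are compared as stripped strings, so mathematically equal but differently written answers (for example `0.5` and `1/2`) count as different votes unless they are normalized first. The `majority_vote` name is hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-empty answer, or None if there is none."""
    valid = [a.strip() for a in answers if a is not None and a.strip()]
    if not valid:
        return None
    # most_common orders equal counts by first occurrence, so ties go to
    # the answer that was sampled first.
    return Counter(valid).most_common(1)[0][0]


if __name__ == "__main__":
    print(majority_vote(["12", "12 ", "7", None]))
```
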
### Chunked Batch Processing and Checkpoints
The Transformers path uses `CHUNK_SIZE=6` to balance memory use against throughput. A fine-grained checkpoint mechanism appends results to `checkpoint.jsonl`, so an interrupted run can resume where it stopped.
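
A minimal sketch of chunked processing with resumable checkpoints, under stated assumptions: each problem is a dict with an `id` key, each checkpoint line is a JSON object with `id` and `answer`, and `solve_chunk` is a hypothetical callable standing in for the model call. Only `CHUNK_SIZE=6` and the `checkpoint.jsonl` file name come from the post.

```python
import json
import os
from typing import Callable, Dict, List

CHUNK_SIZE = 6


def load_done_ids(path: str) -> set:
    """Collect the ids already recorded in the checkpoint file."""
    done = set()
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                try:
                    done.add(json.loads(line)["id"])
                except (json.JSONDecodeError, KeyError, TypeError):
                    continue  # skip blank or partially written lines
    return done


def run_with_checkpoints(problems: List[Dict],
                         solve_chunk: Callable[[List[Dict]], List[str]],
                         path: str = "checkpoint.jsonl",
                         chunk_size: int = CHUNK_SIZE) -> int:
    """Solve unfinished problems chunk by chunk; return how many were solved."""
    done = load_done_ids(path)
    pending = [p for p in problems if p["id"] not in done]
    processed = 0
    with open(path, "a", encoding="utf-8") as f:
        for start in range(0, len(pending), chunk_size):
            chunk = pending[start:start + chunk_size]
            answers = solve_chunk(chunk)
            for problem, answer in zip(chunk, answers):
                f.write(json.dumps({"id": problem["id"], "answer": answer}) + "\n")
            f.flush()  # persist each chunk before starting the next one
            processed += len(chunk)
    return processed
```

Appending one JSON object per line means a crash loses at most the chunk in flight, and a rerun skips every id already on disk.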

## Training Optimization: QLoRA and GRPO Reinforcement Learning

### QLoRA Supervised Fine-Tuning
- Quantization config: 4-bit NF4 quantization + double quantization
- Sequence length: 4096 tokens (A30) / 8192 tokens (A100)
- Learning rate: 5e-5
- Training data: 15,000 high-quality questions from the NuminaMath dataset
- Training epochs: 2
### GRPO Reinforcement Learning Optimization
- Group size G: 4 (A30) / 8 (A100)
- Maximum generation length: 2048 tokens
- Learning rate: 5e-7
- KL divergence coefficient β: 0.1

GRPO estimates advantages from the relative rewards of the G completions sampled for each prompt, so it does not need a separate value network. This lowers training cost and steers the model toward reasoning strategies that reliably earn reward.
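
To illustrate why no value network is needed, the sketch below computes group-relative advantages for one prompt: each completion's reward is normalized by the mean and standard deviation of its own group. This is a simplified illustration of the idea, not the project's training code; the `eps` term and the use of the population standard deviation are assumptions, and the function name is hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of G sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # G=4 completions for one prompt, reward 1.0 for a correct final answer.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every completion in a group receives the same reward, all advantages are zero and that prompt contributes no gradient signal.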

## Answer Extraction and Scoring Mechanism

The pipeline supports two answer formats:
1. Free-form answers: the final answer is wrapped in `\boxed{...}`, with support for multiple answer slots (`[ANS]`)
2. Multiple-choice answers: the model outputs the option letter directly (A/B/C/D/E)

The scoring module `judger.py` compares numeric answers within a tolerance (to absorb floating-point error) and guards against division by zero, so that mathematically equivalent answers are recognized as correct.
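
The extraction and tolerance-based comparison described above might look like the sketch below. It is not the contents of `judger.py`: the function names, the tolerance values, and the "last capital letter A-E" heuristic for multiple choice are all assumptions made for illustration.

```python
import math
import re
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, allowing nested braces."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    depth, out = 1, []
    for ch in text[start + len("\\boxed{"):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out).strip()
        out.append(ch)
    return None  # braces never balanced


def extract_choice(text: str) -> Optional[str]:
    """Return the last standalone option letter A-E (assumed heuristic)."""
    matches = re.findall(r"\b([A-E])\b", text)
    return matches[-1] if matches else None


def to_float(s: str) -> Optional[float]:
    """Parse a decimal or a simple a/b fraction; None if not numeric."""
    s = s.strip().replace(",", "")
    try:
        if "/" in s:
            num, den = s.split("/", 1)
            den_val = float(den)
            if den_val == 0.0:
                return None  # division-by-zero guard
            return float(num) / den_val
        return float(s)
    except ValueError:
        return None


def answers_match(pred: str, gold: str,
                  rel_tol: float = 1e-6, abs_tol: float = 1e-9) -> bool:
    """Compare numerically within a tolerance, else by exact string."""
    p, g = to_float(pred), to_float(gold)
    if p is None or g is None:
        return pred.strip() == gold.strip()
    return math.isclose(p, g, rel_tol=rel_tol, abs_tol=abs_tol)
```

Taking the last `\boxed{...}` rather than the first reflects that a reasoning trace may box intermediate values before settling on a final answer.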

## Performance Expectations and Experimental Results

Expected accuracy under different configurations:

| Configuration Stage | Expected Accuracy |
|---------|-----------|
| Baseline (Bug Fixes Only) | 47-52% |
| + QLoRA Fine-Tuning | ≥42% (Baseline Retained) |
| + GRPO Reinforcement Learning (G=8) | 60-75% |
| Best Case (POLARIS Equivalent Configuration) | Up to 79% |

These figures are estimates informed by recent work such as DAPO, Dr. GRPO, and POLARIS-4B, and they indicate the potential of small reasoning models in specific domains.

## Technical Highlights and Project Insights

### Technical Highlights
1. Adaptive compute allocation: reasoning resources are adjusted dynamically based on problem difficulty
2. Self-consistency mechanism: majority voting improves reliability on complex problems
3. Engineering robustness: fine-grained checkpoints, environment adaptation, and a dual-path backend keep runs stable
4. Progressive optimization: the work is layered, from baseline fixes to supervised fine-tuning to reinforcement learning
### Insights
math-qa-llm suggests that a small 4B-parameter model can perform well on complex mathematical reasoning tasks when it is paired with well-designed reasoning strategies, adaptive compute allocation, and reinforcement learning optimization. The 'small model + strong strategy' paradigm may become an important direction for domain-specific applications.
