Math Reasoning Training Framework for Qwen-based Vision-Language Models

An open-source end-to-end training framework for fine-tuning Qwen vision-language models on math reasoning datasets, integrating LoRA parameter-efficient fine-tuning, chain-of-thought prompting, and custom evaluation metrics.

Tags: Qwen · Vision-Language Model · Math Reasoning · LoRA Fine-Tuning · Chain-of-Thought · Multimodal LLM · PEFT · HuggingFace
Published 2026-04-30 07:32 · Estimated read: 7 min

Section 01

Guide to the Math Reasoning Training Framework for Qwen-based Vision-Language Models

qwen-reasoning is an open-source end-to-end training framework focused on enhancing the math reasoning capabilities of Qwen vision-language models. Targeting math problems presented as images (e.g., handwritten formulas, geometric figures), the framework combines LoRA parameter-efficient fine-tuning, chain-of-thought prompting, and custom evaluation metrics into a complete fine-tuning workflow, addressing the challenges multimodal large models face in math reasoning tasks.


Section 02

Project Background and Core Objectives

Text-only large language models have demonstrated strong math-solving ability, but math problems presented as images (e.g., handwritten formulas, geometric figures, exam-paper screenshots) demand both visual understanding and logical reasoning. The core objective of this project is to provide a complete fine-tuning solution that enables a model to understand math problems in images and reason through them step by step.


Section 03

Core Technical Architecture and Methods

Model Construction and LoRA Configuration

LoRA fine-tuning is implemented with the PEFT library. The visual encoder's parameters are frozen, and LoRA adapters (rank 16, scaling factor 32) are injected only into the language model's attention projection layers (q_proj, k_proj, v_proj, o_proj), balancing memory efficiency, fast convergence, and deployment flexibility.
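The setup described above might look like the following sketch using PEFT. The dropout value and the `model.visual` attribute name are assumptions, not details stated in the article:

```python
from peft import LoraConfig, TaskType, get_peft_model

# LoRA on the language model's attention projections only; rank and
# scaling factor match the values stated in the article.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,           # assumed; not stated in the article
    task_type=TaskType.CAUSAL_LM,
)

# `model` is a previously loaded Qwen-VL checkpoint; the `visual`
# attribute name is an assumption about the model class.
for param in model.visual.parameters():
    param.requires_grad = False  # freeze the visual encoder

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Freezing the visual tower means only the small adapter matrices receive gradients, which is what keeps memory usage low.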

Dataset Processing and Chain-of-Thought Prompting

Using a dedicated data loader and a chain-of-thought prompting strategy, training samples with detailed reasoning traces are constructed (prompt template: User: [Image] Solve the math problem presented in the image. Think step-by-step. Assistant: [Reasoning process] Final Answer: [Answer]), which forces the model to learn explicit reasoning steps.
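A minimal sketch of assembling that template into a supervised sample; the `<image>` placeholder is illustrative and not the exact special token Qwen-VL uses:

```python
# Chain-of-thought prompt template from the article, with an
# illustrative image-token placeholder.
PROMPT_TEMPLATE = (
    "User: {image_token} Solve the math problem presented in the image. "
    "Think step-by-step.\n"
    "Assistant: {reasoning} Final Answer: {answer}"
)

def build_training_sample(reasoning: str, answer: str,
                          image_token: str = "<image>") -> str:
    """Assemble one supervised sample with an explicit reasoning trace."""
    return PROMPT_TEMPLATE.format(
        image_token=image_token, reasoning=reasoning, answer=answer
    )

sample = build_training_sample(
    reasoning="The triangle's angles sum to 180°, so x = 180 - 90 - 35 = 55.",
    answer="55",
)
print(sample)
```

Because the reasoning trace precedes the final answer in every sample, the model is trained to emit its working before committing to a result.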

Training Workflow

Built on the Hugging Face Trainer framework, the workflow uses mixed-precision training, gradient accumulation (8 steps), and a cosine-annealed learning-rate schedule (initial rate 2e-5) to balance training efficiency and performance.
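The hyperparameters above might be expressed as a `TrainingArguments` fragment like this; the batch size, epoch count, output path, and choice of bf16 over fp16 are assumptions:

```python
from transformers import TrainingArguments

# Values from the article: 8-step gradient accumulation, cosine
# schedule, initial LR 2e-5; everything else is assumed.
training_args = TrainingArguments(
    output_dir="./qwen-math-lora",   # illustrative path
    per_device_train_batch_size=2,   # assumed
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    bf16=True,                       # mixed precision (assumed bf16)
    num_train_epochs=3,              # assumed
    logging_steps=10,
    save_strategy="epoch",
)
```

With a per-device batch of 2 and 8 accumulation steps, the effective batch size per GPU is 16 while peak memory stays at the size of a batch of 2.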


Section 04

Innovative Evaluation Metric System

Reasoning Compliance Score

Checks whether the model output follows the format specification: the reasoning process wrapped in <think>...</think>, substantial reasoning content (≥10 characters), and properly closed tags. These checks account for 0.5 points in total.
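One way to implement this component is sketched below. The article only states the 0.5-point total, so the split across the three checks is an assumption:

```python
import re

def reasoning_compliance_score(output: str) -> float:
    """Format-compliance component (max 0.5). The per-check point
    split is assumed; the article specifies only the total."""
    score = 0.0
    if "<think>" in output:
        score += 0.2                       # reasoning tag present
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if match:
        score += 0.15                      # tag properly closed
        if len(match.group(1).strip()) >= 10:
            score += 0.15                  # substantial reasoning content
    return round(score, 2)
```

An unclosed tag or near-empty reasoning block earns partial credit rather than zero, which gives the training signal some gradation.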

Answer Correctness and Efficiency Bonus

After normalizing the answer (removing spaces, lowercasing, stripping LaTeX wrappers), an exact match earns 0.5 points; if the answer is correct and the output length is ≤600 characters, an additional 0.1-point efficiency bonus is awarded.
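A sketch of this component follows; the exact list of LaTeX wrappers stripped during normalization is an assumption:

```python
def normalize_answer(ans: str) -> str:
    """Normalize per the article: strip spaces, lowercase, remove
    LaTeX wrappers (the wrapper list here is an assumption)."""
    ans = ans.strip().lower().replace(" ", "")
    for wrapper in (r"\boxed{", "$", r"\(", r"\)"):
        ans = ans.replace(wrapper, "")
    return ans.rstrip("}")

def answer_score(predicted: str, gold: str, full_output: str) -> float:
    """Exact match earns 0.5; +0.1 bonus if correct and the full
    model output is at most 600 characters."""
    if normalize_answer(predicted) != normalize_answer(gold):
        return 0.0
    bonus = 0.1 if len(full_output) <= 600 else 0.0
    return round(0.5 + bonus, 2)
```

Normalizing both sides before comparison keeps superficial differences like `\boxed{55}` vs. `55` from being scored as wrong answers.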

Comprehensive Score

The weighted sum of all components (0–1.1 points) is monitored in real time during training, helping to surface format violations or overly verbose output early.


Section 05

Application Scenarios and Practical Value

  • Educational Assistance: Train AI teaching assistants that grade math homework, answer questions, understand handwritten or printed problems, and provide solution steps.
  • Academic Research: Provide reproducible baselines for VLM math reasoning research, supporting testing of different training strategies.
  • Enterprise Applications: Process math content in financial reports, engineering drawings, and scientific research literature.

Section 06

Technical Dependencies and Deployment Recommendations

Core Dependencies

transformers, torch, peft, streamlit, fastapi, etc.

Deployment Recommendations

Use high-memory GPUs (e.g., A100/H100) for the training phase; for inference, deploy to low-cost GPU/CPU environments via quantization, with FastAPI and Streamlit serving the application modules.
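A quantized inference load might look like the following sketch using bitsandbytes through transformers; the checkpoint ID, model class, and quantization settings are illustrative assumptions:

```python
import torch
from transformers import (AutoModelForVision2Seq, AutoProcessor,
                          BitsAndBytesConfig)

# 4-bit NF4 quantization cuts weight memory roughly 4x vs. fp16,
# enabling inference on low-cost GPUs. Checkpoint ID is illustrative.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForVision2Seq.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",       # illustrative checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
```

The quantized model can then sit behind a FastAPI endpoint or a Streamlit demo, matching the application modules listed among the dependencies.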


Section 07

Summary and Outlook

qwen-reasoning demonstrates a complete VLM fine-tuning workflow, with each stage reflecting engineering best practices, and its evaluation metrics offer a fresh approach to measuring reasoning ability. The project provides a solid starting point for strengthening VLM capabilities in specific domains (math, physics, etc.). As multimodal technology matures, dedicated training frameworks like this one will only grow in importance.