Zing Forum

FinReason: Improving Small-Model Financial Numerical Reasoning via Reinforcement Learning with Verifiable Rewards

FinReason is a project that combines supervised fine-tuning (SFT) with the GRPO reinforcement learning algorithm, using verifiable numerical correctness as the reward signal, to train the small Qwen2.5-1.5B model to accurately answer numerical questions about financial statements.

Tags: FinReason · Financial Numerical Reasoning · Reinforcement Learning · GRPO · Verifiable Rewards · Qwen2.5 · Small Language Models · FinQA
Published 2026/04/05 14:37 · Last activity 2026/04/05 14:49 · Estimated reading time: 7 minutes

Section 01

FinReason Project Overview: Boosting Small Model Financial Numerical Reasoning with Verifiable Reward RL

FinReason is an innovative project that trains the Qwen2.5-1.5B small model (1.5B parameters) to accurately answer financial statement numerical questions. It uses a two-stage approach: Supervised Fine-Tuning (SFT) combined with Group Relative Policy Optimization (GRPO) reinforcement learning, with verifiable numerical correctness as the reward signal. The project aims to address the challenges of large models (hallucinations, high deployment cost) by enabling small models to achieve professional-level performance in specific financial tasks, while being hardware-friendly for resource-constrained environments.

Section 02

Project Background & Research Motivation

Large language models (LLMs) like GPT-4 face core challenges in financial numerical reasoning: hallucinations and errors in precise calculations. Moreover, LLMs have high deployment costs, making them unsuitable for resource-limited settings. The FinReason project by Florida University's OmSPatel20 team explores whether small language models (SLMs) can reach near-large-model performance in specific domains via advanced training techniques.

Section 03

Two-Stage Training Architecture

FinReason uses a two-stage training pipeline:

  1. Supervised Fine-Tuning (SFT): Uses the FinQA dataset (financial QA benchmark with real financial statement questions) and QLoRA (4-bit quantization) for efficient fine-tuning, reducing memory usage. This stage helps the model learn financial language patterns and basic numerical reasoning formats.
  2. GRPO Reinforcement Learning: Adopts the GRPO algorithm (from DeepSeek-R1 paper) with a simple yet effective reward function: whether the answer's numerical value is correct. This verifiable reward avoids the high cost of manual preference data in traditional RLHF. GRPO compares relative quality of candidate answers within a group to update the policy, which fits naturally with numerical correctness verification.
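The verifiable reward at the heart of stage 2 can be sketched as a small Python function. This is a minimal sketch under assumptions: the helper name `parse_number`, the extraction regex, and the relative tolerance are illustrative choices, not the project's actual implementation.

```python
import re

def parse_number(text: str):
    """Extract the last numeric value mentioned in a model answer.

    Handles thousands separators, leading $, and trailing %; returns
    None when nothing parses. (Hypothetical helper -- the project's
    real parsing logic may differ.)
    """
    matches = re.findall(r"-?\$?[\d,]*\.?\d+%?", text)
    if not matches:
        return None
    token = matches[-1].replace("$", "").replace(",", "").rstrip("%")
    try:
        return float(token)
    except ValueError:
        return None

def numeric_reward(completion: str, gold_answer: str, rel_tol: float = 1e-2) -> float:
    """Binary verifiable reward: 1.0 if the numeric value matches, else 0.0."""
    pred = parse_number(completion)
    gold = parse_number(gold_answer)
    if pred is None or gold is None:
        return 0.0  # unparsable answers earn zero reward
    denom = max(abs(gold), 1e-8)  # guard against division by zero
    return 1.0 if abs(pred - gold) / denom <= rel_tol else 0.0
```

A function of this shape could be adapted as a reward function for TRL's `GRPOTrainer`, scoring each sampled completion in a group so that GRPO's within-group comparison operates directly on numerical correctness.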

Section 04

Hardware & Technical Implementation Details

  • Hardware Compatibility: Runs on consumer-grade hardware: RTX 4060 (8 GB VRAM, batch size 1), free Google Colab (T4, 16 GB, batch size 2), and Colab Pro (A100, 40 GB, which can also try Qwen2.5-3B).
  • Modular Scripts: Provides a full pipeline of independent scripts (environment check, data exploration, zero-shot baseline, data formatting, SFT/GRPO training, evaluation, analysis).
  • Streamlit Demo: Includes an interactive app for users to input financial questions and view the model's reasoning process and answers.
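The hardware tiers above can be turned into a simple startup check, in the spirit of the project's environment-check script. This is a sketch: the memory thresholds and the A100 batch size are illustrative guesses; only the RTX 4060 and T4 batch sizes come from the list above.

```python
def recommend_config(gpu_mem_gib: float) -> dict:
    """Pick a model/batch-size tier from available GPU memory.

    Tiers mirror the hardware list above: ~8 GB (RTX 4060) -> batch 1,
    ~16 GB (T4) -> batch 2, ~40 GB (A100) -> can try Qwen2.5-3B.
    The A100 batch size is a guess; the project does not specify it.
    """
    if gpu_mem_gib >= 40:
        return {"model": "Qwen2.5-3B", "per_device_batch_size": 2}
    if gpu_mem_gib >= 16:
        return {"model": "Qwen2.5-1.5B", "per_device_batch_size": 2}
    return {"model": "Qwen2.5-1.5B", "per_device_batch_size": 1}

def detect_gpu_mem_gib() -> float:
    """Return total memory of GPU 0 in GiB, or 0.0 without CUDA/torch."""
    try:
        import torch
        if torch.cuda.is_available():
            return torch.cuda.get_device_properties(0).total_memory / 2**30
    except ImportError:
        pass
    return 0.0
```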

Section 05

Training Strategies & Practical Tuning Tips

  • Zero-shot Baseline Check: Establish a zero-shot baseline before formal training. If accuracy is below 2%, switch to a larger model (e.g., Qwen2.5-3B) to avoid wasting compute.
  • Memory Optimization: For out-of-memory (OOM) issues: reduce MAX_SEQ_LEN (SFT) to 512; reduce NUM_GENERATIONS (GRPO) to 2 and MAX_NEW_TOKENS to 128; use the Unsloth library (saves ~30% memory, with automatic fallback to PEFT if installation fails).
  • Reward Debugging: If GRPO reward is always zero, extend SFT training to ensure the model generates parsable answer formats first.
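The memory tips above can be captured in a small configuration sketch. The constant names follow the tips, and the Unsloth-to-PEFT fallback mirrors the described auto-fallback behavior; this is a sketch of the idea, not the project's actual script.

```python
# Reduced-memory settings from the tuning tips above (names follow the tips).
MAX_SEQ_LEN = 512       # SFT: cap sequence length to shrink activation memory
NUM_GENERATIONS = 2     # GRPO: fewer sampled completions per prompt
MAX_NEW_TOKENS = 128    # GRPO: cap completion length

def pick_finetuning_backend() -> str:
    """Prefer Unsloth (~30% memory savings); fall back to plain PEFT.

    Any failure while importing Unsloth (not installed, unsupported GPU)
    triggers the fallback, mirroring the auto-fallback described above.
    """
    try:
        from unsloth import FastLanguageModel  # noqa: F401
        return "unsloth"
    except Exception:
        return "peft"
```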

Section 06

Practical Application Value & Significance

  • Domain Specialization Path: Proves SLMs can reach practical levels in vertical domains via targeted post-training, offering a feasible path for AI applications in finance, law, and medicine (instead of relying on general-purpose LLMs).
  • Verifiable Reward Paradigm: The numerical correctness reward mechanism can be extended to tasks with objective criteria (code execution, math solving, logical reasoning).
  • Open Source Contribution: Built on open-source tools (Qwen2.5, TRL, PEFT) and open-sources all training code and data processing workflows, promoting knowledge sharing.

Section 07

Limitations & Future Directions

Limitations:

  • Dataset scope: Only uses FinQA, covering limited financial scenarios.
  • Model scale: Main experiments use 1.5B model; larger models' potential is not fully explored.
  • Generalization: Performance on out-of-distribution financial documents needs further verification.

Future Directions: Integrate more financial data sources; explore multi-modal capabilities (table/chart understanding); extend the method to other precise numerical reasoning domains.

Section 08

Summary & Key Takeaways

FinReason demonstrates how well-designed training strategies can enable small models to deliver substantial value on specific tasks. Core takeaways:

  • Model capability depends not only on parameter count but also on training method design and domain data utilization.
  • Provides a validated blueprint for resource-constrained AI deployment.
  • Highlights the importance of open-source collaboration for technological progress.