
FinReason: Enhancing Small Model Financial Numerical Reasoning via Verifiable Reward Reinforcement Learning

FinReason combines Supervised Fine-Tuning (SFT) with the GRPO reinforcement learning algorithm, using verifiable numerical correctness as the reward signal, to train the small Qwen2.5-1.5B model to answer numerical questions about financial statements accurately.

FinReason, financial numerical reasoning, reinforcement learning, GRPO, verifiable reward, Qwen2.5, small language model, FinQA

Published 2026-04-05 14:37 | Recent activity 2026-04-05 14:49 | Estimated read 8 min

Section 01

FinReason Project Overview: Boosting Small Model Financial Numerical Reasoning with Verifiable Reward RL

FinReason trains the Qwen2.5-1.5B small model (1.5B parameters) to answer numerical questions about financial statements accurately. It uses a two-stage approach: Supervised Fine-Tuning (SFT) followed by Group Relative Policy Optimization (GRPO) reinforcement learning, with verifiable numerical correctness as the reward signal. The project targets two weaknesses of large models, hallucination and high deployment cost, by getting a small model to perform at a professional level on specific financial tasks while remaining runnable on resource-constrained hardware.

Section 02

Project Background & Research Motivation

Large language models (LLMs) such as GPT-4 face two core challenges in financial numerical reasoning: hallucination and errors in precise calculation. They are also expensive to deploy, which makes them a poor fit for resource-limited settings. The FinReason project, by Florida University's OmSPatel20 team, explores whether small language models (SLMs) can approach large-model performance in a specific domain through advanced training techniques.

Section 03

Two-Stage Training Architecture

FinReason uses a two-stage training pipeline:

Supervised Fine-Tuning (SFT): Fine-tunes on the FinQA dataset (a financial QA benchmark built from questions about real financial statements) using QLoRA (4-bit quantization), which keeps memory usage low. This stage teaches the model financial language patterns and the basic format of a numerical reasoning answer.
GRPO Reinforcement Learning: Applies the GRPO algorithm (introduced in the DeepSeekMath paper and later used to train DeepSeek-R1) with a simple but effective reward: whether the numerical value of the answer is correct. Because this reward can be checked automatically, it avoids the cost of the human preference data that traditional RLHF requires. GRPO samples a group of candidate answers for each question and updates the policy according to each answer's quality relative to the rest of its group, which pairs naturally with a correct/incorrect numerical check.
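The reward check and the group-relative comparison described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the project's code: the function names, the "take the last number in the text" parsing rule, and the 1% relative tolerance are all assumptions made for the example.

```python
import re
import statistics


def extract_number(text):
    """Return the last number found in text as a float, or None if there is none."""
    # Assumption: strip thousands separators, currency and percent signs first.
    cleaned = text.replace(",", "").replace("$", "").replace("%", "")
    matches = re.findall(r"-?\d+(?:\.\d+)?", cleaned)
    return float(matches[-1]) if matches else None


def numeric_reward(completion, gold, rel_tol=1e-2):
    """Return 1.0 if the predicted number is within rel_tol of gold, else 0.0."""
    pred = extract_number(completion)
    if pred is None:
        return 0.0  # unparsable output earns no reward
    tolerance = rel_tol * max(abs(gold), 1e-8)
    return 1.0 if abs(pred - gold) <= tolerance else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of completions for the same question."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one question whose reference answer is 12.5.
group = ["The margin is 12.5%.", "About 13.0", "I am not sure.", "Answer: 12.5"]
rewards = [numeric_reward(c, 12.5) for c in group]
advantages = group_relative_advantages(rewards)
```

Correct answers receive a positive advantage and incorrect ones a negative advantage. When every completion in a group earns the same reward, all advantages are zero and that group contributes no learning signal.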

Section 04

Hardware & Technical Implementation Details

Hardware Compatibility: Runs on consumer-grade hardware: RTX 4060 (8GB, batch=1), Google Colab free tier (T4 16GB, batch=2), and Colab Pro (A100 40GB, enough to try Qwen2.5-3B).
Modular Scripts: Provides a full pipeline of independent scripts (environment check, data exploration, zero-shot baseline, data formatting, SFT/GRPO training, evaluation, analysis).
Streamlit Demo: Includes an interactive app for users to input financial questions and view the model's reasoning process and answers.
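As a rough illustration of the hardware tiers listed above, the sketch below maps available GPU memory to a starting configuration. The `Preset` class and `pick_preset` function are hypothetical names, not part of the project's scripts, and the batch size for the A100 tier is an assumption because the text does not give one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    hardware: str
    vram_gb: int
    model: str
    batch_size: int


# Values taken from the hardware list above, except where noted.
PRESETS = [
    Preset("RTX 4060", 8, "Qwen2.5-1.5B", 1),
    Preset("Colab free T4", 16, "Qwen2.5-1.5B", 2),
    Preset("Colab Pro A100", 40, "Qwen2.5-3B", 2),  # batch size is an assumption
]


def pick_preset(available_vram_gb):
    """Return the largest listed preset that fits in the available GPU memory."""
    fitting = [p for p in PRESETS if p.vram_gb <= available_vram_gb]
    if not fitting:
        raise ValueError("less GPU memory than the smallest listed configuration")
    return max(fitting, key=lambda p: p.vram_gb)


preset = pick_preset(24)  # a 24GB card falls back to the 16GB tier
```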

Section 05

Training Strategies & Practical Tuning Tips

Zero-shot Baseline Check: Establish a zero-shot baseline before formal training. If accuracy is below 2%, switch to a larger model (e.g., Qwen2.5-3B) rather than spend resources training a model that starts too weak.
Memory Optimization: If you hit out-of-memory (OOM) errors, reduce MAX_SEQ_LEN to 512 for SFT; reduce NUM_GENERATIONS to 2 and MAX_NEW_TOKENS to 128 for GRPO; and use the Unsloth library (saves ~30% memory, with automatic fallback to PEFT if installation fails).
Reward Debugging: If the GRPO reward is always zero, extend SFT training first so that the model reliably produces answers in a parsable format.
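The checks above can be turned into a small pre-flight routine. The sketch below is only an illustration under stated assumptions: the `Answer: <number>` output format, the helper names, and the 50% parse-rate threshold are hypothetical, while the 2% accuracy threshold and the OOM fallback values come from the tips above.

```python
import re

# Assumption: the fine-tuned model is expected to end with "Answer: <number>".
ANSWER_PATTERN = re.compile(r"answer\s*[:=]\s*(-?\d+(?:\.\d+)?)", re.IGNORECASE)

# Fallback values quoted in the memory optimization tip.
OOM_FALLBACK = {"MAX_SEQ_LEN": 512, "NUM_GENERATIONS": 2, "MAX_NEW_TOKENS": 128}


def parse_answer(completion):
    """Return the stated answer as a float, or None if the format is not followed."""
    match = ANSWER_PATTERN.search(completion.replace(",", ""))
    return float(match.group(1)) if match else None


def parse_rate(completions):
    """Fraction of completions that contain a parsable answer."""
    if not completions:
        return 0.0
    parsed = sum(1 for c in completions if parse_answer(c) is not None)
    return parsed / len(completions)


def recommend_next_step(zero_shot_accuracy, sft_parse_rate,
                        min_accuracy=0.02, min_parse_rate=0.5):
    """Apply the baseline and reward-debugging checks in order."""
    if zero_shot_accuracy < min_accuracy:
        return "switch to a larger base model"
    if sft_parse_rate < min_parse_rate:
        return "extend SFT before GRPO"
    return "proceed to GRPO"


samples = ["Answer: 1,234.5", "I think revenue rose.", "answer = -3"]
rate = parse_rate(samples)
step = recommend_next_step(zero_shot_accuracy=0.05, sft_parse_rate=rate)
```

Measuring the parse rate on a handful of SFT outputs before launching GRPO is a cheap way to tell whether a zero reward reflects wrong answers or answers the reward function cannot read at all.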

Section 06

Practical Application Value & Significance

Domain Specialization Path: Shows that SLMs can reach a practically useful level in vertical domains through targeted post-training, offering a feasible alternative to general-purpose LLMs for AI applications in finance, law, and medicine.
Verifiable Reward Paradigm: The numerical correctness reward mechanism can be extended to tasks with objective criteria (code execution, math solving, logical reasoning).
Open Source Contribution: Built on open-source tools (Qwen2.5, TRL, PEFT) and open-sources all training code and data processing workflows, promoting knowledge sharing.

Section 07

Limitations & Future Directions

Limitations:

Dataset scope: Only FinQA is used, so the range of financial scenarios covered is limited.
Model scale: The main experiments use the 1.5B model; the potential of larger models is not fully explored.
Generalization: Performance on out-of-distribution financial documents still needs to be verified.

Future Directions: Integrate more financial data sources; explore multi-modal capabilities (table/chart understanding); extend the method to other precise numerical reasoning domains.

Section 08

Summary & Key Takeaways

FinReason demonstrates how a well-designed training strategy can make a small model genuinely useful on a specific task. Key insights:

Model capability depends not only on parameter count but also on training method design and domain data utilization.
Provides a validated blueprint for resource-constrained AI deployment.
Highlights the importance of open-source collaboration for technological progress.
