Study on Failure Modes of Small Language Models in Intelligent RAG Workflows

A systematic evaluation of four small language models (SLMs) on financial document reasoning tasks, revealing dominant failure modes such as numerical errors and hallucinations in intelligent RAG workflows, and proposing a reusable 10-category error taxonomy and dual-review evaluation protocol.

Tags: Small Language Model, SLM, RAG, Agentic Workflow, Financial Reasoning, Evaluation, Failure Modes, Qwen, Llama, Phi
Published 2026-06-06 03:59 | Recent activity 2026-06-06 04:18 | Estimated read 7 min

Section 01

[Introduction] Study on Failure Modes of Small Language Models in Intelligent RAG Workflows

This paper systematically evaluates four small language models (SLMs) on financial document reasoning tasks, identifies the dominant failure modes in intelligent (agentic) RAG workflows, and proposes a reusable error taxonomy and dual-review protocol.

Original Authors: Muhammad Ahmed Mufti, Usman Haroon (FAST National University)
Source: GitHub Project GenAI_Project
Link: https://github.com/UsmanHaroon1177/GenAI_Project
Release Time: 2026-05-12

The study evaluates four SLMs: Qwen3-1.7B, SmolLM3-3B, Phi-4-mini, and Llama-3.1-8B. GPT-OSS-120B serves as a capability upper bound for comparison.

Section 02

Research Background and Motivation

As LLMs have become widespread, SLMs have drawn growing interest because they are cheap to deploy and fast at inference. But how do SLMs perform in scenarios such as financial document analysis, which demand precise numerical calculation and multi-step reasoning? And which workflow, traditional RAG or agentic RAG, suits SLMs better?

This study addresses both questions by systematically evaluating four mainstream SLMs on financial reasoning tasks, giving practitioners a reference point for deploying SLMs.

Section 03

Research Methods and Design

Experimental Framework

  1. Retrieval Strategy: Combine BM25 sparse retrieval (top 50) with BGE-small dense retrieval (top 50), deduplicate the merged candidates, then rerank with bge-reranker-v2-m3 and keep the top 8 text chunks.
  2. Prompt Engineering: The initial conservative prompt led the models to refuse 98-99% of questions, so it was relaxed into a RAG prompt that guides the model to identify the relevant line items, calculate step by step, and output the answer in a specified format.
  3. Agent Protocol: The agent follows the ReAct protocol, and the model's first submitted output is taken as its answer, with no self-validation step.
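
The retrieval step above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the function `hybrid_retrieve` and the three toy scorers are hypothetical stand-ins for BM25, the BGE-small embedding model, and bge-reranker-v2-m3, chosen so the example runs with the standard library alone.

```python
from typing import Callable, List, Sequence

Scorer = Callable[[str, str], float]  # (query, chunk) -> relevance score


def hybrid_retrieve(
    query: str,
    chunks: Sequence[str],
    sparse_score: Scorer,   # stand-in for BM25 (assumption)
    dense_score: Scorer,    # stand-in for a dense embedding model (assumption)
    rerank_score: Scorer,   # stand-in for a cross-encoder reranker (assumption)
    k_candidates: int = 50,
    k_final: int = 8,
) -> List[str]:
    """Merge sparse and dense candidates, deduplicate, rerank, keep the top k_final."""

    def top_k(score: Scorer, k: int) -> List[int]:
        order = sorted(
            range(len(chunks)),
            key=lambda i: score(query, chunks[i]),
            reverse=True,
        )
        return order[:k]

    sparse_ids = top_k(sparse_score, k_candidates)
    dense_ids = top_k(dense_score, k_candidates)

    # dict.fromkeys removes chunk ids found by both retrievers, keeping first-seen order.
    candidate_ids = list(dict.fromkeys(sparse_ids + dense_ids))

    reranked = sorted(
        candidate_ids,
        key=lambda i: rerank_score(query, chunks[i]),
        reverse=True,
    )
    return [chunks[i] for i in reranked[:k_final]]


# Toy scorers so the sketch is runnable; real systems would use trained models.
def word_overlap(query: str, text: str) -> float:
    return float(len(set(query.lower().split()) & set(text.lower().split())))


def char_trigram_overlap(query: str, text: str) -> float:
    def grams(s: str) -> set:
        return {s[i:i + 3] for i in range(len(s) - 2)}
    return float(len(grams(query.lower()) & grams(text.lower())))


def toy_rerank(query: str, text: str) -> float:
    return word_overlap(query, text) + char_trigram_overlap(query, text)


chunks = [
    "Net revenue was 120 million in 2023.",
    "The company opened two new offices.",
    "Operating income rose to 30 million.",
]
query = "net revenue 2023"
top = hybrid_retrieve(
    query, chunks, word_overlap, char_trigram_overlap, toy_rerank,
    k_candidates=2, k_final=1,
)
print(top)
```

Deduplicating by chunk index before reranking means the comparatively expensive reranker scores each candidate only once, even when both retrievers return it.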

Evaluation System

  • 10-Category Error Taxonomy: Covers numerical calculation errors, hallucinations, format errors, and other categories.
  • Dual-Review Mechanism: Llama-3.3-70B and Qwen-2.5-72B each evaluate the outputs independently.
  • Statistical Confidence: Accuracy is reported with 95% confidence intervals computed with the Wilson method.
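
For readers unfamiliar with the Wilson interval, here is a minimal sketch of the standard formula. The function name `wilson_interval` is illustrative, and the example count of 59 correct answers out of 150 is an assumption: the summary does not state the sample size, and these numbers were picked only because they are consistent with a 39.3% accuracy.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.959964 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")

    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical counts (assumption): 59 correct out of 150 questions.
low, high = wilson_interval(59, 150)
print(f"{59 / 150:.1%} [{low:.1%}, {high:.1%}]")  # roughly 39.3% [31.9%, 47.3%]
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly for proportions near 0 or 1, which matters for low accuracies such as those seen in the agentic setting.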

Section 04

Key Findings: Accuracy Comparison and Failure Modes

Accuracy Comparison

Every model, including the GPT-OSS-120B reference, is markedly less accurate in the agentic workflow than in simple RAG:

Model           Simple RAG Accuracy [95% CI]    Agentic RAG Accuracy [95% CI]
Qwen3-1.7B      39.3% [31.9, 47.3]              12.7% [8.3, 18.9]
SmolLM3-3B      28.7% [22.0, 36.4]              13.3% [8.8, 19.7]
Phi-4-mini      32.0% [25.1, 39.8]              19.3% [13.8, 26.4]
Llama-3.1-8B    32.7% [25.7, 40.5]              6.0% [3.2, 11.0]
GPT-OSS-120B    53.7% [45.7, 61.5]              32.0% [25.1, 39.8]
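
To make the size of the gap concrete, the short sketch below computes the drop in percentage points for each model from the figures in the table, and flags models whose two intervals overlap. The variable names are illustrative, and the overlap check is only a rough heuristic, not a formal significance test.

```python
# (accuracy %, interval low, interval high) for simple RAG and agentic RAG,
# copied from the table above.
results = {
    "Qwen3-1.7B":   ((39.3, 31.9, 47.3), (12.7, 8.3, 18.9)),
    "SmolLM3-3B":   ((28.7, 22.0, 36.4), (13.3, 8.8, 19.7)),
    "Phi-4-mini":   ((32.0, 25.1, 39.8), (19.3, 13.8, 26.4)),
    "Llama-3.1-8B": ((32.7, 25.7, 40.5), (6.0, 3.2, 11.0)),
    "GPT-OSS-120B": ((53.7, 45.7, 61.5), (32.0, 25.1, 39.8)),
}

# Drop in percentage points when moving from simple RAG to agentic RAG.
drops = {
    model: round(simple[0] - agentic[0], 1)
    for model, (simple, agentic) in results.items()
}

# Models whose simple-RAG interval overlaps the agentic interval.
overlapping = [
    model
    for model, (simple, agentic) in results.items()
    if simple[1] <= agentic[2] and agentic[1] <= simple[2]
]

for model, drop in drops.items():
    print(f"{model}: -{drop} points")
print("Overlapping intervals:", overlapping)
```

On these figures the drops range from 12.7 points (Phi-4-mini) to 26.7 points (Llama-3.1-8B), and Phi-4-mini is the only model whose two intervals overlap.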

Failure Modes

  1. Numerical Calculation Errors: Errors accumulate across multi-step arithmetic, especially in complex financial formulas.
  2. Hallucinations: The model generates information that is inconsistent with the retrieved content. This is more prominent in agentic workflows.
  3. Tool Usage Errors: The model makes format or parameter-passing mistakes when calling external tools.

Section 05

Review Consistency and Practical Implications

Review Consistency

  • Cohen's κ coefficient: 0.6528 (substantial agreement)
  • RAGAS context recall Spearman correlation coefficient: 0.7767
  • Sample size: 1,498 dual-reviewed samples
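
As a reminder of what the κ statistic measures, here is a minimal sketch of Cohen's kappa for two reviewers. The function name `cohens_kappa` and the four toy verdicts are illustrative assumptions, not data from the study.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Agreement between two raters, corrected for agreement expected by chance."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")

    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability both raters pick the same label independently.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters always used the same single label
    return (observed - expected) / (1 - expected)


# Toy verdicts from two hypothetical reviewers on four answers.
reviewer_1 = ["correct", "correct", "wrong", "wrong"]
reviewer_2 = ["correct", "wrong", "wrong", "wrong"]
kappa = cohens_kappa(reviewer_1, reviewer_2)
print(kappa)  # 0.5
```

In the toy example the reviewers agree on 3 of 4 items (0.75), chance agreement is 0.5, so κ = (0.75 - 0.5) / (1 - 0.5) = 0.5.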

Practical Recommendations

  1. Simple RAG is more suitable for SLMs: Complex agentic workflows easily introduce more errors.
  2. Task-Model Matching: Financial reasoning requires precise calculation, so SLM capability boundaries should be fully considered.
  3. Reuse Evaluation Framework: The 10-category error taxonomy and dual-review protocol can be extended to other fields.

Section 06

Limitations and Future Directions

Limitations

  • All experiments ran at zero temperature (T=0), so sampling variance was not explored.
  • Only one retrieval pipeline configuration was tested.
  • Both review models are in the 70B class; no smaller review models or human reviewers were included.
  • The Gemini 2.5 Flash experiment is incomplete because of API quota limits.

Future Directions

  • Explore agent architectures more suitable for SLMs.
  • Develop specialized numerical reasoning modules.
  • Build more fine-grained error diagnosis tools.

Section 07

Research Conclusion

This study measures how SLMs actually perform on financial reasoning tasks under a controlled evaluation. Its key finding is that agentic workflows are not always better than simple RAG, especially for SLMs, which gives practitioners concrete guidance. As SLMs spread into edge computing and similar scenarios, understanding their capability boundaries and failure modes will become increasingly important.