Zing Forum


Practical Vertical Fine-Tuning: Using 37 Data Points to Make Llama 3.1 8B Outperform Cutting-Edge Models in Banking Business Analysis

A 5-day vertical fine-tuning demonstration: 37 manually curated training examples on Fireworks AI cut the cost of Llama 3.1 8B for bank comparable company analysis roughly 1000-fold, while keeping quality competitive with GPT-5.5 and Claude Opus 4.7.

Tags: Large language models · Vertical fine-tuning · LoRA · Fireworks AI · Finance domain · Cost optimization · Llama 3.1 · Model evaluation
Published 2026-05-11 03:44 · Recent activity 2026-05-11 03:50 · Estimated read 7 min

Section 01

Main Floor: Cost and Quality Breakthroughs in Vertical Fine-Tuning Llama 3.1 8B for Banking Business Analysis

A 5-day experiment shows that vertically fine-tuning Llama 3.1 8B with 37 manually curated training examples on the Fireworks AI platform cuts the cost of bank comparable company analysis roughly 1000-fold while maintaining quality competitive with GPT-5.5 and Claude Opus 4.7. Key finding: after careful vertical fine-tuning, an open-source model can match cutting-edge closed-source models on a domain-specific task, with inference costs reduced to roughly 1/1000 of theirs.


Section 02

Background: Core Requirements of Comparable Company Analysis and Pain Points of Cutting-Edge Models

Comparable company analysis is a daily task in the financial industry, with three requirements: 1) correct valuation multiples (e.g., P/E and P/TBV for banks, not industrial metrics); 2) real data (no placeholders or estimates); 3) clear source citations. However, the experiments found that under production API settings (temperature=0.0, neutral prompts), cutting-edge models could not meet all three requirements simultaneously.
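The first requirement can be made concrete with a small helper. This is an illustrative sketch, not from the article: the figures are made up, and P/TBV is computed against tangible book value per share (industrial metrics such as EV/EBITDA are deliberately absent).

```python
# Bank-appropriate valuation multiples: P/E and P/TBV.
# Inputs and example figures below are illustrative assumptions.

def bank_multiples(price: float, eps: float, tangible_book_per_share: float) -> dict:
    """Compute the two standard bank valuation multiples."""
    return {
        "P/E": round(price / eps, 2),
        "P/TBV": round(price / tangible_book_per_share, 2),
    }

# Example: a bank trading at $40 with $4.00 EPS and $25.00 TBV per share.
print(bank_multiples(40.0, 4.0, 25.0))  # {'P/E': 10.0, 'P/TBV': 1.6}
```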


Section 03

Experimental Methods: Model Selection and Training/Evaluation Configuration

Model and Training Configuration

  • Base model: Llama 3.1 8B
  • Training method: Supervised Fine-Tuning (SFT) + LoRA (rank 16)
  • Training data: 37 manually curated examples (26 bank comparison tables, 5 FIG vs industrial comparisons, 6 mid-sized bank data points)
  • Training epochs: 5
  • Max context length: 4096
  • Batch size: 4096
  • Learning rate: 0.0002
  • Training cost: ~$0.03, taking 30 minutes
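As a sketch of what one of the 37 SFT examples might look like, the snippet below writes a chat-style JSONL record, the format commonly used for supervised fine-tuning datasets. The schema details, ticker, and figures are illustrative assumptions, not taken from the article's dataset; check the platform's dataset documentation for the exact fields it expects.

```python
import json

# One hypothetical SFT training example in chat-style JSONL
# (one JSON object per line). Contents are illustrative only.
example = {
    "messages": [
        {"role": "system", "content": "You are a helpful financial analyst"},
        {"role": "user", "content": "Build a comparable company table for regional bank XYZ."},
        {"role": "assistant", "content": "| Ticker | P/E | P/TBV | Source |\n| XYZ | 9.8x | 1.3x | 10-K FY2024 |"},
    ]
}

with open("train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")
```

With only 37 such lines, the hyperparameters above (5 epochs, high learning rate for LoRA) matter far more than they would on a large corpus.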

Evaluation Settings

  • Test set: 5 held-out banks not seen in training (C, HBAN, WBS, UMBF, INDB)
  • Temperature: 0.0 (deterministic)
  • Comparison models: GPT-5.5, Claude Opus 4.7
  • System prompt: neutral ("You are a helpful financial analyst")
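The evaluation settings above can be sketched as a request builder for an OpenAI-compatible chat completions endpoint (which Fireworks exposes). The model ID placeholder is hypothetical; the temperature, test tickers, and neutral system prompt come from the settings listed.

```python
import json

# Held-out banks from the evaluation settings above.
TEST_BANKS = ["C", "HBAN", "WBS", "UMBF", "INDB"]

def build_request(ticker: str, model: str) -> dict:
    """Assemble one deterministic evaluation request payload."""
    return {
        "model": model,  # e.g. a fine-tuned model ID on your account (placeholder)
        "temperature": 0.0,  # deterministic decoding, per the eval settings
        "messages": [
            {"role": "system", "content": "You are a helpful financial analyst"},
            {"role": "user", "content": f"Build a comparable company table for {ticker}."},
        ],
    }

payload = build_request("HBAN", "accounts/<your-account>/models/<your-lora>")
print(json.dumps(payload, indent=2))
```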

Section 04

Evidence: Evaluation Results and Key Indicator Comparison

Experimental results show that the fine-tuned model wins on cost and on several key quality dimensions:

| Indicator | Fine-tuned Llama 3.1 8B | GPT-5.5 | Claude Opus 4.7 |
| --- | --- | --- | --- |
| Average composite score | 77.1 | 83.4 | 87.0 |
| Industrial indicator misuse rate | 20% | 40% | 40% |
| Tier-3 source citation rate | 100% | 80% | 80% |
| Hallucinations | 0 | 0 | 3 |
| Score variance | 21 | 55 | 33 |
| Cost per inference | $0.00009 | $0.0894 | $0.1058 |
| Cost multiple | 1× (baseline) | 994× | 1,176× |

The fine-tuned model won in 6 out of 9 evaluation dimensions, especially with 100% source citation accuracy and no hallucination issues.
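The cost-multiple row can be reproduced from the per-inference costs in the table; the 993 vs 994 discrepancy comes from rounding in the reported costs.

```python
# Per-inference costs as reported in the results table above.
FT_COST = 0.00009    # fine-tuned Llama 3.1 8B
GPT_COST = 0.0894    # GPT-5.5
OPUS_COST = 0.1058   # Claude Opus 4.7

print(round(GPT_COST / FT_COST))   # 993, vs the reported 994x
print(round(OPUS_COST / FT_COST))  # 1176, matching the reported 1,176x
```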


Section 05

Technical Details and Lessons Learned

Evaluation Dimensions

Five FIG analyst-level criteria, 100 points in total: format correctness (25 points), numerical rationality (25 points), subcategory awareness (20 points), citation quality (15 points), format completeness (15 points).
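A minimal sketch of the 100-point rubric, assuming each dimension is scored as a fraction in [0, 1] before weighting (an assumption; the article does not specify how raw scores are normalized).

```python
# Rubric weights from the evaluation dimensions above (sum to 100).
WEIGHTS = {
    "format_correctness": 25,
    "numerical_rationality": 25,
    "subcategory_awareness": 20,
    "citation_quality": 15,
    "format_completeness": 15,
}

def composite_score(scores: dict) -> float:
    """Weighted sum of normalized per-dimension scores, max 100."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

# A perfect response scores 100.
print(composite_score({dim: 1.0 for dim in WEIGHTS}))  # 100.0
```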

Iteration Process

  • v1: default parameters showed no learning effect
  • v2: corrected the evaluation method
  • v3: adjusted hyperparameters, raising the citation rate to 53%
  • v4: added data; the misuse rate dropped to 20% and the citation rate reached 100%

Key Lessons

  1. Fireworks default parameters are not suited to small datasets
  2. The loss curve needs to drop below 1.0 for training to be effective
  3. Evaluation methods must be objective
  4. Hold out test sets to avoid overfitting
  5. Use temperature=0.0 for deterministic tasks
  6. Scoring criteria need to be context-aware
  7. Cutting-edge API parameters need testing
  8. Cost advantage is the core GTM point
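Lesson 2 (loss below 1.0) can be checked mechanically. The log format below, one JSON object per line with a `loss` field, is an assumption for illustration; adapt the parser to whatever metrics export your training platform provides.

```python
import json

def final_loss_ok(log_lines, threshold=1.0):
    """Return True if the last logged training loss is below threshold."""
    losses = [json.loads(line)["loss"] for line in log_lines]
    return losses[-1] < threshold

# Hypothetical metrics log: loss falls from 2.4 to 0.7 over training.
log = ['{"step": 1, "loss": 2.4}', '{"step": 40, "loss": 0.7}']
print(final_loss_ok(log))  # True
```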

Section 06

Limitations Note

The experiment has four limitations:

  1. Average quality gap: fine-tuned model (77 points) vs Claude Opus 4.7 (87 points)
  2. Cutting-edge models will keep improving in common vertical domains
  3. Small test sample size (N=5)
  4. Low format completeness score (4.4/15)


Section 07

Business Value and Replicable Strategies

Core Value

The recipe (dive deep into a vertical workflow, identify where cutting-edge models fall short, and balance cost and quality via small-scale fine-tuning) can be replicated across multiple domains:

| Vertical Domain | Workflow Gap | High-Volume Scenario |
| --- | --- | --- |
| Banking / capital markets | Comparison tables, transaction screening | Sell-side analysts run thousands of comparisons monthly |
| Medical claims | Denial code disambiguation | Millions of claims processed daily |
| Legal | Contract clause classification | Hundreds of contract reviews weekly |
| Logistics | Invoice parsing | 10,000+ documents processed daily |
| Insurance | Policy review | Thousands of underwriting checks daily |

This method is revolutionary for high-volume vertical workloads where cost differences determine feasibility.