# Practical Vertical Fine-Tuning: Using 37 Data Points to Make Llama 3.1 8B Outperform Cutting-Edge Models in Banking Business Analysis

> A 5-day vertical fine-tuning demonstration showing how 37 manually curated training examples on Fireworks AI let a fine-tuned Llama 3.1 8B handle bank comparable company analysis at roughly 1/1000 the cost of GPT-5.5 and Claude Opus 4.7, at competitive quality.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-10T19:44:39.000Z
- Last activity: 2026-05-10T19:50:55.326Z
- Heat: 150.9
- Keywords: large language models, vertical fine-tuning, LoRA, Fireworks AI, finance domain, cost optimization, Llama 3.1, model evaluation
- Page URL: https://www.zingnex.cn/en/forum/thread/37llama-3-1-8b
- Canonical: https://www.zingnex.cn/forum/thread/37llama-3-1-8b
- Markdown source: floors_fallback

---

## Main Floor: Cost and Quality Breakthroughs in Vertical Fine-Tuning Llama 3.1 8B for Banking Business Analysis

A 5-day experiment shows that vertically fine-tuning Llama 3.1 8B with 37 manually curated training examples on the Fireworks AI platform yields a roughly 1000x cost reduction on bank comparable company analysis tasks while maintaining quality competitive with GPT-5.5 and Claude Opus 4.7. Key finding: after careful vertical fine-tuning, an open-source model can match cutting-edge closed-source models on domain-specific tasks at roughly 1/1000 of their inference cost.

## Background: Core Requirements of Comparable Company Analysis and Pain Points of Cutting-Edge Models

Comparable company analysis is a daily task in the financial industry. A usable output must satisfy three requirements: 1) correct valuation multiples (e.g., P/E and P/TBV for banks, not industrial indicators); 2) real data (no placeholders or estimates); 3) clear source citations. The experiments found that cutting-edge models fail to meet all three simultaneously under production API settings (temperature=0.0, neutral prompts).
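The three requirements can be expressed as an automated output check. A minimal sketch: the metric lists, placeholder markers, and function name below are illustrative assumptions, not the experiment's actual evaluation harness.

```python
# Hypothetical checker for the three requirements of a bank comp table.
# All metric lists and marker strings are illustrative assumptions.

BANK_MULTIPLES = {"P/E", "P/TBV", "P/B"}          # appropriate for banks
INDUSTRIAL_MULTIPLES = {"EV/EBITDA", "EV/Sales"}  # misuse signals for FIG
PLACEHOLDER_TOKENS = ("N/A", "TBD", "XX", "[placeholder]")

def check_comp_table(text: str) -> dict:
    """Flag the three failure modes: wrong multiples, placeholder data, no source."""
    used_industrial = sorted(m for m in INDUSTRIAL_MULTIPLES if m in text)
    has_placeholder = any(tok in text for tok in PLACEHOLDER_TOKENS)
    has_citation = "Source:" in text  # stand-in for real citation detection
    return {
        "industrial_misuse": used_industrial,
        "placeholder_data": has_placeholder,
        "cited": has_citation,
    }
```

A real harness would resolve citations against Tier-3 filings rather than string-match, but a gate like this catches the two mechanical failure modes (industrial multiples, placeholder values) cheaply.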

## Experimental Methods: Model Selection and Training/Evaluation Configuration

### Model and Training Configuration
- Base model: Llama 3.1 8B
- Training method: Supervised Fine-Tuning (SFT) + LoRA (rank 16)
- Training data: 37 manually curated examples (26 bank comparison tables, 5 FIG vs industrial comparisons, 6 mid-sized bank data points)
- Training epochs: 5
- Max context length: 4096
- Batch size: 4096
- Learning rate: 0.0002
- Training cost: ~$0.03, taking 30 minutes
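The training configuration above can be pinned down in one place. A sketch using an illustrative dataclass; the field names and model id are assumptions, not the Fireworks fine-tuning API schema.

```python
from dataclasses import dataclass, asdict

# Illustrative container for the run's hyperparameters, mirroring the
# values listed above. Field names are assumptions, not an API schema.
@dataclass(frozen=True)
class SftLoraConfig:
    base_model: str = "llama-v3p1-8b"  # illustrative model id
    lora_rank: int = 16
    epochs: int = 5
    max_context_length: int = 4096
    batch_size: int = 4096
    learning_rate: float = 2e-4
    train_examples: int = 37

config = SftLoraConfig()
```

Freezing the config and serializing it with `asdict` alongside each run makes the v1-v4 iterations below reproducible and diffable.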

### Evaluation Settings
- Test set: 5 banks held out from training (C, HBAN, WBS, UMBF, INDB)
- Temperature: 0.0 (deterministic)
- Comparison models: GPT-5.5, Claude Opus 4.7
- System prompt: neutral ("You are a helpful financial analyst")
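These settings map onto a single request template applied identically to all three models. A sketch assuming an OpenAI-compatible chat-completions payload; the helper name and user-prompt wording are illustrative.

```python
# Evaluation request template: same neutral system prompt and
# temperature=0.0 for every model. Payload shape assumes an
# OpenAI-compatible chat schema; the user prompt wording is invented.

SYSTEM_PROMPT = "You are a helpful financial analyst"
TEST_TICKERS = ["C", "HBAN", "WBS", "UMBF", "INDB"]

def build_eval_request(model: str, ticker: str) -> dict:
    """Deterministic request (temperature=0.0) with the neutral system prompt."""
    return {
        "model": model,
        "temperature": 0.0,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": f"Build a comparable company table for {ticker}."},
        ],
    }
```

Keeping the template identical across models is what makes the score comparison below fair: no model gets a domain-tuned prompt.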

## Evidence: Evaluation Results and Key Indicator Comparison

Experimental results show that the fine-tuned model leads on cost and on several key quality dimensions:

| Indicator | Fine-tuned Llama 3.1 8B | GPT-5.5 | Claude Opus 4.7 |
|-----------|------------------------|---------|------------------|
| Average composite score | 77.1 | 83.4 | 87.0 |
| Industrial indicator misuse rate | 20% | 40% | 40% |
| Tier-3 source citation rate | 100% | 80% | 80% |
| Hallucinations | 0 | 0 | 3 |
| Score variance | 21 | 55 | 33 |
| Single inference cost | $0.00009 | $0.0894 | $0.1058 |
| Cost multiple | 1× | 994× | 1,176× |

The fine-tuned model won 6 of the 9 evaluation dimensions, most notably a 100% Tier-3 source citation rate and zero hallucinations.
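The cost multiples in the table can be reproduced from the per-inference prices; the figures match when each ratio is rounded up to the next integer.

```python
import math

# Rebuilding the table's cost multiples from the per-inference prices.
COST_FT = 0.00009   # fine-tuned Llama 3.1 8B
COST_GPT = 0.0894   # GPT-5.5
COST_OPUS = 0.1058  # Claude Opus 4.7

gpt_multiple = math.ceil(COST_GPT / COST_FT)    # 994x
opus_multiple = math.ceil(COST_OPUS / COST_FT)  # 1176x
```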

## Technical Details and Lessons Learned

### Evaluation Dimensions
The rubric comprises five FIG analyst-level criteria totaling 100 points: format correctness (25), numerical rationality (25), subcategory awareness (20), citation quality (15), and format completeness (15).
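A composite score follows by weighting per-dimension results by the rubric points. A minimal sketch; the function and key names are illustrative, only the weights come from the criteria above.

```python
# The five rubric dimensions and their point weights (100 total).
RUBRIC = {
    "format_correctness": 25,
    "numerical_rationality": 25,
    "subcategory_awareness": 20,
    "citation_quality": 15,
    "format_completeness": 15,
}

def composite_score(fractions: dict[str, float]) -> float:
    """Weight per-dimension fractions (0.0-1.0) into a 0-100 composite."""
    assert set(fractions) == set(RUBRIC), "score every dimension exactly once"
    return sum(RUBRIC[dim] * frac for dim, frac in fractions.items())
```

Under this weighting, the reported 4.4/15 on format completeness corresponds to a per-dimension fraction of about 0.29.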

### Iteration Process
- v1: default parameters showed no learning effect
- v2: corrected the evaluation method
- v3: adjusted hyperparameters, raising the citation rate to 53%
- v4: added training data, cutting the misuse rate to 20% and raising the citation rate to 100%

### Key Lessons
1. Fireworks default parameters are not suited to small datasets.
2. The training loss curve needs to drop below 1.0 for the run to be effective.
3. Evaluation methods must be objective.
4. Hold out a test set to avoid overfitting.
5. Use temperature=0.0 for deterministic tasks.
6. Scoring criteria need to be context-aware.
7. Cutting-edge API parameters need testing.
8. The cost advantage is the core GTM point.
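Lesson 2 can be enforced as a simple gate in a training script: only promote a checkpoint once the loss curve has crossed the threshold. A sketch with made-up loss values; the 1.0 threshold is the one quoted above.

```python
# Gate from lesson 2: a run counts as "learned" only if the final
# training loss has dropped below 1.0. Loss values in tests are invented.

def run_learned(losses: list[float], threshold: float = 1.0) -> bool:
    """True if the final training loss is under the effectiveness threshold."""
    return bool(losses) and losses[-1] < threshold
```

This is the kind of check that would have flagged v1 (flat loss curve, no learning effect) before any manual evaluation was spent on it.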

## Limitations Note

The experiment has four limitations:
1. Average quality gap: fine-tuned model (77 points) vs Claude Opus 4.7 (87 points).
2. Cutting-edge models will keep improving in common vertical domains.
3. Small test sample size (N=5).
4. Low format completeness score (4.4/15).

## Business Value and Replicable Strategies

### Core Value
By diving deep into vertical workflows, identifying weaknesses of cutting-edge models, and achieving cost-quality balance via small-scale fine-tuning, this approach can be replicated across multiple domains:

| Vertical Domain | Workflow Gap | High-Volume Scenario |
|-----------------|--------------|----------------------|
| Banking/Capital Markets | Comparison tables, transaction screening | Sell-side analysts perform thousands of comparisons monthly |
| Medical Claims | Denial code disambiguation | Millions of claims processed daily |
| Legal | Contract clause classification | Hundreds of contract reviews weekly |
| Logistics | Invoice parsing | 10,000+ documents processed daily |
| Insurance | Policy review | Thousands of underwriting checks daily |

This method matters most for high-volume vertical workloads, where cost differences determine feasibility.
