Zing Forum


Practical Vertical Fine-Tuning: Using 37 Data Points to Make Llama 3.1 8B Outperform Cutting-Edge Models in Banking Business Analysis

A 5-day vertical fine-tuning demonstration: 37 manually curated training examples on Fireworks AI cut the cost of Llama 3.1 8B for bank comparable company analysis roughly 1000-fold, while keeping quality competitive with GPT-5.5 and Claude Opus 4.7.

Tags: Large language models · Vertical fine-tuning · LoRA · Fireworks AI · Finance domain · Cost optimization · Llama 3.1 · Model evaluation
Published 2026-05-11 03:44 · Recent activity 2026-05-11 03:50 · Estimated read 7 min

Section 01

Main Floor: Cost and Quality Breakthroughs in Vertical Fine-Tuning Llama 3.1 8B for Banking Business Analysis

A 5-day experiment shows that vertically fine-tuning Llama 3.1 8B with 37 manually curated training examples on the Fireworks AI platform cuts the cost of bank comparable company analysis roughly 1000-fold while maintaining quality competitive with GPT-5.5 and Claude Opus 4.7. Key finding: after careful vertical fine-tuning, an open-source model can match cutting-edge closed-source models on a domain-specific task, with inference costs reduced to roughly 1/1000 of theirs.


Section 02

Background: Core Requirements of Comparable Company Analysis and Pain Points of Cutting-Edge Models

Comparable company analysis is a daily task in the financial industry, with three requirements: 1) correct valuation multiples (e.g., P/E and P/TBV for banks, not industrial metrics); 2) real data (no placeholders or estimates); 3) clear source citations. However, the experiments found that under production API settings (temperature=0.0, neutral prompts), cutting-edge models could not meet all three requirements simultaneously.
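The first requirement can be made concrete with a small helper. This is an illustrative sketch, not from the article: the figures are made up, and P/TBV is computed against tangible book value per share (industrial metrics such as EV/EBITDA are deliberately absent).

```python
# Bank-appropriate valuation multiples: P/E and P/TBV.
# Inputs and example figures below are illustrative assumptions.

def bank_multiples(price: float, eps: float, tangible_book_per_share: float) -> dict:
    """Compute the two standard bank valuation multiples."""
    return {
        "P/E": round(price / eps, 2),
        "P/TBV": round(price / tangible_book_per_share, 2),
    }

# Example: a bank trading at $40 with $4.00 EPS and $25.00 TBV per share.
print(bank_multiples(40.0, 4.0, 25.0))  # {'P/E': 10.0, 'P/TBV': 1.6}
```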


Section 03

Experimental Methods: Model Selection and Training/Evaluation Configuration

Model and Training Configuration

  • Base model: Llama 3.1 8B
  • Training method: Supervised Fine-Tuning (SFT) + LoRA (rank 16)
  • Training data: 37 manually curated examples (26 bank comparison tables, 5 FIG vs industrial comparisons, 6 mid-sized bank data points)
  • Training epochs: 5
  • Max context length: 4096
  • Batch size: 4096
  • Learning rate: 0.0002
  • Training cost: ~$0.03, taking 30 minutes
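As a sketch of what one of the 37 SFT examples might look like, the snippet below writes a chat-style JSONL record, the format commonly used for supervised fine-tuning datasets. The schema details, ticker, and figures are illustrative assumptions, not taken from the article's dataset; check the platform's dataset documentation for the exact fields it expects.

```python
import json

# One hypothetical SFT training example in chat-style JSONL
# (one JSON object per line). Contents are illustrative only.
example = {
    "messages": [
        {"role": "system", "content": "You are a helpful financial analyst"},
        {"role": "user", "content": "Build a comparable company table for regional bank XYZ."},
        {"role": "assistant", "content": "| Ticker | P/E | P/TBV | Source |\n| XYZ | 9.8x | 1.3x | 10-K FY2024 |"},
    ]
}

with open("train.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")
```

With only 37 such lines, the hyperparameters above (5 epochs, high learning rate for LoRA) matter far more than they would on a large corpus.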

Evaluation Settings

  • Test set: 5 held-out banks not seen in training (C, HBAN, WBS, UMBF, INDB)
  • Temperature: 0.0 (deterministic)
  • Comparison models: GPT-5.5, Claude Opus 4.7
  • System prompt: neutral ("You are a helpful financial analyst")
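The evaluation settings above can be sketched as a request builder for an OpenAI-compatible chat completions endpoint (which Fireworks exposes). The model ID placeholder is hypothetical; the temperature, test tickers, and neutral system prompt come from the settings listed.

```python
import json

# Held-out banks from the evaluation settings above.
TEST_BANKS = ["C", "HBAN", "WBS", "UMBF", "INDB"]

def build_request(ticker: str, model: str) -> dict:
    """Assemble one deterministic evaluation request payload."""
    return {
        "model": model,  # e.g. a fine-tuned model ID on your account (placeholder)
        "temperature": 0.0,  # deterministic decoding, per the eval settings
        "messages": [
            {"role": "system", "content": "You are a helpful financial analyst"},
            {"role": "user", "content": f"Build a comparable company table for {ticker}."},
        ],
    }

payload = build_request("HBAN", "accounts/<your-account>/models/<your-lora>")
print(json.dumps(payload, indent=2))
```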

Section 04

Evidence: Evaluation Results and Key Indicator Comparison

Experimental results show that the fine-tuned model wins on cost and on several key quality dimensions:

| Indicator | Fine-tuned Llama 3.1 8B | GPT-5.5 | Claude Opus 4.7 |
| --- | --- | --- | --- |
| Average composite score | 77.1 | 83.4 | 87.0 |
| Industrial indicator misuse rate | 20% | 40% | 40% |
| Tier-3 source citation rate | 100% | 80% | 80% |
| Hallucinations | 0 | 0 | 3 |
| Score variance | 21 | 55 | 33 |
| Cost per inference | $0.00009 | $0.0894 | $0.1058 |
| Cost multiple | 1× (baseline) | 994× | 1,176× |

The fine-tuned model won in 6 out of 9 evaluation dimensions, especially with 100% source citation accuracy and no hallucination issues.
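The cost-multiple row can be reproduced from the per-inference costs in the table; the 993 vs 994 discrepancy comes from rounding in the reported costs.

```python
# Per-inference costs as reported in the results table above.
FT_COST = 0.00009    # fine-tuned Llama 3.1 8B
GPT_COST = 0.0894    # GPT-5.5
OPUS_COST = 0.1058   # Claude Opus 4.7

print(round(GPT_COST / FT_COST))   # 993, vs the reported 994x
print(round(OPUS_COST / FT_COST))  # 1176, matching the reported 1,176x
```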


Section 05

Technical Details and Lessons Learned

Evaluation Dimensions

Five FIG analyst-level criteria, 100 points in total: format correctness (25 points), numerical rationality (25 points), subcategory awareness (20 points), citation quality (15 points), format completeness (15 points).
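A minimal sketch of the 100-point rubric, assuming each dimension is scored as a fraction in [0, 1] before weighting (an assumption; the article does not specify how raw scores are normalized).

```python
# Rubric weights from the evaluation dimensions above (sum to 100).
WEIGHTS = {
    "format_correctness": 25,
    "numerical_rationality": 25,
    "subcategory_awareness": 20,
    "citation_quality": 15,
    "format_completeness": 15,
}

def composite_score(scores: dict) -> float:
    """Weighted sum of normalized per-dimension scores, max 100."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

# A perfect response scores 100.
print(composite_score({dim: 1.0 for dim in WEIGHTS}))  # 100.0
```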

Iteration Process

  • v1: default parameters showed no learning effect
  • v2: corrected the evaluation method
  • v3: adjusted hyperparameters, raising the citation rate to 53%
  • v4: added data; the misuse rate dropped to 20% and the citation rate reached 100%

Key Lessons

  1. Fireworks default parameters are not suited to small datasets
  2. The loss curve needs to drop below 1.0 for training to be effective
  3. Evaluation methods must be objective
  4. Hold out test sets to avoid overfitting
  5. Use temperature=0.0 for deterministic tasks
  6. Scoring criteria need to be context-aware
  7. Cutting-edge API parameters need testing
  8. Cost advantage is the core GTM point
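Lesson 2 (loss below 1.0) can be checked mechanically. The log format below, one JSON object per line with a `loss` field, is an assumption for illustration; adapt the parser to whatever metrics export your training platform provides.

```python
import json

def final_loss_ok(log_lines, threshold=1.0):
    """Return True if the last logged training loss is below threshold."""
    losses = [json.loads(line)["loss"] for line in log_lines]
    return losses[-1] < threshold

# Hypothetical metrics log: loss falls from 2.4 to 0.7 over training.
log = ['{"step": 1, "loss": 2.4}', '{"step": 40, "loss": 0.7}']
print(final_loss_ok(log))  # True
```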

Section 06

Limitations Note

The experiment has four limitations:

  1. Average quality gap: fine-tuned model (77 points) vs Claude Opus 4.7 (87 points)
  2. Cutting-edge models will keep improving in common vertical domains
  3. Small test sample size (N=5)
  4. Low format completeness score (4.4/15)


Section 07

Business Value and Replicable Strategies

Core Value

The recipe (dive deep into a vertical workflow, identify where cutting-edge models fall short, and balance cost and quality via small-scale fine-tuning) can be replicated across multiple domains:

| Vertical Domain | Workflow Gap | High-Volume Scenario |
| --- | --- | --- |
| Banking / capital markets | Comparison tables, transaction screening | Sell-side analysts run thousands of comparisons monthly |
| Medical claims | Denial code disambiguation | Millions of claims processed daily |
| Legal | Contract clause classification | Hundreds of contract reviews weekly |
| Logistics | Invoice parsing | 10,000+ documents processed daily |
| Insurance | Policy review | Thousands of underwriting checks daily |

This method is revolutionary for high-volume vertical workloads where cost differences determine feasibility.