Zing Forum

Reading

Forge: A Multi-Agent Framework Enabling 99% Tool Call Success Rate for 8B Small Models

Forge is a Python framework that increases the multi-step tool call success rate of 8B-parameter models from 38% to 99% through reliability layers, guardrail mechanisms, and context management, supporting Ollama, llama-server, and Anthropic backends.

Tags: LLM, tool-calling, agentic-workflows, Ollama, self-hosted, Python, guardrails, context-management, multi-step-reasoning
Published 2026-04-02 12:42 · Recent activity 2026-04-02 12:48 · Estimated read 6 min

Section 01

[Introduction] Forge Framework: A Multi-Agent Solution Enabling 99% Tool Call Success Rate for 8B Small Models

Forge is a Python framework that raises the multi-step tool call success rate of 8B-parameter models from roughly 38% to 99% through reliability layers, guardrail mechanisms, and context management. It supports the Ollama, llama-server, and Anthropic backends, addressing two pain points at once: the weak tool-calling performance of open-source small models and the high cost of closed-source large models.


Section 02

Project Background and Core Positioning

Background: tool calling is a crucial LLM capability, but GPT-4-class closed-source models are costly, and open-source small models perform poorly in multi-step workflows. Forge positions itself as a Python framework built specifically for self-hosted LLMs, focused on tool calls and multi-step agent workflows. Unlike comprehensive frameworks such as LangChain, its core principles are reliability, a lightweight footprint, and flexibility: it enhances existing model capabilities through guardrails and context management rather than replacing the models themselves.


Section 03

Core Mechanism: Three-Layer Reliability Architecture

Forge improves reliability through three layered mechanisms:
1. Response parsing and rescue: automatically fixes format errors (e.g., mismatched brackets or quotes); if repair fails, it returns step-by-step feedback to guide regeneration.
2. Step enforcement and retry guidance: the required_steps mechanism enforces the sequence of steps; when steps are missing, it constructs prompts that point out the gaps and suggest the next tool.
3. Context management and intelligent compression: ContextManager's TieredCompact compresses history in tiers and supports VRAM-aware dynamic context sizing.
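Forge's internals are not shown in this digest, so the following is only a minimal sketch of what the first layer's idea could look like: attempt cheap mechanical repairs on a malformed tool-call payload (single quotes, a missing closing brace) before falling back to feedback for regeneration. The function name `rescue_tool_call` and the repair heuristics are assumptions for illustration, not Forge's actual API.

```python
import json

def rescue_tool_call(raw: str):
    """Try to repair a malformed tool-call payload.

    Returns (parsed, feedback): on success (possibly after repair), feedback
    is None; otherwise parsed is None and feedback is a message that could be
    fed back to the model to guide regeneration.
    """
    candidates = [raw.strip()]
    # Common small-model mistakes: single quotes instead of double quotes,
    # and a missing closing brace at the end of the payload.
    fixed = raw.strip().replace("'", '"')
    candidates.append(fixed)
    if fixed.count("{") > fixed.count("}"):
        candidates.append(fixed + "}" * (fixed.count("{") - fixed.count("}")))
    for cand in candidates:
        try:
            return json.loads(cand), None
        except json.JSONDecodeError as exc:
            last_error = str(exc)
    return None, f"Could not parse tool call ({last_error}); emit valid JSON with double quotes."

# A payload with single quotes and a missing closing brace gets rescued:
parsed, feedback = rescue_tool_call("{'tool': 'search', 'query': 'forge'")
```

Only when every repair candidate fails does the feedback string come into play, which keeps regeneration (the expensive path) as a last resort.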


Section 04

Three Usage Modes: Adapting to Different Scenario Needs

Forge offers three modes:
1. WorkflowRunner: full integration, managing tool sets and workflow lifecycles, with multi-agent collaboration via the SlotWorker component.
2. Middleware mode: embeds non-intrusively into existing projects, handling response validation, format rescue, and step enforcement.
3. Proxy server mode: an OpenAI-compatible server that applies the guardrail mechanisms transparently and automatically injects a respond tool to eliminate ambiguity between plain text and tool calls, so clients like Continue and aider work unchanged.
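The middleware idea above can be sketched generically: wrap an existing generation call with a validator and a bounded retry loop that feeds the failure reason back into the prompt. Everything here (`with_guardrails`, the fake backend, the message shapes) is a hypothetical illustration of the pattern, not Forge's middleware API.

```python
import json

def with_guardrails(generate, validate, max_retries: int = 2):
    """Wrap a text-generation callable with validation and retry feedback.

    `generate(prompt)` returns a raw string; `validate(raw)` returns
    (ok, feedback). On failure, the feedback is appended to the prompt so
    the model can self-correct, up to `max_retries` extra attempts.
    """
    def guarded(prompt: str) -> str:
        attempt_prompt = prompt
        for _ in range(max_retries + 1):
            raw = generate(attempt_prompt)
            ok, feedback = validate(raw)
            if ok:
                return raw
            attempt_prompt = f"{prompt}\n\nPrevious attempt was invalid: {feedback}"
        raise ValueError("model failed validation after retries")
    return guarded

def validate(raw):
    """Accept only syntactically valid JSON tool calls."""
    try:
        json.loads(raw)
        return True, None
    except json.JSONDecodeError as exc:
        return False, str(exc)

# Fake backend: fails once with bad JSON, then succeeds on the retry.
attempts = iter(["{'bad': json", '{"tool": "respond", "text": "hi"}'])
guarded = with_guardrails(lambda prompt: next(attempts), validate)
print(guarded("Say hi as a tool call"))  # second attempt passes validation
```

The key design point is that the wrapper never mutates the backend: like a middleware layer, it only intercepts inputs and outputs, which is what makes non-intrusive adoption in existing projects possible.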


Section 05

Backend Support and Model Selection Recommendations

Supported backends: Ollama (easiest to set up, good for prototypes), llama-server (best performance, suited to production), Llamafile (zero-dependency deployment), and the Anthropic API (for cloud comparison). Model recommendation: at the 8B scale, the Mistral3 series (e.g., ministral-3:8b-instruct-2512-q4_K_M); quantized variants balance accuracy against VRAM usage.
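To see why quantization matters at the 8B scale, a back-of-envelope estimate of weight memory helps. This is a rough approximation of my own (q4_K_M averages roughly 4.5 bits per weight; KV cache and activation memory are excluded), not figures from the Forge project:

```python
def approx_weight_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough VRAM needed for model weights alone (no KV cache, no activations)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# An 8B model at ~4.5 bits/weight (typical of q4_K_M) vs. full fp16:
q4 = approx_weight_vram_gb(8, 4.5)    # ≈ 4.5 GB
fp16 = approx_weight_vram_gb(8, 16)   # ≈ 16 GB
```

Roughly a 3.5x reduction, which is what lets an 8B model plus a usable context window fit on a single consumer GPU.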


Section 06

Evaluation System and Performance Verification Results

The built-in evaluation system includes 31 test scenarios (18 with batch results), supporting single and batch evaluation runs and report generation. The data shows that without Forge, the multi-step tool call success rate of 8B models is about 38% (failure causes: format errors, missing steps, context loss); with full guardrails enabled it rises to about 99%, a better than 2.5x improvement.
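The headline arithmetic can be reproduced with a trivial harness. The numbers below simply mirror the reported 38% and 99% figures with synthetic pass/fail records; this is not a rerun of Forge's evaluation suite, and the `passed` result shape is assumed:

```python
def success_rate(results):
    """Fraction of scenario runs that passed."""
    passed = sum(1 for r in results if r["passed"])
    return passed / len(results)

# Synthetic records mirroring the reported figures:
baseline = [{"passed": i < 38} for i in range(100)]   # ~38% without guardrails
guarded = [{"passed": i < 99} for i in range(100)]    # ~99% with full guardrails

improvement = success_rate(guarded) / success_rate(baseline)  # ≈ 2.6x
```

99/38 ≈ 2.6, consistent with the "better than 2.5x" claim in the summary.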


Section 07

Practical Applications and Ecosystem Integration

Forge is compatible with the existing ecosystem: the proxy mode integrates seamlessly with OpenAI API clients such as the VS Code Continue plugin and the aider terminal tool. For long-running sessions, the recommendation is to filter out transient messages to improve context efficiency, which suits scenarios like CLI assistants, chat servers, and voice assistants.
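Transient-message filtering can be sketched as a simple pass over the chat history before it is resent to the model. The message shape, the `transient` flag, and the role names below are assumptions for illustration, not fields Forge is known to define:

```python
def filter_transient(history, transient_roles=("status", "progress")):
    """Drop ephemeral messages (progress pings, status updates) from history
    before sending it back to the model, keeping the context window lean."""
    return [
        m for m in history
        if m.get("role") not in transient_roles and not m.get("transient", False)
    ]

history = [
    {"role": "user", "content": "Summarize the logs"},
    {"role": "status", "content": "tool running..."},
    {"role": "assistant", "content": "Done: 3 errors found"},
    {"role": "assistant", "content": "typing...", "transient": True},
]
print(filter_transient(history))  # keeps only the user turn and the final answer
```

In a long session this pruning compounds: every ephemeral message dropped is context budget reclaimed for turns that actually carry state.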


Section 08

Summary and Future Outlook

Forge's pragmatic approach: unlock the potential of existing small models through engineering optimization rather than chasing larger parameter counts, providing reliable agent capability to resource-constrained developers, privacy-focused enterprises, and offline deployment scenarios. Outlook: expand to multi-modal models and more complex agent architectures, aiming to become infrastructure for the next generation of AI applications.