When to Trust Tools? An Adaptive Tool Trust Calibration Method for Tool-Integrated Mathematical Reasoning

This article introduces the ATTC framework, which uses code block confidence scores to guide models to adaptively trust or ignore tool results, mitigating the "tool neglect" problem in tool-integrated reasoning and improving performance by 4.1% to 7.5%.

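Tool-integrated reasoning, large language models, mathematical reasoning, confidence calibration, tool calling, adaptive learning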

Published 2026-04-09 22:14 · Recent activity 2026-04-10 10:46 · Estimated read 5 min

Section 01

[Overview] When to Trust Tools? The ATTC Framework Addresses the Tool Neglect Problem in Tool-Integrated Reasoning

This article addresses the "tool neglect" problem in Tool-Integrated Reasoning (TIR), where models often ignore correct tool results, and proposes the Adaptive Tool Trust Calibration (ATTC) framework. This framework uses code block confidence scores to guide models to adaptively choose to trust or ignore tool results, effectively alleviating the tool neglect phenomenon and achieving a performance improvement of 4.1% to 7.5% across multiple models and datasets.

Section 02

[Background] The Rise and Hidden Concerns of Tool-Integrated Reasoning: Models Don't Know When to Trust Tools

With the development of Large Reasoning Models (LRMs), Tool-Integrated Reasoning (TIR) has become an important paradigm for overcoming the limitations of purely parametric reasoning: models can call external tools (such as Python or SQL) to obtain accurate results. However, existing TIR models suffer from "tool neglect": when their own reasoning conflicts with a tool result, they often stick to their own conclusion and may even disregard a correct tool output. The cause is that training does not explicitly teach models to evaluate and integrate tool results, so tool integration becomes a superficial formality.

Section 03

[Method] The ATTC Framework: An Adaptive Trust Calibration Mechanism Based on Code Confidence

The core of the ATTC framework is a dynamic decision-making mechanism based on code block confidence:

Confidence Estimation Module: Calculates the confidence score of each generated code block, reflecting the model's degree of certainty in tool calls;
Dynamic Trust Decision: Adopts tool results when confidence is high, and relies on internal reasoning when confidence is low;
Calibration Learning Mechanism: Establishes a mapping between confidence and tool reliability through a dedicated training objective.

In implementation, ATTC modifies the loss function: it penalizes the behavior of ignoring correct tool results, strengthens correct trust decisions, and integrates into the existing TIR training process.
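
The first two components above can be illustrated with a minimal sketch. The article does not say how the code block confidence score is computed, so this example assumes one common choice: the geometric mean of the token probabilities of the generated code block. The function names, the `threshold` parameter, and its value of 0.8 are all hypothetical and are not taken from ATTC itself.

```python
import math
from typing import Sequence, Tuple


def code_block_confidence(token_logprobs: Sequence[float]) -> float:
    """Assumed confidence score: geometric mean of token probabilities.

    token_logprobs holds the natural-log probability the model assigned
    to each token of the generated code block. The result lies in (0, 1].
    """
    if not token_logprobs:
        raise ValueError("code block has no tokens")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def choose_answer(
    token_logprobs: Sequence[float],
    tool_result: str,
    internal_answer: str,
    threshold: float = 0.8,  # hypothetical cut-off, not from the article
) -> Tuple[str, str]:
    """Trust the tool when code confidence is high, else fall back to
    the model's own reasoning. Returns (answer, source)."""
    confidence = code_block_confidence(token_logprobs)
    if confidence >= threshold:
        return tool_result, "tool"
    return internal_answer, "internal"


# Usage: a confidently generated code block versus an uncertain one.
confident_block = [-0.01, -0.02, -0.03]
uncertain_block = [-1.0, -2.0, -1.5]
print(choose_answer(confident_block, tool_result="42", internal_answer="41"))
print(choose_answer(uncertain_block, tool_result="42", internal_answer="41"))
```

A fixed threshold is the simplest possible decision rule. The calibration learning mechanism described above suggests that ATTC learns the relationship between confidence and tool reliability during training rather than relying on a hand-set cut-off.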

Section 04

[Evidence] Experimental Verification: ATTC Significantly Alleviates Tool Neglect, with Performance Improvements of 4.1%-7.5%

Experimental verification shows that ATTC has significant effects:

Alleviates Tool Neglect: The cases where models ignore correct tool results are greatly reduced;
Performance Improvement: Performance increases by 4.1% to 7.5% across different model sizes and datasets;
Good Generalization: Stable improvements across model architectures and datasets.

In the case study, the baseline model called the tool but ignored the result, while after ATTC training, it could correctly trust the tool output and give accurate answers.
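
To make "ignoring correct tool results" measurable, one plausible metric is sketched below. The article does not define how tool neglect is counted, so the definition here is an assumption: among traces where the tool output matches the reference answer, the fraction whose final answer differs from the tool output. The `Trace` type and `tool_neglect_rate` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Trace:
    """One reasoning trace (hypothetical record layout)."""
    tool_output: str   # result returned by the executed code
    final_answer: str  # answer the model finally reported
    reference: str     # ground-truth answer


def tool_neglect_rate(traces: Iterable[Trace]) -> float:
    """Assumed metric: share of correct tool results the model overrode."""
    correct_tool = [t for t in traces if t.tool_output == t.reference]
    if not correct_tool:
        return 0.0
    neglected = sum(1 for t in correct_tool if t.final_answer != t.tool_output)
    return neglected / len(correct_tool)


# Usage: three traces with a correct tool result, one of which was ignored;
# the fourth trace is excluded because the tool itself was wrong.
traces = [
    Trace(tool_output="12", final_answer="12", reference="12"),
    Trace(tool_output="7", final_answer="9", reference="7"),
    Trace(tool_output="3", final_answer="3", reference="3"),
    Trace(tool_output="5", final_answer="5", reference="6"),
]
print(tool_neglect_rate(traces))
```

Restricting the denominator to traces with a correct tool result separates tool neglect from cases where distrusting the tool was the right call.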

Section 05

[Conclusion and Recommendations] Technical Insights and Future Directions of ATTC

ATTC offers several technical insights:

Metacognitive Ability: Tool integration requires cultivating models' metacognition to evaluate tool reliability;
Value of Confidence: Code confidence can be extended as a decision signal to other scenarios;
Adaptive Decision-Making: Dynamically adjusting behavior is more robust than fixed rules.

Future work can further explore multi-dimensional applications of confidence. In conclusion, ATTC offers a way to balance autonomous reasoning with external assistance and may inform subsequent research on tool-integrated reasoning.
