When to Trust Tools? An Adaptive Tool Trust Calibration Method for Tool-Integrated Mathematical Reasoning

This article introduces the ATTC framework, which uses code block confidence scores to guide models in adaptively deciding whether to trust or ignore tool results. It mitigates the "tool neglect" problem in tool-integrated reasoning and improves performance by 4.1% to 7.5%.

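Tags: Tool-Integrated Reasoning, Large Language Models, Mathematical Reasoning, Confidence Calibration, Tool Calling, Adaptive Learning
Published 2026-04-09 22:14 | Recent activity 2026-04-10 10:46 | Estimated read 5 min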

Section 01

[Original Post] When to Trust Tools? The ATTC Framework Addresses the Tool Neglect Problem in Tool-Integrated Reasoning

This article addresses the "tool neglect" problem in Tool-Integrated Reasoning (TIR), where models often ignore correct tool results, and proposes the Adaptive Tool Trust Calibration (ATTC) framework. This framework uses code block confidence scores to guide models to adaptively choose to trust or ignore tool results, effectively alleviating the tool neglect phenomenon and achieving a performance improvement of 4.1% to 7.5% across multiple models and datasets.

Section 02

[Background] The Rise and Hidden Concerns of Tool-Integrated Reasoning: Models Don't Know When to Trust Tools

With the development of Large Reasoning Models (LRMs), Tool-Integrated Reasoning (TIR) has become an important paradigm for overcoming the limits of purely parametric reasoning: the model calls external tools (such as Python or SQL) to obtain accurate results. However, existing TIR models suffer from "tool neglect": when their own reasoning conflicts with a tool result, they often stick to their own conclusion and may even actively ignore a correct tool output. According to the article, this happens because training does not explicitly teach models to evaluate and integrate tool results, so tool integration becomes a superficial formality.

Section 03

[Method] The ATTC Framework: An Adaptive Trust Calibration Mechanism Based on Code Confidence

The core of the ATTC framework is a dynamic decision-making mechanism based on code block confidence:

  1. Confidence Estimation Module: Calculates the confidence score of each generated code block, reflecting the model's degree of certainty in tool calls;
  2. Dynamic Trust Decision: Adopts tool results when confidence is high, and relies on internal reasoning when confidence is low;
  3. Calibration Learning Mechanism: Establishes a mapping between confidence and tool reliability through a dedicated training objective.

In implementation, ATTC modifies the loss function to penalize ignoring correct tool results and to reinforce correct trust decisions, and it integrates into the existing TIR training process.
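
The first two steps above can be sketched as follows. This is only an illustration under stated assumptions, not the paper's implementation: the summary does not say how the confidence score is computed, so the sketch assumes it is the geometric mean of the token probabilities of the generated code block. The names `code_block_confidence`, `decide`, `TrustDecision`, and the `threshold` value of 0.8 are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Sequence


def code_block_confidence(token_logprobs: Sequence[float]) -> float:
    """Assumed scoring rule: geometric mean of token probabilities.

    token_logprobs holds the natural-log probability of each token
    in one generated code block.
    """
    if not token_logprobs:
        raise ValueError("code block has no tokens")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


@dataclass
class TrustDecision:
    confidence: float
    trust_tool: bool
    answer: str


def decide(
    token_logprobs: Sequence[float],
    tool_result: str,
    internal_answer: str,
    threshold: float = 0.8,  # hypothetical cut-off, not taken from the paper
) -> TrustDecision:
    """Adopt the tool result when confidence is high, else fall back
    to the model's internal reasoning."""
    confidence = code_block_confidence(token_logprobs)
    trust_tool = confidence >= threshold
    answer = tool_result if trust_tool else internal_answer
    return TrustDecision(confidence, trust_tool, answer)


# Usage: a confidently generated block versus an uncertain one.
confident = decide([math.log(0.9)] * 4, tool_result="42", internal_answer="41")
uncertain = decide([math.log(0.5)] * 4, tool_result="42", internal_answer="41")
print(confident.trust_tool, confident.answer)
print(uncertain.trust_tool, uncertain.answer)
```

A fixed threshold is the simplest possible decision rule. The calibration step described in item 3 suggests that ATTC instead learns how confidence maps to tool reliability during training.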

Section 04

[Evidence] Experimental Verification: ATTC Significantly Alleviates Tool Neglect, with Performance Improvements of 4.1%-7.5%

The experiments show three main effects of ATTC:

  • Alleviates Tool Neglect: The cases where models ignore correct tool results are greatly reduced;
  • Performance Improvement: Performance increases by 4.1% to 7.5% across different model sizes and datasets;
  • Good Generalization: Improvements are stable across model architectures and datasets.

In the case study, the baseline model called the tool but then ignored its result; after ATTC training, the model trusted the tool output and gave the correct answer.
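
One way to quantify the tool-neglect reduction described above is a simple rate over evaluation episodes. The sketch below is an assumption, not the metric used in the paper: it counts an episode as "neglect" when the tool output matched the reference answer but the model's final answer differed from it. `Episode` and `tool_neglect_rate` are hypothetical names, and exact string matching stands in for a real answer checker.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Episode:
    tool_output: str   # result returned by the tool
    final_answer: str  # answer the model finally gave
    reference: str     # ground-truth answer


def tool_neglect_rate(episodes: List[Episode]) -> Optional[float]:
    """Share of episodes with a correct tool output that the model ignored.

    Returns None when no episode had a correct tool output, since the
    rate is undefined in that case.
    """
    tool_correct = [
        e for e in episodes if e.tool_output.strip() == e.reference.strip()
    ]
    if not tool_correct:
        return None
    neglected = sum(
        1 for e in tool_correct
        if e.final_answer.strip() != e.tool_output.strip()
    )
    return neglected / len(tool_correct)


# Usage with made-up episodes (illustrative data only).
episodes = [
    Episode(tool_output="42", final_answer="42", reference="42"),  # trusted
    Episode(tool_output="42", final_answer="41", reference="42"),  # neglected
    Episode(tool_output="7", final_answer="8", reference="8"),     # tool wrong, excluded
    Episode(tool_output="10", final_answer="10", reference="10"),  # trusted
]
rate = tool_neglect_rate(episodes)
print(rate)
```

Comparing this rate for a baseline model and an ATTC-trained model on the same episodes would show whether neglect actually drops, separately from overall accuracy.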

Section 05

[Conclusion and Recommendations] Technical Insights and Future Directions of ATTC

ATTC offers several technical insights:

  • Metacognitive Ability: Tool integration requires cultivating models' metacognition to evaluate tool reliability;
  • Value of Confidence: Code confidence can be extended as a decision signal to other scenarios;
  • Adaptive Decision-Making: Dynamically adjusting behavior is more robust than fixed rules.

Future work can explore broader, multi-dimensional applications of confidence. The article concludes that ATTC offers a way to balance autonomous reasoning with external assistance and may guide subsequent research on tool-integrated reasoning.