
InteractComp: A Systematic Evaluation Framework for Interactive Reasoning Capabilities of Large Language Models

This article introduces InteractComp, an evaluation framework specifically designed to assess the interactive reasoning capabilities of large language models. It supports multiple interaction modes and includes a built-in ReAct-style agent, providing a standardized tool for the systematic analysis of model decision-making abilities.

Tags: LLM evaluation, interactive reasoning, ReAct agent, asynchronous evaluation, tool use, multi-turn dialogue, decision-making, benchmarking, AI framework
Published 2026-05-04 05:15 · Recent activity 2026-05-04 05:50 · Estimated read: 8 min

Section 01

Introduction: InteractComp—A Systematic Evaluation Framework for Interactive Reasoning Capabilities of Large Language Models

This article introduces InteractComp, an evaluation framework designed specifically to assess the interactive reasoning capabilities of large language models. It supports multiple interaction modes, includes a built-in ReAct-style agent, and provides an asynchronous evaluation pipeline, offering a standardized tool for systematically analyzing model decision-making. It fills a gap left by traditional single-turn question-answering benchmarks, which cannot evaluate interactive reasoning.


Section 02

Background: Interactive Reasoning—A New Dimension of Large Model Capabilities

Large language models now perform at or above human level on static question-answering tasks, but real-world problems often take multiple turns of interaction to solve. Interactive reasoning requires a model to actively search when information is insufficient, ask clarifying questions when the request is ambiguous, and adjust its strategy dynamically; these are exactly the abilities that traditional single-turn question-answering benchmarks struggle to evaluate. InteractComp was created to fill this gap.


Section 03

Methodology: Core Design of the InteractComp Framework

ReAct-style Agent

The framework includes a reusable ReAct agent that interleaves reasoning (Thought) and action (Action) steps, explicitly emitting its thinking process alongside each action instruction so that evaluators can trace the decision-making logic.
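As a rough illustration, the loop below parses one Thought/Action pair out of a model turn; the output format and function names are hypothetical stand-ins, not InteractComp's actual API.

```python
import re

# One "Thought: ... Action: name(arg)" block, as a ReAct-style agent emits it.
THOUGHT_ACTION_RE = re.compile(
    r"Thought:\s*(?P<thought>.*?)\s*Action:\s*(?P<action>\w+)\((?P<arg>.*)\)",
    re.DOTALL,
)

def parse_react_turn(llm_output: str) -> tuple[str, str, str]:
    """Split a model turn into its explicit thought and action instruction."""
    m = THOUGHT_ACTION_RE.search(llm_output)
    if m is None:
        raise ValueError("model output did not follow the Thought/Action format")
    return m["thought"], m["action"], m["arg"]

thought, action, arg = parse_react_turn(
    "Thought: The question is ambiguous; I should ask the user.\n"
    "Action: ask(Which year do you mean?)"
)
# `thought` is logged as the decision trace; `action`/`arg` are dispatched
# to the matching tool (search / ask / answer).
```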

Multi-action Support

Covers six interaction modes: answer-only, search-only, question-only, full mode, full mode with context, and forced-question mode. This fine-grained control lets evaluators isolate and measure specific capabilities.
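One plausible way to encode this gating is as per-mode action whitelists; the mode names below mirror the prose and are illustrative, not the framework's real identifiers.

```python
# Per-mode action whitelists (illustrative names, not the real identifiers).
MODES: dict[str, set[str]] = {
    "answer_only":       {"answer"},
    "search_only":       {"search"},
    "ask_only":          {"ask"},
    "full":              {"answer", "search", "ask"},
    "full_with_context": {"answer", "search", "ask"},  # context pre-injected
    "forced_ask":        {"ask", "answer"},  # must ask first (stateful check not shown)
}

def allowed(mode: str, action: str) -> bool:
    """Fine-grained gating: reject any action outside the selected mode."""
    return action in MODES[mode]

assert allowed("full", "search") and not allowed("answer_only", "search")
```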

Asynchronous Evaluation Pipeline

Built on asyncio, the asynchronous orchestration system evaluates multiple models concurrently, substantially cutting the wall-clock time otherwise lost to API-call bottlenecks and improving experimental throughput.
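The sketch below shows the concurrency pattern implied here, with a hypothetical `evaluate` coroutine standing in for a real rate-limited API call.

```python
import asyncio

async def evaluate(model: str, task: str) -> dict:
    """Stand-in for one model/task evaluation behind a slow API."""
    await asyncio.sleep(0.1)  # simulated network latency
    return {"model": model, "task": task, "correct": True}

async def run_all(models: list[str], tasks: list[str], limit: int = 8) -> list[dict]:
    sem = asyncio.Semaphore(limit)  # cap concurrent in-flight API calls

    async def bounded(model: str, task: str) -> dict:
        async with sem:
            return await evaluate(model, task)

    # Fan out every (model, task) pair; gather overlaps the API latency
    # instead of paying for it serially.
    return await asyncio.gather(*(bounded(m, t) for m in models for t in tasks))

results = asyncio.run(run_all(["gpt-4", "claude"], ["q1", "q2"]))
```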


Section 04

Application Scenarios: Typical Use Cases of InteractComp

  • Model Capability Diagnosis: Compare performance across action modes to identify capability gaps; for instance, strong answer-only performance paired with poor search-mode performance indicates weak use of external information (see the sketch after this list).
  • Interactive Strategy Optimization: Test different strategies (e.g., search first then ask questions) to find the decision-making process suitable for the scenario.
  • Multi-model Comparison: The standardized interface supports comparing the performance of models like GPT-4 and Claude on the same tasks, generating reproducible reports.
  • Prompt Engineering Validation: Quantify the impact of different prompt designs on interactive reasoning effects for systematic optimization.
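A capability-gap check of the first kind might look like the toy comparison below; the scores and the gap threshold are invented purely for illustration.

```python
# Toy diagnosis: per-mode accuracy for one model (numbers are made up).
scores = {"answer_only": 0.82, "search_only": 0.41, "ask_only": 0.69, "full": 0.78}

baseline = scores["answer_only"]
for mode, acc in scores.items():
    if baseline - acc > 0.2:  # arbitrary gap threshold for the example
        print(f"{mode}: {acc:.2f} trails answer_only ({baseline:.2f}); "
              "suggests weak use of that interaction channel")
```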

Section 05

Technical Implementation: Modular Design and Core Components

The framework adopts a modular design, with core components including:

  • Action Executor: Calls search APIs, handles user input, and other external interactions.
  • State Manager: Maintains context information such as conversation history and intermediate results.
  • Evaluator: Judges whether outputs are correct based on task definitions.
  • Metric Calculator: Aggregates metrics such as accuracy, number of interaction turns, and search frequency.

The modular design makes it straightforward to add new actions or integrate new models.
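Read as interfaces, the four components might compose roughly as follows; these Protocols are a sketch of the described responsibilities, not the actual class definitions.

```python
from typing import Any, Protocol

class ActionExecutor(Protocol):
    async def execute(self, action: str, arg: str) -> str: ...  # search APIs, user input

class StateManager(Protocol):
    def record(self, role: str, content: str) -> None: ...  # history, intermediate results
    def context(self) -> list[dict[str, Any]]: ...

class Evaluator(Protocol):
    def is_correct(self, answer: str, expected: str) -> bool: ...  # task-defined check

class MetricCalculator(Protocol):
    def aggregate(self, logs: list[dict[str, Any]]) -> dict[str, float]: ...

# Adding a new action means implementing ActionExecutor for it; swapping
# in a new model leaves the other three components untouched.
```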

Section 06

Usage: Concise API and Multi-dimensional Evaluation Reports

Usage Steps: Define the evaluation task (initial question, expected answer, available tools) → Configure the model under test and the evaluation mode → Start the evaluation run. The framework records interaction logs automatically.
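Concretely, the three steps could look like this; the task/config schema and the runner class are assumptions made for illustration, not the published API.

```python
# Step 1: define the evaluation task.
task = {
    "question": "Which 2019 paper introduced method X?",  # initial question
    "expected": "Paper Y",                                # expected answer
    "tools": ["search", "ask"],                           # available tools
}

# Step 2: configure the model under test and the evaluation mode.
config = {"model": "gpt-4", "mode": "full", "max_turns": 10}

# Step 3: start the evaluation (hypothetical runner; the framework
# records interaction logs automatically).
# runner = InteractCompRunner(config)
# report = asyncio.run(runner.run([task]))
```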

Evaluation Report Dimensions:

  • Success Rate: The proportion of correctly solved problems
  • Average Number of Interaction Turns: Reflects decision-making efficiency
  • Tool Usage Distribution: Frequency of actions like search and question
  • Error Type Analysis: Classifies failures as insufficient knowledge, reasoning errors, tool misuse, and so on
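Aggregating those four dimensions from per-task logs is straightforward; the log schema below is assumed for the sake of the example.

```python
from collections import Counter

# Assumed per-task log schema: correctness, turn count, action trace, error tag.
logs = [
    {"correct": True,  "turns": 3, "actions": ["search", "answer"], "error": None},
    {"correct": False, "turns": 5, "actions": ["ask", "ask", "answer"], "error": "reasoning"},
]

report = {
    "success_rate": sum(l["correct"] for l in logs) / len(logs),     # proportion solved
    "avg_turns":    sum(l["turns"] for l in logs) / len(logs),       # decision efficiency
    "tool_usage":   Counter(a for l in logs for a in l["actions"]),  # action frequency
    "error_types":  Counter(l["error"] for l in logs if l["error"]), # failure classes
}
print(report)
```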

Section 07

Contributions and Future: Directions for Improving Large Model Evaluation Systems

Contributions to Evaluation Systems

Existing benchmarks (e.g., MMLU, HumanEval) focus on static knowledge and single-turn reasoning. InteractComp fills the gap in evaluating multi-turn interaction and tool usage capabilities. Its open-source release provides a standardized tool for academia and industry, helping to build a more comprehensive evaluation system.

Future Development Directions

  • Multi-agent Interaction: Evaluate performance in collaborative scenarios
  • Long-term Task Planning: Test the ability to plan long-cycle tasks
  • User Simulation: Use large models to simulate real users and test interaction naturalness
  • Adversarial Evaluation: Design ambiguous tasks to test robustness

InteractComp provides the basic architecture for these extensions.