Zing Forum

Can LLM Agents Infer World Models? Evidence from Agentic Automata Learning

The research team proposes an agentic automata learning framework to evaluate the ability of tool-using LLM agents to discover hidden environments through interaction. Experiments show that while current LLM agents can perform non-trivial interactive discovery, they have systematic flaws in query planning, evidence integration, and hypothesis construction, and are far less effective than classical algorithms.

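Tags: world model inference, agentic automata learning, LLM agents, deterministic finite automata, interactive discovery, query planning, evidence integration, hypothesis construction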
Published 2026-06-15 19:23 | Recent activity 2026-06-16 11:03 | Estimated read 13 min

Section 01

Introduction: Evaluating the World Model Inference Ability of LLM Agents—Evidence from Agentic Automata Learning

Original Author/Team: Agent Reasoning and Automata Theory Research Team
Source Platform: arXiv
Original Title: Can LLM Agents Infer World Models? Evidence from Agentic Automata Learning
Publication Date: 2026-06-15
Original Link: http://arxiv.org/abs/2606.16576v1

Core Insights: The research team proposes an agentic automata learning framework that uses deterministic finite automata (DFAs) as hidden environments to evaluate whether LLM agents can infer an environment's structure through interaction. Experiments show that while current LLM agents can perform non-trivial interactive discovery, they have systematic flaws in query planning, evidence integration, and hypothesis construction, and their performance is far inferior to classical automata learning algorithms (e.g., the L* algorithm).


Section 02

Research Background: Core Issues of World Model Inference and Limitations of Existing Evaluations

Concept of World Model

In cognitive science and AI, a "world model" refers to an agent's representation of the environment's internal operating mechanisms. Agents with accurate world models can achieve four key abilities: prediction, planning, generalization, and explanation. For tool-using LLM agents, this ability is crucial for effectively utilizing external tools.

Limitations of Existing Evaluations

Current LLM evaluations mostly focus on task completion rates and have three major shortcomings:

  1. Result-oriented: Only focuses on final task completion, ignoring environmental understanding during the process;
  2. Superficial behavior: May complete tasks through pattern matching or memory, lacking deep understanding;
  3. Weak generalization: Performs well on specific tasks but drops sharply in similar environments.

Thus, a more rigorous framework is needed to directly test world model inference ability.

Section 03

Evaluation Method: Detailed Explanation of the Agentic Automata Learning Framework

Core Idea

The framework is inspired by classical automata learning theory: the agent must infer the structure of an unknown deterministic finite automaton (DFA) through interaction, and is evaluated on learning efficiency and inference accuracy.

Reasons for Choosing DFA

  1. Interpretability: Clear structure, easy to verify the agent's level of understanding;
  2. Controllable complexity: Adjust task difficulty via the number of states/transitions;
  3. Strong baselines: Classical algorithms such as L* provide a well-understood reference point;
  4. Wide applicability: Can represent various real-world system behavior patterns.

Interaction Protocol

  • Membership Query: The agent asks whether a string belongs to the target language; the Oracle answers "yes" or "no", which tells the agent whether the DFA accepts or rejects that string;
  • Equivalence Query: The agent submits a hypothesis DFA; the Oracle checks it against the target and, if the hypothesis is wrong, returns a counterexample that the agent can use to correct it.
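
The two query types can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `DFA` and `Oracle` classes, their method names, and the "even number of a's" target are all assumptions made for the example. The equivalence check here walks the product of the two automata breadth-first, so the counterexample it returns is a shortest one.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class DFA:
    alphabet: str   # symbols as single characters, e.g. "ab"
    start: int
    accepting: set
    delta: dict     # (state, symbol) -> next state

    def accepts(self, word: str) -> bool:
        state = self.start
        for symbol in word:
            state = self.delta[(state, symbol)]
        return state in self.accepting


class Oracle:
    """Hypothetical oracle that hides a target DFA and counts queries."""

    def __init__(self, target: DFA):
        self.target = target
        self.membership_count = 0
        self.equivalence_count = 0

    def membership_query(self, word: str) -> bool:
        self.membership_count += 1
        return self.target.accepts(word)

    def equivalence_query(self, hypothesis: DFA):
        """Return None if equivalent, otherwise a shortest counterexample."""
        self.equivalence_count += 1
        t, h = self.target, hypothesis
        start = (t.start, h.start)
        seen = {start}
        queue = deque([(start, "")])
        while queue:
            (ts, hs), word = queue.popleft()
            # The two automata disagree on `word` if exactly one accepts it.
            if (ts in t.accepting) != (hs in h.accepting):
                return word
            for a in t.alphabet:
                nxt = (t.delta[(ts, a)], h.delta[(hs, a)])
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, word + a))
        return None


# Assumed target: strings over {a, b} with an even number of a's.
even_a = DFA("ab", 0, {0}, {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1})
# A deliberately wrong hypothesis that accepts every string.
accept_all = DFA("ab", 0, {0}, {(0, "a"): 0, (0, "b"): 0})

oracle = Oracle(even_a)
print(oracle.membership_query("abab"))        # True
print(oracle.equivalence_query(accept_all))   # a
print(oracle.equivalence_query(even_a))       # None
```

Counting queries inside the oracle keeps the bookkeeping out of the agent, which makes it easier to compare an LLM agent and a classical learner on the same budget.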

Evaluation Dimensions

  1. Query efficiency: Number of queries needed to learn the DFA;
  2. Hypothesis quality: Similarity between the inferred DFA and the real DFA;
  3. Learning success rate: Proportion of successful learning within a given budget;
  4. Interaction strategy: Effectiveness of query planning, evidence integration, and hypothesis construction.
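
As a rough illustration of how the second and third dimensions could be computed, the sketch below measures hypothesis quality as the fraction of strings up to a fixed length on which the hypothesis and the target agree, and success rate as the share of runs that learned the DFA exactly within a query budget. Both definitions, the dictionary encoding of a DFA, and the function names are assumptions for this example; the paper may define its metrics differently.

```python
from itertools import product


def accepts(dfa: dict, word: str) -> bool:
    """Run `word` through a DFA given as {"start", "accepting", "delta"}."""
    state = dfa["start"]
    for symbol in word:
        state = dfa["delta"][(state, symbol)]
    return state in dfa["accepting"]


def agreement(target: dict, hypothesis: dict, alphabet: str, max_len: int) -> float:
    """Fraction of strings of length <= max_len classified identically."""
    total = agree = 0
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            word = "".join(letters)
            total += 1
            agree += accepts(target, word) == accepts(hypothesis, word)
    return agree / total


def success_rate(runs: list, budget: int) -> float:
    """runs holds (queries_used, learned_exactly) pairs, one per trial."""
    ok = sum(1 for used, exact in runs if exact and used <= budget)
    return ok / len(runs)


# Assumed target: even number of a's. Assumed hypothesis: accept everything.
target = {"start": 0, "accepting": {0},
          "delta": {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}}
hypothesis = {"start": 0, "accepting": {0},
              "delta": {(0, "a"): 0, (0, "b"): 0}}

# Of the 7 strings of length <= 2, the two DFAs agree on "", "b", "aa", "bb".
print(agreement(target, hypothesis, "ab", 2))
print(success_rate([(10, True), (50, True), (20, False)], budget=30))
```

Exhaustive enumeration grows exponentially with `max_len`, so for larger DFAs a sampled estimate would be the more practical choice.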

Section 04

Experimental Evidence: Performance of LLM Agents

Experimental Setup

  • Test models: Reasoning models (e.g., OpenAI o1/o3), non-reasoning models (e.g., GPT-4, Claude);
  • DFA complexity: Small (3-5 states), medium (6-10 states), large (11-15 states);
  • Comparison baselines: Classical L* algorithm (which comes with theoretical guarantees on exact learning), random strategy.

Core Findings

  1. Performance decreases with complexity: Most models can successfully learn small DFAs; success rates drop significantly for medium DFAs; large DFAs are almost impossible to complete;
  2. Advantage of reasoning models: Reasoning models outperform non-reasoning models in success rate, query efficiency, and hypothesis quality on complex DFAs, which suggests that explicit reasoning ability matters for world model inference.

Section 05

Failure Mode Analysis: Flaws in Query Planning, Evidence Integration, and Hypothesis Construction

Query Planning Failure

  • Problem manifestations: Repeating queries for functionally equivalent strings, choosing non-informative queries, lacking systematic strategies (e.g., binary search);
  • Comparison with classical algorithms: The L* algorithm chooses its queries systematically, filling an observation table of prefixes and distinguishing suffixes, so it explores the state space without redundant queries.
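
To make the contrast concrete, here is a small sketch of the bookkeeping that L*-style learners rely on: answers are cached so no string is queried twice, and two prefixes are treated as reaching different states only when some suffix experiment gives different answers. This shows just the row-comparison step, not the full L* algorithm, and the helper names and the "even number of a's" target are assumptions for the example.

```python
def make_cached_query(accepts):
    """Wrap a membership function so each distinct string is asked only once."""
    cache = {}

    def query(word: str) -> bool:
        if word not in cache:
            cache[word] = accepts(word)
        return cache[word]

    return query, cache


def row(prefix: str, suffixes: list, query) -> tuple:
    """Signature of a prefix: membership answers for prefix + each suffix."""
    return tuple(query(prefix + s) for s in suffixes)


def group_by_row(prefixes: list, suffixes: list, query) -> dict:
    """Prefixes with identical rows are indistinguishable so far."""
    groups = {}
    for p in prefixes:
        groups.setdefault(row(p, suffixes, query), []).append(p)
    return groups


# Assumed hidden language: strings with an even number of a's.
query, cache = make_cached_query(lambda w: w.count("a") % 2 == 0)

groups = group_by_row(["", "a", "b", "aa", "ab"], [""], query)
for signature, members in groups.items():
    print(signature, members)
# (True,) ['', 'b', 'aa']
# (False,) ['a', 'ab']

print(len(cache))  # 5 distinct strings were queried
```

An agent that re-asks about "b" or "aa" after seeing that they share a row with the empty prefix is spending budget on strings that, given the current suffixes, cannot reveal a new state.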

Evidence Integration Failure

  • Problem manifestations: Forgetting early query results, inability to reconcile conflicting evidence, over/under generalization;
  • Case example: The agent knows that "ab" is accepted and "aba" is rejected, yet proposes a hypothesis that does not separate the two strings.
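
A mechanical check for this kind of failure is straightforward: before submitting a hypothesis, replay every recorded membership answer against it. The sketch below assumes a dictionary encoding of the DFA and an evidence log mapping strings to observed answers; both hypotheses are invented for illustration.

```python
def accepts(dfa: dict, word: str) -> bool:
    state = dfa["start"]
    for symbol in word:
        state = dfa["delta"][(state, symbol)]
    return state in dfa["accepting"]


def contradictions(hypothesis: dict, evidence: dict) -> list:
    """Return the logged strings on which the hypothesis disagrees with the Oracle."""
    return [word for word, observed in evidence.items()
            if accepts(hypothesis, word) != observed]


# Evidence from the case above: "ab" accepted, "aba" rejected.
evidence = {"ab": True, "aba": False}

# Assumed hypothesis 1: accept any string that contains a "b".
contains_b = {"start": 0, "accepting": {1},
              "delta": {(0, "a"): 0, (0, "b"): 1, (1, "a"): 1, (1, "b"): 1}}

# Assumed hypothesis 2: accept any string that ends with "b".
ends_with_b = {"start": 0, "accepting": {1},
               "delta": {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}}

print(contradictions(contains_b, evidence))   # ['aba']
print(contradictions(ends_with_b, evidence))  # []
```

A hypothesis that passes this check is only consistent with the evidence seen so far; it can still be wrong, which is what equivalence queries are for.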

Hypothesis Construction Failure

  • Problem manifestations: Premature convergence when evidence is insufficient, completely discarding hypotheses after receiving counterexamples, incorrect DFA structure (missing states/transitions);
  • Cognitive biases: Similar to human confirmation bias, anchoring effect, and availability heuristic.

Section 06

Comparison with Classical Algorithms: Gaps and Advantages of LLM Agents

Query Efficiency

  • L* algorithm: The number of queries is polynomial in the DFA size;
  • LLM agents: The number of queries grows exponentially with DFA size.

Success Rate

  • L* algorithm: Always succeeds in learning under theoretical guarantees;
  • LLM agents: Success rate drops significantly on complex DFAs.

Robustness

  • L* algorithm: Deterministic and insensitive to initial conditions (the standard algorithm assumes a noise-free Oracle);
  • LLM agents: Performance fluctuates greatly, significantly affected by prompts and randomness.

Interpretability Advantage

LLM agents can explain their query strategies, reasoning processes, and hypotheses in natural language, which helps diagnose failure causes; classical algorithms, in contrast, do not produce this kind of natural-language rationale.


Section 07

Research Significance and Improvement Directions

Ability Boundaries

  • Can perform non-trivial interactive discovery, but the ability to infer complex environments is far inferior to specialized algorithms;
  • Explicit reasoning ability is key to improving world model inference.

Improvement Directions

  1. Query strategy optimization: Learn efficient query planning;
  2. Memory enhancement: Improve evidence integration mechanisms to avoid information loss;
  3. Hypothesis management: Optimize hypothesis generation and correction;
  4. Metacognitive ability: Evaluate one's own uncertainty and obtain more information as needed.

Implications for Evaluation Paradigms

  • Go beyond task completion rates and focus on the depth of environmental understanding;
  • Use synthetic environments with controllable complexity for systematic evaluation;
  • Compare with theoretically optimal algorithms to clarify gaps.

Section 08

Limitations and Future Work

Current Limitations

  1. Simplified environment: DFAs are far simpler than real-world environments;
  2. Oracle assumption: Assumes the Oracle is always correct, but noise may exist in reality;
  3. Single task: Only evaluates automata learning, not other types of world model inference.

Future Work

  1. Complex environment expansion: Probabilistic automata, partially observable environments, etc.;
  2. Real-world scenario applications: API learning, database schema inference, etc.;
  3. Algorithm fusion: Combine the generality of LLMs with the rigor of classical algorithms;
  4. Theoretical analysis: Establish theoretical bounds for the world model inference ability of LLMs.

Conclusion

The agentic automata learning framework provides a rigorous tool for evaluating the world model inference ability of LLMs. While current LLMs have potential, they still need improvement in deep structural understanding and systematic reasoning. In the future, we need to combine the advantages of LLMs with classical algorithms to develop stronger world model inference capabilities.