Zing Forum

How Narrative Temporal Structure Affects the Causal Reasoning Ability of Large Language Models

A comparative study explores how the temporal presentation of narratives (sequential vs. non-sequential) affects causal understanding in humans and LLMs, and has open-sourced the complete code for computational and behavioral experiments.

Causal reasoning · Large language models · Narrative structure · Temporal order · Cognitive evaluation · Open-source research
Published 2026-04-27 21:36 · Recent activity 2026-04-27 21:49 · Estimated read 7 min

Section 01

Introduction / Main Post: How Narrative Temporal Structure Affects the Causal Reasoning Ability of Large Language Models

A comparative study explores how the temporal presentation of narratives (sequential vs. non-sequential) affects causal understanding in humans and LLMs, and has open-sourced the complete code for computational and behavioral experiments.

Section 02

Research Background and Core Questions

Causal reasoning is one of the core abilities of human cognition and an important measure of the intelligence of large language models (LLMs). However, real-world information rarely arrives in neat causal order: news reports may describe the outcome before tracing its cause, courtroom witnesses often recall events out of sequence, and novels and films frequently use non-linear narrative techniques such as flashbacks and intercutting.

This raises a key question: when information is presented out of temporal order, can LLMs still accurately understand the causal relationships between events, and how does their performance compare to that of humans? A newly open-sourced research project, "LLM-Causal-Reasoning", attempts to answer these questions through rigorous experimental design.

Section 03

Experimental Design: A Dual-Path Exploration

The project adopts a dual experimental design that approaches the question from two directions, computational models and human subjects, using identical narrative materials for direct comparison.

Section 04

Three Carefully Designed Narrative Scenarios

The research team constructed three causal chain scenarios in different domains, each containing 8 interrelated events:

Medical Scenario: A ventilator failure in a hospital ward, involving causal factors such as insufficient staffing, aging equipment, and delayed maintenance

Workplace Scenario: A system outage caused by a failed server configuration change, illustrating the complex causal network linking technical decisions, communication errors, and emergency response

Coastal Scenario: A flood disaster during floodgate construction, combining elements such as project scheduling, weather conditions, and risk management

Section 05

Two Narrative Presentation Methods

Each scenario has two versions:

  • Linear Version: Presents events in chronological order, matching the natural causal reading experience
  • Non-linear Version: Presents events in a shuffled chronological order, simulating the out-of-order information reception common in real life

In addition, the coastal scenario has a specially designed "high-noise version" that adds a large amount of irrelevant filler text, testing how robust both models and humans are to distracting information.
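
As a concrete illustration, here is a minimal sketch of how the two presentation orders could be generated from an event list, assuming the events are stored in true chronological order; the seeded shuffle and function name are assumptions for illustration, not the project's documented procedure.

```python
import random

def make_versions(events, seed=0):
    """Return (linear, non_linear) presentations of a scenario.

    `events` is assumed to be the list of event texts in true
    chronological order. Assumes at least two distinct events
    (each scenario in the study has eight).
    """
    linear = list(events)
    non_linear = list(events)
    rng = random.Random(seed)  # seeded for reproducibility
    while non_linear == linear:  # ensure the order actually changes
        rng.shuffle(non_linear)
    return linear, non_linear
```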

Section 06

Computational Experiment: Causal Graph Construction by LLMs

In the computational experiment, researchers had large language models receive story fragments incrementally and progressively construct a causal event graph, simulating how humans integrate information while reading long texts.
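
A minimal sketch of what such a causal event graph might look like as a data structure, assuming edges carry the "time_to_next" labels described in the evaluation section; the class and field names are illustrative, not taken from the project's code.

```python
from dataclasses import dataclass, field

@dataclass
class CausalGraph:
    """Event-level causal graph: nodes are event descriptions,
    edges are (cause, effect) pairs labelled with a time scale."""
    events: set[str] = field(default_factory=set)
    # (cause, effect) -> time_to_next label, e.g. "immediate", "short-term"
    edges: dict[tuple[str, str], str] = field(default_factory=dict)

    def add_edge(self, cause: str, effect: str, time_to_next: str) -> None:
        self.events.update((cause, effect))
        self.edges[(cause, effect)] = time_to_next
```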

Section 07

Incremental Construction and Revision Mechanism

The core process of the experiment is divided into two stages:

  1. Incremental Construction Stage: The model receives story fragments one at a time and updates its internal causal graph representation after each one. This requires the model not only to understand the current fragment but also to integrate it with the information received so far.

  2. Revision Stage: Once all fragments have been received, the model is given a chance to review and revise its graph. The design of this stage is inspired by human reading comprehension: we often adjust our earlier understanding after finishing a story.
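
Building on the CausalGraph sketch above, here is a hedged sketch of the two-stage protocol; `query_llm` is a hypothetical wrapper around any chat-model API, and the prompt wording and JSON schema are assumptions for illustration only.

```python
import json

def edges_to_json(graph):
    """Serialize a CausalGraph's edges for inclusion in a prompt."""
    return json.dumps([{"cause": c, "effect": e, "time_to_next": t}
                       for (c, e), t in graph.edges.items()])

def graph_from_json(text):
    """Parse a JSON edge list back into a CausalGraph."""
    graph = CausalGraph()
    for edge in json.loads(text):
        graph.add_edge(edge["cause"], edge["effect"], edge["time_to_next"])
    return graph

def build_graph(fragments, query_llm):
    """Two-stage protocol sketch: incremental construction, then revision.

    `query_llm(prompt) -> str` is assumed to return a JSON edge list.
    """
    graph, seen = CausalGraph(), []
    # Stage 1: incremental construction, one fragment at a time
    for fragment in fragments:
        seen.append(fragment)
        prompt = ("Story so far:\n" + "\n".join(seen)
                  + "\n\nCurrent causal edges: " + edges_to_json(graph)
                  + "\n\nReturn the updated edge list as JSON.")
        graph = graph_from_json(query_llm(prompt))
    # Stage 2: one revision pass with the complete story in view
    prompt = ("Full story:\n" + "\n".join(seen)
              + "\n\nReview and, if needed, revise this edge list; "
              + "return the final JSON: " + edges_to_json(graph))
    return graph_from_json(query_llm(prompt))
```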

Section 08

Evaluation Metric System

To quantitatively evaluate the causal reasoning ability of models, the research team designed a multi-dimensional evaluation metric system:

Causal Edge F1 Score (Strict Matching): Measures how closely the causal relationships identified by the model match the manually annotated gold standard, counting an edge as correct only when its event descriptions match exactly.

Causal Edge F1 Score (Loose Matching): Allows partially matching descriptions, better reflecting the model's semantic-level understanding rather than mere surface-text overlap.
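
A sketch of how both F1 variants might be computed over (cause, effect) edge lists; the Jaccard token-overlap criterion for loose matching is an assumption of this sketch, since the project's exact matching rule is not specified here.

```python
def edge_f1(predicted, gold, match):
    """F1 over causal edges; `match` decides when two edges count as equal."""
    if not predicted or not gold:
        return 0.0
    tp_pred = sum(any(match(p, g) for g in gold) for p in predicted)
    tp_gold = sum(any(match(p, g) for p in predicted) for g in gold)
    precision, recall = tp_pred / len(predicted), tp_gold / len(gold)
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

def strict(p, g):
    """Strict matching: event descriptions must be identical."""
    return p == g

def loose(p, g, threshold=0.6):
    """Loose matching (illustrative criterion): Jaccard token overlap
    above a threshold on both the cause and the effect description."""
    def jaccard(a, b):
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 0.0
    return jaccard(p[0], g[0]) >= threshold and jaccard(p[1], g[1]) >= threshold
```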

Pairwise Event Ordering Accuracy: Evaluates the model's ability to grasp the chronological order of events, which is the foundation of causal reasoning.
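
A minimal sketch of this metric, assuming the model outputs a total ordering over the same event identifiers as the gold standard; with 8 events per scenario there are 28 pairs to score.

```python
from itertools import combinations

def pairwise_order_accuracy(predicted_order, true_order):
    """Fraction of event pairs whose relative order the model gets right.

    Both arguments are lists over the same event identifiers; the
    pairwise formulation is a standard choice, assumed here.
    """
    rank_pred = {e: i for i, e in enumerate(predicted_order)}
    rank_true = {e: i for i, e in enumerate(true_order)}
    pairs = list(combinations(true_order, 2))
    correct = sum(
        (rank_pred[a] < rank_pred[b]) == (rank_true[a] < rank_true[b])
        for a, b in pairs
    )
    return correct / len(pairs)
```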

Time Label Accuracy: Tests how accurately the model assigns the "time_to_next" labels (immediate/short-term/medium-term/long-term), reflecting its understanding of causal time scales.
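
A sketch of this last metric; comparing labels only on the edges that the predicted and gold graphs share is an assumption of this sketch.

```python
def time_label_accuracy(predicted_edges, gold_edges):
    """Accuracy of time_to_next labels on edges both graphs agree on.

    Each argument maps (cause, effect) -> label in
    {"immediate", "short-term", "medium-term", "long-term"}.
    """
    shared = predicted_edges.keys() & gold_edges.keys()
    if not shared:
        return 0.0
    return sum(predicted_edges[e] == gold_edges[e] for e in shared) / len(shared)
```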