To quantify the models' causal reasoning ability, the research team designed a set of evaluation metrics covering several dimensions:
Causal Edge F1 Score (Strict Matching): Measures how closely the causal relationships identified by the model match the manually annotated reference answers. A predicted edge counts as correct only if its event descriptions match the annotation exactly.
Causal Edge F1 Score (Loose Matching): Accepts partial matches between event descriptions, so it reflects the model's semantic understanding rather than only surface-level text overlap.
Pairwise Event Ordering Accuracy: Evaluates whether the model places pairs of events in the correct chronological order, which is the foundation of causal reasoning.
Time Label Accuracy: Measures how accurately the model predicts the "time_to_next" label (immediate/short-term/medium-term/long-term), which reflects its understanding of the time scale over which a cause leads to its effect.
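
The four metrics above can be illustrated with a minimal Python sketch. The source does not specify how they are implemented, so everything here is an assumption made for illustration: the function names (`edge_f1`, `pairwise_order_accuracy`, `time_label_accuracy`), the representation of a causal edge as a `(cause, effect)` pair of strings, and in particular the loose-matching rule, which is approximated by a token-level Jaccard overlap with a hypothetical threshold of 0.5. The actual evaluation may use a different similarity measure.

```python
from itertools import combinations


def _token_overlap(a, b):
    """Jaccard overlap between the lowercase token sets of two descriptions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def edge_f1(pred_edges, gold_edges, loose=False, threshold=0.5):
    """F1 over (cause, effect) pairs.

    Strict mode requires both descriptions to be identical strings.
    Loose mode (an assumed rule) accepts a pair when both descriptions
    reach the token-overlap threshold.
    """
    def same(p, g):
        if loose:
            return (_token_overlap(p[0], g[0]) >= threshold
                    and _token_overlap(p[1], g[1]) >= threshold)
        return p[0] == g[0] and p[1] == g[1]

    unmatched = list(gold_edges)  # each gold edge can be matched at most once
    tp = 0
    for p in pred_edges:
        for g in unmatched:
            if same(p, g):
                unmatched.remove(g)
                tp += 1
                break
    if tp == 0:
        return 0.0
    precision = tp / len(pred_edges)
    recall = tp / len(gold_edges)
    return 2 * precision * recall / (precision + recall)


def pairwise_order_accuracy(pred_order, gold_order):
    """Fraction of gold event pairs that the prediction orders the same way."""
    rank = {event: i for i, event in enumerate(pred_order)}
    pairs = [(a, b) for a, b in combinations(gold_order, 2)
             if a in rank and b in rank]
    if not pairs:
        return 0.0
    return sum(rank[a] < rank[b] for a, b in pairs) / len(pairs)


def time_label_accuracy(pred_labels, gold_labels):
    """Fraction of positions where the predicted time_to_next label is correct."""
    if not gold_labels:
        return 0.0
    correct = sum(p == g for p, g in zip(pred_labels, gold_labels))
    return correct / len(gold_labels)


# Toy data, invented purely for illustration.
gold = [("heavy rain", "river floods"), ("river floods", "road closed")]
pred = [("heavy rain", "river floods"),
        ("the river floods", "road closed"),
        ("road closed", "traffic jam")]

strict_f1 = edge_f1(pred, gold)
loose_f1 = edge_f1(pred, gold, loose=True)
order_acc = pairwise_order_accuracy(["A", "C", "B"], ["A", "B", "C"])
label_acc = time_label_accuracy(
    ["immediate", "short-term", "long-term"],
    ["immediate", "medium-term", "long-term"],
)
print(strict_f1, loose_f1, order_acc, label_acc)
```

In this toy data the second predicted edge differs from the annotation only by the word "the", so it is rejected under strict matching but accepted under the assumed loose rule. That is the gap between the two F1 variants that the metric design is meant to expose.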