Zing Forum

EDGAR: A New Dataset for Automatically Extracting Geopolitical Events Using Large Language Models

The EDGAR dataset, released by the HCSS Data Lab, uses large language models to automatically extract geopolitical events from English-language news. It adopts the PLOVER ontology and adds a third-party role to the usual actor-recipient pair, providing structured event data for international relations research.

Tags: EDGAR, geopolitics, event data, large language models, PLOVER, international relations, datasets, automated extraction
Published 2026-05-12 16:53 | Recent activity 2026-05-12 17:00 | Estimated read 6 min

Section 01

EDGAR Dataset Guide: A Large Language Model-Driven Tool for Automated Extraction of Geopolitical Events

The EDGAR dataset, released by the HCSS Data Lab, uses large language models to automatically extract geopolitical events from English-language news. It adopts the 16 event types defined by the PLOVER ontology and adds a third-party role to capture multilateral interactions, giving international relations researchers structured event data. It is released under the CC BY 4.0 open license, which supports academic research and derivative work.

Section 02

Research Background: Needs and Trends in Automation of Geopolitical Event Data

In fields like international relations, structured event data is the foundation of quantitative analysis, but traditional manual coding is slow and difficult to scale. As large language models have improved, LLM-based event extraction has become practical, and the EDGAR project of the HCSS Data Lab is a representative example of this trend.

Section 03

Core Features of EDGAR and 16 Event Types of the PLOVER Ontology

The core features of EDGAR (Event Dataset using Geopolitical Analysis and Retrieval) are automated extraction, the standardized PLOVER ontology (16 root event types grouped into four categories: verbal cooperation, material cooperation, verbal conflict, and material conflict), an extended three-role model, and an open license. PLOVER's event types, such as CONSULT (consultation), AGREE (agreement), AID (aid), and ASSAULT (assault), let a reader quickly identify the nature of an event.
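
To make the relationship between event types and the four categories concrete, here is a minimal sketch of a lookup table. It is deliberately partial: it covers only the four event types named above, and the category assigned to each is an assumption based on the general PLOVER scheme rather than something taken from the EDGAR files. The names `Quad`, `EVENT_TYPE_TO_QUAD`, and `quad_of` are hypothetical.

```python
from enum import Enum
from typing import Optional


class Quad(Enum):
    """The four broad categories that PLOVER event types fall into."""
    VERBAL_COOPERATION = "verbal cooperation"
    MATERIAL_COOPERATION = "material cooperation"
    VERBAL_CONFLICT = "verbal conflict"
    MATERIAL_CONFLICT = "material conflict"


# Partial, illustrative mapping (assumption): only the event types
# mentioned in this article, not all 16 PLOVER types.
EVENT_TYPE_TO_QUAD = {
    "CONSULT": Quad.VERBAL_COOPERATION,
    "AGREE": Quad.VERBAL_COOPERATION,
    "AID": Quad.MATERIAL_COOPERATION,
    "ASSAULT": Quad.MATERIAL_CONFLICT,
}


def quad_of(event_type: str) -> Optional[Quad]:
    """Return the category for an event type, or None if it is not in the table."""
    return EVENT_TYPE_TO_QUAD.get(event_type.strip().upper())
```

A lookup like this lets an analysis group events by category first and drill down to individual event types only where needed.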

Section 04

Innovative Trilateral Role Model: Breaking the Limitations of Traditional Binary Interaction

Traditional event datasets mostly use a binary 'actor-recipient' structure. EDGAR adds a 'third party' role, which lets it capture more complex multilateral interactions. For example, in talks between the US and Russia on the Ukraine issue, the actor is the US, the recipient is Russia, and the third party is Ukraine. This three-role structure reflects the complexity of international relations more accurately than a binary one.
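
The three-role structure can be sketched as a small record type. This is an illustration only: the `Event` class and its `parties` method are hypothetical, and the field names simply mirror the roles described above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Event:
    """One event with the usual actor/recipient pair plus an optional third party."""
    event_type: str
    actor: str
    recipient: str
    third_party: Optional[str] = None

    def parties(self) -> Tuple[str, ...]:
        """All roles that are filled, in actor, recipient, third-party order."""
        return tuple(p for p in (self.actor, self.recipient, self.third_party) if p)


# The example from the text: US-Russia talks about Ukraine.
talks = Event("CONSULT", actor="United States", recipient="Russia", third_party="Ukraine")

# A purely binary representation keeps only the first two roles and loses Ukraine.
binary_view = (talks.actor, talks.recipient)
```

Keeping `third_party` optional means ordinary two-party events fit the same record without a placeholder value.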

Section 05

Data Format and Deduplication Mechanism: Ensuring Data Usability and Uniqueness

EDGAR provides two files: EDGAR_non_dedup.csv (before deduplication) and EDGAR_dedup.csv (after deduplication). Core fields include event_date, article_date, event_summary, source_quote, core_sentence, event_type, and actor/recipient/third_party, among others. Deduplication computes a weighted score from several signals, including semantic similarity (using the all-MiniLM-L6-v2 embedding model), Jaccard similarity of actors, event type matching, and metadata overlap. Graph clustering then merges events that score as similar.
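
The deduplication idea described above (a weighted similarity score followed by graph clustering) can be sketched with the standard library alone. Several things here are assumptions, not EDGAR's actual method: the weights and threshold are invented, a bag-of-words cosine stands in for the all-MiniLM-L6-v2 embedding similarity, metadata overlap is left out, the `actors` key is a hypothetical pre-parsed list, and the sample rows are made up for illustration.

```python
from collections import Counter
from math import sqrt

# Hypothetical weights and threshold; the article does not state the real values.
WEIGHTS = {"text": 0.5, "actors": 0.3, "type": 0.2}
THRESHOLD = 0.7


def text_similarity(a, b):
    """Stand-in for embedding cosine similarity: cosine over word counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score(e1, e2):
    """Weighted combination of the three signals, in the range 0..1."""
    return (WEIGHTS["text"] * text_similarity(e1["event_summary"], e2["event_summary"])
            + WEIGHTS["actors"] * jaccard(set(e1["actors"]), set(e2["actors"]))
            + WEIGHTS["type"] * float(e1["event_type"] == e2["event_type"]))


def cluster(events):
    """Link pairs scoring at or above THRESHOLD, then return connected components."""
    parent = list(range(len(events)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(events)):
        for j in range(i + 1, len(events)):
            if score(events[i], events[j]) >= THRESHOLD:
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(events)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


# Made-up sample rows, not taken from the dataset.
events = [
    {"event_summary": "US and Russia hold talks on Ukraine",
     "actors": ["USA", "RUS"], "event_type": "CONSULT"},
    {"event_summary": "US and Russia hold talks on Ukraine in Geneva",
     "actors": ["USA", "RUS"], "event_type": "CONSULT"},
    {"event_summary": "Country A sends aid to Country B",
     "actors": ["A", "B"], "event_type": "AID"},
]
clusters = cluster(events)  # each inner list holds indices of events judged to be duplicates
```

Using connected components means that if event A matches B and B matches C, all three are merged even when A and C fall below the threshold on their own. That is one common reading of "graph clustering", and a stricter method would give smaller clusters.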

Section 06

Application Scenarios and Limitations: Usage Value and Notes for EDGAR

EDGAR currently covers English-language news from February 1 to June 30, 2024. Application scenarios include conflict early warning (monitoring the frequency of events such as ASSAULT), diplomatic relations analysis (tracking cooperation events), sanctions research (SANCTION time series), and building multilateral relationship networks. Its limitations are that it covers only English-language news, may carry bias from single sources, contains errors in model-inferred fields, and spans a short time range.
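
As a rough sketch of the monitoring use case, the snippet below counts events of one type per month. It assumes `event_date` is stored as an ISO `YYYY-MM-DD` string, which the article does not confirm, and it runs on a few made-up rows rather than the real file. The helper name `monthly_counts` is hypothetical.

```python
import csv
import io
from collections import Counter

# Made-up rows that use two of the field names listed in the article.
SAMPLE = """event_date,event_type
2024-02-03,ASSAULT
2024-02-17,ASSAULT
2024-03-05,SANCTION
2024-03-09,ASSAULT
"""


def monthly_counts(rows, event_type):
    """Count rows of the given event type per YYYY-MM month, sorted by month."""
    counts = Counter()
    for row in rows:
        if row["event_type"] == event_type:
            counts[row["event_date"][:7]] += 1  # assumes ISO dates (assumption)
    return dict(sorted(counts.items()))


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
assault_by_month = monthly_counts(rows, "ASSAULT")
sanction_by_month = monthly_counts(rows, "SANCTION")
```

To try this on the dataset itself, the `io.StringIO(SAMPLE)` source could be swapped for an open handle on EDGAR_dedup.csv, provided the column names and date format match the assumptions above.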

Section 07

Comparison with POLECAT and Conclusion: Significance and Future Prospects of EDGAR

EDGAR differs from POLECAT in three respects: extraction technology (LLM versus traditional NLP), role model (trilateral versus binary), and event granularity (core events versus multiple related events). In conclusion, EDGAR is an important step in the shift of geopolitical event data extraction towards automation and intelligence. As its data coverage expands and its methodology improves, it could play a greater role in understanding global political dynamics.