UniEdit: A Unified Knowledge Editing Evaluation Benchmark for Large Language Models

UniEdit is a large-scale open-domain knowledge editing evaluation benchmark with 311,000 samples, covering 25 knowledge domains, which systematically evaluates knowledge editing algorithms from three dimensions: reliability, generalization, and locality.

Tags: knowledge editing, large language models, evaluation benchmark, NeurIPS, knowledge update, model editing, Wikidata
Published 2026-05-05 17:15 · Recent activity 2026-05-05 17:22 · Estimated read: 7 min

Section 01

UniEdit: Guide to the Unified Evaluation Benchmark for Knowledge Editing in Large Language Models

UniEdit is a unified knowledge editing evaluation benchmark for large language models, featuring 311,000 samples across 25 knowledge domains. It systematically evaluates knowledge editing algorithms along three dimensions: reliability, generalization, and locality. By addressing the limitations of existing benchmarks (narrow knowledge coverage, limited structural diversity, and incomplete evaluation criteria), it provides a standardized evaluation tool for the field and promotes the development of knowledge editing technology.


Section 02

Background: Demand for Knowledge Editing and Limitations of Existing Benchmarks

The knowledge a large language model acquires during pretraining goes stale over time, and knowledge editing technology aims to update specific facts inside the model without retraining it. However, existing evaluation benchmarks suffer from narrow knowledge coverage, insufficient structural diversity, and incomplete evaluation criteria, making it difficult to assess algorithm performance comprehensively. To address this, the NeurIPS 2025 research team launched the UniEdit benchmark.


Section 03

Core Design of UniEdit: Scale, Domains, and Three Evaluation Dimensions

UniEdit is a large-scale open-domain benchmark with 311,000 samples, built from 29.9 million entities in Wikidata, covering 25 domains (across five major categories including natural sciences, humanities, and social sciences). Its core evaluation dimensions include:

  • Reliability: Whether the target fact can be answered correctly after editing
  • Generalization: Whether it can be extended to semantically equivalent expressions
  • Locality: Whether it only affects the target knowledge without interfering with irrelevant content
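The three dimensions above can be illustrated with a minimal scoring sketch. This assumes each edit case carries probe question/answer pairs for each dimension; the `EditCase` class, `evaluate_edit` function, and the toy lookup model are illustrative stand-ins, not the UniEdit API.

```python
# Sketch of scoring an edit along the three core dimensions, assuming
# each case supplies (question, expected answer) probe pairs. All names
# here are hypothetical; UniEdit's actual schema may differ.
from dataclasses import dataclass


@dataclass
class EditCase:
    reliability: list     # probes asking the edited fact directly
    generalization: list  # paraphrased probes; should yield the same target
    locality: list        # unrelated probes; should match the pre-edit answer


def score(pairs, answer_fn):
    """Fraction of probes the model answers as expected."""
    if not pairs:
        return 1.0
    hits = sum(answer_fn(q).strip().lower() == a.strip().lower() for q, a in pairs)
    return hits / len(pairs)


def evaluate_edit(case, answer_fn):
    return {
        "reliability": score(case.reliability, answer_fn),
        "generalization": score(case.generalization, answer_fn),
        "locality": score(case.locality, answer_fn),
    }


# Toy "model": a lookup table standing in for a post-edit LLM.
edited = {
    "Who directed Inception?": "Christopher Nolan",
    "Inception was directed by whom?": "Christopher Nolan",
    "What is the capital of France?": "Paris",
}
case = EditCase(
    reliability=[("Who directed Inception?", "Christopher Nolan")],
    generalization=[("Inception was directed by whom?", "Christopher Nolan")],
    locality=[("What is the capital of France?", "Paris")],
)
print(evaluate_edit(case, lambda q: edited.get(q, "")))
# → {'reliability': 1.0, 'generalization': 1.0, 'locality': 1.0}
```

A locality probe scores against the model's pre-edit answer, so a perfect locality score means the edit changed nothing outside its target.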

Section 04

Data Generation: NMCS Algorithm Facilitates Diversified Sample Construction

UniEdit uses the NMCS (Neighborhood Multi-hop Chain Sampling) algorithm to generate diversified samples, with the process as follows:

  1. Sample structured fact chains from Wikidata
  2. Convert to natural language using Deepseek-V3
  3. Generate samples for the various evaluation scenarios, such as restatement and multi-hop reasoning

This method expands the evaluation coverage and improves the comprehensiveness of the assessment.
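Step 1 above can be sketched with a toy knowledge graph standing in for Wikidata. The graph contents, chain length, and function names are illustrative assumptions; the actual NMCS algorithm samples chains from an edited triple's neighborhood at Wikidata scale before handing them to Deepseek-V3 for verbalization.

```python
# Illustrative sketch of sampling a multi-hop fact chain from a small
# knowledge graph. GRAPH and sample_chain are hypothetical stand-ins
# for Wikidata and the paper's NMCS sampler.
import random

# triples: subject -> list of (relation, object)
GRAPH = {
    "Paris": [("capital_of", "France"), ("located_in", "Europe")],
    "France": [("continent", "Europe"), ("currency", "Euro")],
    "Europe": [("largest_city", "Istanbul")],
}


def sample_chain(start, hops, rng):
    """Walk up to `hops` edges from `start`, returning the visited triples."""
    chain, node = [], start
    for _ in range(hops):
        edges = GRAPH.get(node)
        if not edges:
            break
        rel, obj = rng.choice(edges)
        chain.append((node, rel, obj))
        node = obj
    return chain


chain = sample_chain("Paris", hops=2, rng=random.Random(0))
# Each sampled chain would then be verbalized (step 2 above) into a
# natural-language multi-hop question by an LLM such as Deepseek-V3.
print(chain)
```

Chaining hops this way is what lets one edited triple generate restatement, multi-hop, and specificity probes from the same neighborhood.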

Section 05

Fine-grained Evaluation Dimensions and Open-source Dataset Structure

UniEdit supports 12 evaluation dimensions, including restatement, multi-hop reasoning, and relation reversal:

  • Restatement (Rep): different expressions of the same fact
  • Multi-hop Reasoning (MH): complex questions requiring multi-step reasoning
  • Relation Reversal (RR): ability to reason about inverse relations
  • Same Entity Reasoning (SER): associating different attributes of the same entity
  • Subject Alias (SA): recognizing different names of an entity
  • Object Alias (OA): recognizing different expressions of the target value
  • Subject Specificity (SS): ability to distinguish similar subjects
  • Relation Specificity (RS): ability to distinguish similar relations
  • Object Specificity (OS): ability to distinguish similar objects
  • 1-N Forgetting (1-NF): forgetting issues in one-to-many relations
  • Combined Evaluation (CC): scenarios combining the above criteria
  • Open Domain (OD): real-world open scenarios

The dataset has been open-sourced on HuggingFace with a hierarchical layout (one JSON file per domain under train/ and test/ directories), and can be quickly deployed and used via the accompanying GitHub repository.
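Under the hierarchical layout described above, a single domain split could be read like this. The root path, file naming, and field contents are assumptions for illustration, not the published repository's exact interface.

```python
# Minimal sketch of reading one domain file from a train/test-per-domain
# layout. The directory structure is as described in the text; the exact
# file names and JSON schema are assumptions.
import json
from pathlib import Path


def load_domain(root, split, domain):
    """Read samples for one domain from `<root>/<split>/<domain>.json`."""
    path = Path(root) / split / f"{domain}.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)
```

For example, `load_domain("UniEdit", "test", "physics")` would return the parsed sample list for a hypothetical physics domain file, ready to feed into an editing-evaluation loop.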


Section 06

Practical Significance and Application Prospects of UniEdit

The launch of UniEdit delivers value on several fronts:

  1. Standardized Evaluation: Provides a fair and comprehensive comparison benchmark for algorithms
  2. Defect Discovery: Fine-grained evaluation reveals blind spots of existing methods
  3. Design Guidance: Helps target improvements in editing technology
  4. Promoting Development: Large-scale open-domain coverage brings evaluation closer to real-world applications

UniEdit is an indispensable tool in the field of knowledge editing, laying the foundation for the practical application of LLMs.