Three Paradigms of RLHF Annotation: Extension, Evidence, and Authority

This article distinguishes three normative roles of human annotation in RLHF—Extension, Evidence, and Authority—analyzes the impact of different paradigms on annotation process design, and proposes suggestions for customizing annotation strategies by dimension.

Tags: RLHF, Human Feedback, AI Alignment, Annotation Ethics, Value Alignment, AI Governance, Normative Theory
Published 2026-04-29 01:39 · Recent activity 2026-04-29 11:05 · Estimated read 8 min

Section 01

Introduction: Three Paradigms of RLHF Annotation and Practical Recommendations

Reinforcement Learning from Human Feedback (RLHF) is the current mainstream method for aligning large language models, yet the normative role of annotators' judgments has long been overlooked. This article distinguishes three paradigms of RLHF annotation (Extension, Evidence, and Authority), analyzes how confusing these models produces failure modes in existing systems, and recommends customizing strategies by annotation dimension to make AI alignment more reasonable, fair, and transparent.


Section 02

The Essential Dilemma of RLHF and Implicit Assumptions in Existing Research

The Essential Dilemma of RLHF

RLHF has become the core alignment method for leading models (e.g., ChatGPT, Claude), but a fundamental question is overlooked: the normative role of annotators' judgments (do they execute the designer's will, provide independent evidence, or represent a group's decisions?) directly shapes how the annotation process should be designed.

Implicit Assumptions in Existing Research

  • InstructGPT/ChatGPT: primarily the Extension model, emphasizing consistency with researchers' expectations, but with Evidence-model characteristics in content safety domains;
  • Constitutional AI: mixes the Extension model (designers formulate constitutional principles) with the Evidence model (annotators interpret and apply the principles);
  • Crowdsourcing platforms: assume the Evidence model (aggregation via majority voting), yet strict guidelines push practice toward the Extension model, as the sketch below illustrates.
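
To make the crowdsourcing tension concrete, here is a minimal Python sketch (all labels invented) of how the same batch of annotations yields different outcomes under Evidence-style majority voting and Extension-style guideline filtering:

```python
from collections import Counter

# Toy example (all labels invented): five annotators compare responses A and B
# for one prompt; "gold" is the guideline author's preferred answer.
annotations = ["A", "A", "B", "B", "B"]
gold = "A"

# Evidence-style aggregation: majority voting treats every label as data.
majority = Counter(annotations).most_common(1)[0][0]          # -> "B"

# Extension-style quality control: labels contradicting the guideline are
# scored as errors and filtered out before aggregation.
retained = [label for label in annotations if label == gold]  # -> ["A", "A"]

print(f"majority vote: {majority}; after guideline filtering: {retained}")
```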

Section 03

Three Conceptual Models of RLHF Annotation

Model 1: Extension

  • Core: Annotators act as extensions of the designers and reflect the designers' values.
  • Operational logic: Designers define clear standards → annotators are trained on them → quality is measured by consistency with the standards → disagreement is treated as error.
  • Scenarios: Technical document proofreading, code syntax evaluation.
  • Advantages: Clear standards, easy quality control. Risks: Amplifies designer bias, ignores value diversity.
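
A minimal sketch of the Extension model's quality-control loop, assuming designer-provided reference labels; the annotator names, items, and the 0.9 agreement threshold are illustrative:

```python
# Sketch of Extension-model quality control: annotator quality is the rate of
# agreement with designer-provided reference labels. Names, items, and the
# 0.9 threshold are illustrative assumptions.
reference = {"item1": "pass", "item2": "fail", "item3": "pass"}

def agreement_rate(labels: dict[str, str]) -> float:
    shared = reference.keys() & labels.keys()
    hits = sum(labels[item] == reference[item] for item in shared)
    return hits / len(shared)

annotators = {
    "ann_a": {"item1": "pass", "item2": "fail", "item3": "pass"},
    "ann_b": {"item1": "pass", "item2": "pass", "item3": "fail"},
}

for name, labels in annotators.items():
    rate = agreement_rate(labels)
    # In this model, disagreement with the reference counts as error.
    status = "ok" if rate >= 0.9 else "retrain"
    print(name, round(rate, 2), status)
```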

Model 2: Evidence

  • Core: Annotators provide independent factual evidence.
  • Operational logic: Inter-subjectively verifiable facts exist → annotators collect them → aggregation strengthens the evidence → disagreement reflects genuine diversity.
  • Scenarios: Content safety norms, cultural sensitivity assessment.
  • Advantages: Captures social diversity. Risks: The line between facts and values blurs; the annotator sample may be biased.
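
A minimal sketch of Evidence-style aggregation, assuming the goal is to preserve the real distribution of judgments rather than collapse it to a single answer; the label vocabulary is invented:

```python
from collections import Counter

# Sketch of Evidence-model aggregation: keep the full distribution of
# judgments instead of collapsing to a single majority answer, so genuine
# diversity survives into the reward signal. Labels are invented.
labels = ["acceptable", "acceptable", "borderline", "unacceptable", "acceptable"]

counts = Counter(labels)
distribution = {label: count / len(labels) for label, count in counts.items()}

print(distribution)
# {'acceptable': 0.6, 'borderline': 0.2, 'unacceptable': 0.2}
```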

Model 3: Authority

  • Core: Annotators hold decision-making authority as representatives of an affected group.
  • Operational logic: Affected groups participate in decision-making → annotators act as democratic representatives → collective judgments are binding → designers implement them.
  • Scenarios: Medical/legal AI, localization of public services.
  • Advantages: Enhances democratic legitimacy. Risks: Insufficient representation, unclear rights and responsibilities, low efficiency.
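
A minimal sketch of an Authority-style decision under assumed rules: a small panel of named representatives, a quorum requirement, and an auditable vote record. All names and the quorum size are illustrative:

```python
from collections import Counter
from dataclasses import dataclass, field

# Sketch of an Authority-model decision: a panel of representatives issues a
# binding judgment only when a quorum is met, and the vote is recorded so
# responsibility is traceable. Names and the quorum size are illustrative.
@dataclass
class PanelDecision:
    question: str
    votes: dict[str, str]                 # representative -> vote
    quorum: int = 3
    record: list[str] = field(default_factory=list)

    def decide(self) -> str:
        if len(self.votes) < self.quorum:
            raise ValueError("no quorum: the decision is not binding")
        outcome = Counter(self.votes.values()).most_common(1)[0][0]
        self.record.append(f"{self.question} -> {outcome}, voters: {sorted(self.votes)}")
        return outcome

decision = PanelDecision(
    question="May the assistant give dosage advice?",
    votes={"rep_clinician": "no", "rep_patient": "no", "rep_regulator": "yes"},
)
print(decision.decide())   # "no", binding because the quorum of 3 was met
print(decision.record)     # auditable trace of who decided what
```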


Section 04

Failure Modes of Confused Annotation Models

Failure Mode 1: Extension Disguised as Evidence

Designers claim to reflect user preferences (Evidence) but actually strictly control standards (Extension) → The system is bound to the designer's values without transparency, ignoring diverse needs.

Failure Mode 2: Democratic Claims Without Authority

The system claims to represent the public interest (Authority), but the annotators are not genuinely representative and the process has no accountability mechanism → the values of particular groups are imposed, with no one answerable for the consequences.

Failure Mode 3: Evidence Treated as Extension

Annotators' social insights (Evidence) are regarded as execution deviations (Extension) → Valuable information is filtered out, and the system becomes disconnected from reality.
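
One way to guard against this failure mode, sketched below with invented data: before discarding dissenting labels as errors, check whether disagreement clusters by annotator subgroup, since clustered dissent is evidence rather than noise.

```python
from collections import Counter, defaultdict

# Diagnostic sketch for this failure mode (data invented): before discarding
# dissenting labels as execution errors, check whether disagreement clusters
# by annotator subgroup. Clustered dissent is evidence, not noise.
labels = [
    ("group_x", "acceptable"), ("group_x", "acceptable"), ("group_x", "acceptable"),
    ("group_y", "unacceptable"), ("group_y", "unacceptable"), ("group_y", "acceptable"),
]

by_group = defaultdict(Counter)
for group, label in labels:
    by_group[group][label] += 1

for group, counts in by_group.items():
    majority_label, n = counts.most_common(1)[0]
    share = n / sum(counts.values())
    print(group, majority_label, f"{share:.0%}")

# group_x leans 'acceptable' (100%), group_y leans 'unacceptable' (67%):
# the split tracks group membership, so filtering dissent would erase signal.
```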


Section 05

Philosophical Significance of RLHF Annotation Paradigms and Implications for AI Governance

Democratization of Value Alignment

Currently, RLHF is mostly done internally by enterprises. The Authority model provides a framework for democratic alignment, but challenges of representation and accountability need to be addressed.

Recognition of Plural Values

The Evidence and Authority models both imply that there is no single 'correct' value system; AI must accommodate reasonable pluralism.

Ethical Requirements for Transparency

Users have the right to know: Who sets the standards, how standards are formulated and updated, and how to appeal and correct them.


Section 06

Practical Recommendations: Customize RLHF Annotation Strategies by Dimension

Example of Dimension Decomposition

  • Factual accuracy → Extension model: Clear standard answers, strict training and quality inspection;
  • User experience → Evidence model: Collect subjective feelings, tolerate disagreements, aggregate real distribution;
  • Value trade-offs → Authority model: Clear representation, transparent process, establish accountability (a per-dimension configuration sketch follows below).
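
A minimal configuration sketch tying the decomposition together; the dimension names, aggregators, and rationales are hypothetical, and the point is only that each dimension declares its model, its aggregation rule, and its reason:

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable

def majority_after_qc(labels: list[str]) -> str:
    # Extension: labels are checked against a reference upstream; here a
    # simple majority stands in for the post-QC consensus.
    return Counter(labels).most_common(1)[0][0]

def full_distribution(labels: list[str]) -> dict[str, float]:
    # Evidence: keep the whole distribution of judgments.
    return {k: v / len(labels) for k, v in Counter(labels).items()}

@dataclass
class DimensionPolicy:
    model: str            # "extension", "evidence", or "authority"
    aggregate: Callable
    rationale: str        # stated reason, per implementation key point 1

POLICIES = {
    "factual_accuracy": DimensionPolicy("extension", majority_after_qc,
                                        "verifiable reference answers exist"),
    "user_experience": DimensionPolicy("evidence", full_distribution,
                                       "subjective; disagreement is signal"),
}

print(POLICIES["user_experience"].aggregate(["good", "good", "poor"]))
# {'good': 0.666..., 'poor': 0.333...}
```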

Implementation Key Points

  1. Clearly state the model for each dimension and the reasons;
  2. Ensure training/quality inspection/aggregation methods are consistent with the model;
  3. Monitor model drift and regularly check consistency (a drift-monitor sketch follows below);
  4. The Authority model requires participation of affected groups.
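
For key point 3, a minimal drift-monitor sketch, assuming agreement with reference labels is tracked per weekly batch; the history values and the 0.05 alert margin are illustrative assumptions:

```python
# Drift-monitor sketch for key point 3: track agreement with reference labels
# per weekly batch and alert when it falls below baseline minus a margin.
# The history values and the 0.05 margin are illustrative assumptions.
def drift_alerts(weekly_agreement: list[float], baseline: float,
                 margin: float = 0.05) -> list[int]:
    """Return indices of weeks whose agreement dropped below baseline - margin."""
    return [week for week, rate in enumerate(weekly_agreement)
            if rate < baseline - margin]

history = [0.93, 0.92, 0.90, 0.85, 0.83]     # weekly agreement with references
print(drift_alerts(history, baseline=0.92))  # -> [3, 4]
```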