# Three Paradigms of RLHF Annotation: Extension, Evidence, and Authority

> This article distinguishes three normative roles of human annotation in RLHF—Extension, Evidence, and Authority—analyzes the impact of different paradigms on annotation process design, and proposes suggestions for customizing annotation strategies by dimension.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-28T17:39:14.000Z
- Last activity: 2026-04-29T03:05:24.967Z
- Popularity: 130.6
- Keywords: RLHF, human feedback, AI alignment, annotation ethics, value alignment, AI governance, normative theory
- Page URL: https://www.zingnex.cn/en/forum/thread/rlhf
- Canonical: https://www.zingnex.cn/forum/thread/rlhf
- Markdown source: floors_fallback

---

## Introduction: Three Paradigms of RLHF Annotation and Practical Recommendations

Reinforcement Learning from Human Feedback (RLHF) is the mainstream method for aligning large language models, yet the normative role of annotators' judgments has long been overlooked. This article distinguishes three paradigms of RLHF annotation—Extension, Evidence, and Authority—analyzes how existing work confuses them and the failure modes that result, and recommends customizing the annotation strategy dimension by dimension to make AI alignment more reasonable, fair, and transparent.

## The Essential Dilemma of RLHF and Implicit Assumptions in Existing Research

### The Essential Dilemma of RLHF
RLHF has become the core alignment method for top models (e.g., ChatGPT, Claude), but a fundamental question is overlooked: whether annotators' judgments execute the designer's will, provide independent evidence, or represent a group's decision directly shapes how the annotation process should be designed.

### Implicit Assumptions in Existing Research
- **InstructGPT/ChatGPT**: Mainly based on the Extension model, emphasizing consistency with researchers' expectations, but with Evidence model characteristics in content security domains;
- **Constitutional AI**: Mixes Extension (designers formulate constitutional principles) and Evidence models (annotators interpret and apply them);
- **Crowdsourcing platforms**: Assume the Evidence model (aggregation via majority voting), but strict guidelines push towards the Extension model.

## Three Conceptual Models of RLHF Annotation

### Model 1: Extension
- **Core**: Annotators act as extensions of the designers, reflecting the designers' values.
- **Operational logic**: Designers define explicit standards → annotators are trained on them → quality is measured by consistency with those standards → disagreement is treated as error.
- **Scenarios**: Technical document proofreading, code syntax evaluation.
- **Advantages**: Clear standards, easy quality control. **Risks**: Amplifies designer bias; ignores value diversity.
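
The consistency check at the heart of the Extension model can be sketched as a per-annotator agreement rate against designer-defined gold labels. This is a minimal illustration; all data, names, and the scoring rule are hypothetical, not a description of any particular platform:

```python
from typing import Dict, List

def agreement_with_gold(labels: Dict[str, List[str]], gold: List[str]) -> Dict[str, float]:
    """Fraction of items on which each annotator matches the designers' gold labels.

    Under the Extension model, disagreement with gold counts as error,
    so this rate doubles as the annotator's quality score.
    """
    return {
        annotator: sum(a == g for a, g in zip(answers, gold)) / len(gold)
        for annotator, answers in labels.items()
    }

# Hypothetical labels for three items; gold is fixed by the designers.
gold = ["A", "B", "A"]
labels = {"ann1": ["A", "B", "A"], "ann2": ["A", "A", "A"]}
scores = agreement_with_gold(labels, gold)  # ann1 matches all 3 items, ann2 only 2 of 3
```

Note the built-in risk: an annotator who disagrees with gold for a good reason scores exactly the same as one who answers carelessly.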

### Model 2: Evidence
- **Core**: Annotators provide independent factual evidence.
- **Operational logic**: Inter-subjectively verifiable facts exist → annotators collect them → aggregation strengthens the evidence → disagreement reflects genuine diversity.
- **Scenarios**: Content-safety norms, cultural sensitivity assessment.
- **Advantages**: Captures social diversity. **Risks**: Blurred line between facts and values; sampling bias.
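
The aggregation step distinguishes this model operationally: instead of collapsing votes into a single majority label, the full distribution is kept so that disagreement survives as signal. A minimal sketch (names and data hypothetical):

```python
from collections import Counter
from typing import Dict, List

def preference_distribution(votes: List[str]) -> Dict[str, float]:
    """Aggregate annotator votes into a probability distribution.

    Under the Evidence model, disagreement is signal, not noise: the
    downstream reward model sees how contested each item is, rather than
    a majority label that erases the minority view.
    """
    counts = Counter(votes)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

votes = ["safe", "safe", "unsafe", "safe", "unsafe"]
dist = preference_distribution(votes)  # {'safe': 0.6, 'unsafe': 0.4}
```

Majority voting would report only `safe` here; the distribution records that two of five annotators judged otherwise.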

### Model 3: Authority
- **Core**: Annotators hold decision-making authority as representatives of an affected group.
- **Operational logic**: Affected groups participate in decision-making → annotators act as democratic representatives → collective judgments are binding → designers implement them.
- **Scenarios**: Medical/legal AI, localization of public services.
- **Advantages**: Strengthens democratic legitimacy. **Risks**: Insufficient representation, unclear rights and responsibilities, low efficiency.

## Failure Modes of Confused Annotation Models

### Failure Mode 1: Extension Disguised as Evidence
Designers claim to reflect user preferences (Evidence) but actually strictly control standards (Extension) → The system is bound to the designer's values without transparency, ignoring diverse needs.

### Failure Mode 2: Democratic Claims Without Authority
Claims to represent public interests (Authority), but annotators lack representation and the process has no accountability → Values of specific groups are imposed, with no one responsible for consequences.

### Failure Mode 3: Evidence Treated as Extension
Annotators' social insights (Evidence) are regarded as execution deviations (Extension) → Valuable information is filtered out, and the system becomes disconnected from reality.

## Philosophical Significance of RLHF Annotation Paradigms and Implications for AI Governance

### Democratization of Value Alignment
Currently, RLHF is mostly done internally by enterprises. The Authority model provides a framework for democratic alignment, but challenges of representation and accountability need to be addressed.

### Recognition of Plural Values
The Evidence and Authority models indicate: There is no single 'correct' value system; AI needs to adapt to reasonable pluralism.

### Ethical Requirements for Transparency
Users have the right to know who sets the standards, how they are formulated and updated, and how they can be appealed and corrected.

## Practical Recommendations: Customize RLHF Annotation Strategies by Dimension

### Example of Dimension Decomposition
- **Factual accuracy** → Extension model: Clear standard answers, strict training and quality inspection;
- **User experience** → Evidence model: Collect subjective feelings, tolerate disagreements, aggregate real distribution;
- **Value trade-offs** → Authority model: Clear representation, transparent process, establish accountability.
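
One way to make the per-dimension choice explicit and auditable is a small configuration structure that forces each dimension to declare its model, its aggregation method, and a rationale. The following is a hypothetical sketch; the field names and values are illustrative, not a standard schema:

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class DimensionPolicy:
    dimension: str
    model: str        # "extension" | "evidence" | "authority"
    aggregation: str  # how disagreements are resolved for this dimension
    rationale: str    # implementation point 1: state the model and the reason

POLICIES: List[DimensionPolicy] = [
    DimensionPolicy("factual_accuracy", "extension", "gold_standard",
                    "clear right answers exist; strict training and QC"),
    DimensionPolicy("user_experience", "evidence", "distribution",
                    "subjective; disagreement is signal, aggregate the real distribution"),
    DimensionPolicy("value_tradeoffs", "authority", "deliberation",
                    "binding collective judgment by clear representatives"),
]
```

Keeping the rationale next to the model choice makes mismatches (e.g., a gold-standard QC pipeline attached to a dimension declared as Evidence) easy to spot in review.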

### Implementation Key Points
1. Clearly state the model for each dimension and the reasons;
2. Ensure training/quality inspection/aggregation methods are consistent with the model;
3. Monitor model drift and regularly check consistency;
4. The Authority model requires participation of affected groups.
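
Point 3, monitoring model drift, can be approximated with a simple agreement-rate check. The heuristic below is hypothetical and assumes weekly inter-annotator agreement statistics are available: for an Evidence-model dimension, agreement drifting toward 1.0 suggests that guidelines have quietly pushed annotators into the Extension model.

```python
from typing import Dict, List, Tuple

def detect_drift(agreement_by_week: Dict[str, float],
                 expected_range: Tuple[float, float]) -> List[str]:
    """Flag weeks where inter-annotator agreement leaves the expected band.

    An Evidence-model dimension should show moderate, stable agreement;
    values climbing out of the band toward 1.0 are a drift warning sign.
    """
    lo, hi = expected_range
    return [week for week, rate in agreement_by_week.items()
            if not (lo <= rate <= hi)]

weekly = {"w1": 0.72, "w2": 0.74, "w3": 0.95}
flagged = detect_drift(weekly, expected_range=(0.6, 0.85))  # ['w3']
```

The expected band itself should come from the dimension's declared model, so the same monitoring code enforces different norms for Extension and Evidence dimensions.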
