# DGAO: Addressing the Order Sensitivity of Large Language Models with Reinforcement Learning

> The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen) and Baidu Research jointly propose the DGAO framework, the first to apply reinforcement learning to order fairness in large language models (LLMs), significantly reducing order sensitivity while improving model accuracy.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-12T11:31:18.000Z
- Last activity: 2026-05-13T02:47:59.552Z
- Popularity: 131.7
- Keywords: Large Language Models, Order Fairness, Reinforcement Learning, RAG, DGAO, Machine Learning
- Page link: https://www.zingnex.cn/en/forum/thread/dgao
- Canonical: https://www.zingnex.cn/forum/thread/dgao
- Markdown source: floors_fallback

---

## [Introduction] DGAO Framework: Addressing the Order Sensitivity of Large Language Models with Reinforcement Learning

The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen) and Baidu Research jointly propose DGAO (Dual Group Advantage Optimization), the first framework to apply reinforcement learning to order fairness in large language models (LLMs). It significantly reduces order sensitivity while improving model accuracy, offering a new solution to the order-bias problem in LLMs.

## Background: The Order Sensitivity Problem of LLMs and Limitations of Existing Methods

### The Order Sensitivity Problem
Large language models are sensitive to input order: the same information presented in different orders can produce drastically different outputs. This especially affects scenarios such as RAG (Retrieval-Augmented Generation) and in-context learning, reducing model reliability and fairness.

### Dilemmas of Existing Methods
- **Statistical/search methods**: attempt to find the optimal input permutation, but they increase inference overhead and do not fundamentally remove order bias;
- **Supervised fine-tuning methods**: train on multi-order variants to mitigate sensitivity, but they sacrifice accuracy and can make the model overly stable on wrong information, i.e., consistently reproducing the same hallucinated output.

## DGAO Framework: Core Design of Dual Group Advantage Optimization

### Core Idea
DGAO achieves its goals by optimizing two dimensions simultaneously:
1. **Intra-group relative accuracy advantage**: Encourage correct outputs under the same input order;
2. **Inter-group relative stability advantage**: Encourage stable performance across different input orders.
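The post does not give the exact formulas for the two advantage terms, so the sketch below is a hypothetical interpretation: intra-group advantages normalize each sampled output's reward against other samples drawn under the *same* input order (a GRPO-style group baseline), while the inter-group term penalizes order variants whose average reward drifts away from the mean over all orders. The function name `dual_group_advantages`, the reward layout, and the specific stability penalty are all assumptions for illustration.

```python
import numpy as np

def dual_group_advantages(rewards):
    """Hypothetical sketch of DGAO's two advantage terms.

    rewards: array of shape (n_orders, n_samples) -- the reward of each
    sampled output, grouped by the input-order variant it was drawn under.
    """
    # Intra-group accuracy advantage: normalize each reward against the
    # other samples drawn under the SAME input order (group baseline).
    mu_intra = rewards.mean(axis=1, keepdims=True)
    sd_intra = rewards.std(axis=1, keepdims=True) + 1e-8
    a_intra = (rewards - mu_intra) / sd_intra

    # Inter-group stability advantage: penalize order variants whose mean
    # reward deviates from the mean over ALL variants, discouraging
    # order-dependent swings in performance.
    group_mean = rewards.mean(axis=1, keepdims=True)  # per-order mean
    overall = rewards.mean()                          # mean across orders
    a_inter = -np.abs(group_mean - overall) * np.ones_like(rewards)

    return a_intra, a_inter

rewards = np.array([[1.0, 0.0, 1.0],    # samples under order variant A
                    [0.0, 0.0, 1.0]])   # samples under order variant B
a_intra, a_inter = dual_group_advantages(rewards)
```

In this toy example, variant A scores better on average than variant B, so both groups receive a nonzero stability penalty; a model that performed identically under both orders would have `a_inter` equal to zero everywhere.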

### Technical Implementation
DGAO adopts a reinforcement learning training paradigm:
- Generate multiple order variants for the same set of inputs;
- Evaluate the model's performance under different orders;
- Calculate accuracy and stability advantages;
- Update parameters via policy gradients to make the model focus on content semantics rather than input order.
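The first step above, generating multiple order variants of the same inputs, can be sketched as follows. The sampling strategy (uniform random permutations, the variant count `k`, and the helper name `order_variants`) is an assumption, since the post does not specify how DGAO samples orderings; the comments note where the rollout and policy-gradient steps would follow.

```python
import itertools
import random

def order_variants(passages, k=4, seed=0):
    """Sample k distinct orderings of the same retrieved passages.

    A minimal sketch: uniform random permutations are assumed here;
    the actual DGAO sampling strategy may differ.
    """
    rng = random.Random(seed)
    perms = list(itertools.permutations(passages))
    return rng.sample(perms, k=min(k, len(perms)))

passages = ["doc_a", "doc_b", "doc_c"]
variants = order_variants(passages, k=4)
# Each variant presents the same content in a different order. The model
# would then be rolled out on every variant, accuracy and stability
# advantages computed per group, and parameters updated with a standard
# policy-gradient step.
```

Because `random.sample` draws without replacement, all returned orderings are distinct while carrying identical content, which is exactly the property the advantage comparison relies on.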

## New Evaluation Metrics: Key Tools for Identifying Pseudo-Stability

The research team proposes two new metrics to comprehensively evaluate order fairness:
- **Consistency rate**: Measures the consistency of outputs across different input orders;
- **Overconfidence rate**: reveals false stability on wrong answers (the model stays consistent even when hallucinating), identifying behavior that looks stable but is actually incorrect.
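The post names the two metrics but not their formulas, so the sketch below is one plausible formulation: consistency as agreement with the majority answer across order variants, and overconfidence as that same agreement when the majority answer is wrong. Both function names and definitions are assumptions for illustration.

```python
from collections import Counter

def consistency_rate(answers):
    """Fraction of order variants agreeing with the majority answer.

    answers: the model's final answer under each input-order variant.
    (Hypothetical formulation -- the post does not give the exact formula.)
    """
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)

def overconfidence_rate(answers, gold):
    """Fraction of variants agreeing with a WRONG majority answer --
    the 'pseudo-stability' the metric is meant to expose."""
    majority, count = Counter(answers).most_common(1)[0]
    if majority == gold:
        return 0.0
    return count / len(answers)

ans = ["42", "42", "17", "42"]
consistency_rate(ans)           # 0.75: three of four orders agree
overconfidence_rate(ans, "17")  # 0.75: consistently wrong -- pseudo-stable
```

Reading the two together is what identifies pseudo-stability: a high consistency rate is only reassuring when the overconfidence rate is low.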

## Experimental Evidence: Performance of DGAO

Experiments on RAG, mathematical reasoning, and classification tasks show that DGAO:
- significantly reduces order sensitivity while maintaining high accuracy;
- outperforms existing methods in order fairness;
- generalizes well across different domains and tasks;
- improves overall model performance, balancing accuracy and stability.

## Significance and Outlook: Reinforcement Learning Empowers Model Fairness Research

### Research Significance
DGAO opens up a new direction for using reinforcement learning to improve the robustness and fairness of LLMs.

### Future Outlook
As LLMs are increasingly applied in critical scenarios, order fairness will become more important. DGAO provides a scalable solution and new ideas for model training.

### Open Source Information
The project code has been open-sourced: https://github.com/Hyalinesky/DGAO

## Conclusion: Focus on the Fairness and Consistency of LLMs

The order sensitivity problem of LLMs has long been overlooked, but it actually affects model reliability and fairness. DGAO provides an elegant solution to this problem through the clever application of reinforcement learning. This work reminds us that while pursuing model capabilities, we need to pay attention to the fairness and consistency of their behaviors.
