DGAO: Addressing the Order Sensitivity of Large Language Models with Reinforcement Learning

The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen) and Baidu Research jointly propose the DGAO framework, which for the first time applies reinforcement learning to the study of order fairness in large language models (LLMs), significantly reducing order sensitivity while improving model accuracy.

Tags: Large Language Models, Order Fairness, Reinforcement Learning, RAG, DGAO, Machine Learning

Published 2026-05-12 19:31 | Recent activity 2026-05-13 10:47 | Estimated read 6 min

Section 01

[Introduction] DGAO Framework: Addressing the Order Sensitivity of Large Language Models with Reinforcement Learning

The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen) and Baidu Research jointly propose DGAO (Dual Group Advantage Optimization), a framework that for the first time applies reinforcement learning to the study of order fairness in large language models (LLMs). It significantly reduces order sensitivity while improving model accuracy, offering a new approach to the order bias problem in LLMs.

Section 02

Background: The Order Sensitivity Problem of LLMs and Limitations of Existing Methods

The Order Sensitivity Problem

Large language models are order-sensitive: the same information presented in a different order can yield outputs of drastically different quality. The effect is most pronounced in scenarios such as RAG (Retrieval-Augmented Generation) and in-context learning, where it undermines model reliability and fairness.
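
To make the problem concrete, the following minimal sketch probes order sensitivity by asking the same question over shuffled copies of the same retrieved passages. Every name here (`order_variants`, `build_prompt`, `probe_order_sensitivity`, and the toy `first_passage_model`) is hypothetical and chosen for illustration; the prompt layout is an assumption, and nothing below is taken from the DGAO code.

```python
import random
from typing import Callable, List, Sequence


def order_variants(passages: Sequence[str], k: int, seed: int = 0) -> List[List[str]]:
    """Return up to k distinct orderings, always starting with the original order."""
    rng = random.Random(seed)
    variants = [list(passages)]
    seen = {tuple(passages)}
    attempts = 0
    # Bounded retry loop so short inputs with few permutations cannot hang.
    while len(variants) < k and attempts < 100 * k:
        shuffled = list(passages)
        rng.shuffle(shuffled)
        attempts += 1
        if tuple(shuffled) not in seen:
            seen.add(tuple(shuffled))
            variants.append(shuffled)
    return variants


def build_prompt(question: str, passages: Sequence[str]) -> str:
    """Assumed prompt layout: numbered passages followed by the question."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"{context}\n\nQuestion: {question}\nAnswer:"


def probe_order_sensitivity(
    model_fn: Callable[[str], str],
    question: str,
    passages: Sequence[str],
    k: int = 4,
    seed: int = 0,
) -> List[str]:
    """Collect one answer per order variant of the same passages."""
    return [model_fn(build_prompt(question, v)) for v in order_variants(passages, k, seed)]


def first_passage_model(prompt: str) -> str:
    """Toy stand-in for an LLM with extreme position bias: echoes passage [1]."""
    first_line = prompt.splitlines()[0]
    return first_line.split("] ", 1)[1]


if __name__ == "__main__":
    answers = probe_order_sensitivity(
        first_passage_model, "Which city is the capital?", ["Paris", "Lyon", "Nice"]
    )
    # More than one distinct answer means the output depends on passage order.
    print(answers, "order-sensitive:", len(set(answers)) > 1)
```

In practice `model_fn` would wrap a real LLM call; the toy model only exists to show what an order-dependent answer looks like.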

Dilemmas of Existing Methods

Statistical/search methods: Try to find the optimal input permutation, but they add inference overhead and do not address the root cause of order bias;
Supervised fine-tuning methods: Train on multiple order variants to reduce sensitivity, but at the cost of accuracy. They can also make the model overly stable on wrong information, so that it returns the same hallucinated output regardless of input order.

Section 03

DGAO Framework: Core Design of Dual Group Advantage Optimization

Core Idea

DGAO achieves its goals by optimizing two dimensions simultaneously:

Intra-group relative accuracy advantage: Encourages correct outputs under the same input order;
Inter-group relative stability advantage: Encourages stable performance across different input orders.

Technical Implementation

DGAO adopts a reinforcement learning training paradigm:

Generate multiple order variants for the same set of inputs;
Evaluate the model's performance under different orders;
Calculate accuracy and stability advantages;
Update parameters via policy gradients to make the model focus on content semantics rather than input order.
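
The steps above can be sketched as follows. The article does not give DGAO's formulas, so everything in this block is an assumption made for illustration: the intra-group term is a standard group-normalized reward (each group holds the responses sampled under one input order), the inter-group term scores how often a response agrees with responses sampled under other orders, and `stability_weight` is a hypothetical mixing coefficient. All function names are invented here and are not taken from the DGAO repository.

```python
from statistics import mean, pstdev
from typing import List

EPS = 1e-6  # avoids division by zero when all values in a set are equal


def _normalize(values: List[float]) -> List[float]:
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + EPS) for v in values]


def intra_group_advantages(rewards: List[List[float]]) -> List[List[float]]:
    """rewards[k][g]: correctness reward of sample g under order variant k.
    Each response is compared only with responses from the same order."""
    return [_normalize(group) for group in rewards]


def inter_group_advantages(answers: List[List[str]]) -> List[List[float]]:
    """answers[k][g]: final answer of sample g under order variant k.
    Assumed stability signal: share of responses from OTHER orders that give
    the same answer, normalized over all responses. Needs at least 2 groups."""
    scores = []
    for k, group in enumerate(answers):
        others = [a for j, g in enumerate(answers) if j != k for a in g]
        scores.append([sum(a == o for o in others) / len(others) for a in group])
    flat = _normalize([s for group in scores for s in group])
    out, start = [], 0
    for group in scores:  # restore the [k][g] shape
        out.append(flat[start:start + len(group)])
        start += len(group)
    return out


def dual_group_advantages(
    rewards: List[List[float]],
    answers: List[List[str]],
    stability_weight: float = 0.5,  # hypothetical coefficient
) -> List[List[float]]:
    """Combine both terms into one advantage per sampled response."""
    intra = intra_group_advantages(rewards)
    inter = inter_group_advantages(answers)
    return [
        [a + stability_weight * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(intra, inter)
    ]


if __name__ == "__main__":
    # Two order variants, two samples each; "A" is the correct answer.
    rewards = [[1.0, 1.0], [1.0, 0.0]]
    answers = [["A", "A"], ["A", "B"]]
    print(dual_group_advantages(rewards, answers))
```

In a policy-gradient update, each advantage would weight the log-probability of its response, so responses that are both correct and consistent with other orders are reinforced. Note that agreement alone would also reward consistently wrong answers; in this sketch the accuracy term is what counteracts that.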

Section 04

New Evaluation Metrics: Key Tools for Identifying Pseudo-Stability

The research team proposes two new metrics to comprehensively evaluate order fairness:

Consistency rate: Measures how consistent the model's outputs are across different input orders;
Overconfidence rate: Exposes false stability, where the model stays consistent on wrong answers (for example, repeating the same hallucination under every order). It flags behavior that looks stable but is actually incorrect.
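
As a rough illustration of how the two metrics differ, the sketch below uses one plausible reading of them. The article does not state the exact definitions, so these are assumptions: a question counts as consistent when every order variant yields the same answer, and as overconfident when that shared answer is wrong; both rates are taken over all questions. The function names are hypothetical.

```python
from typing import List


def consistency_rate(predictions: List[List[str]]) -> float:
    """predictions[q]: answers to question q, one per input order.
    Assumed definition: share of questions answered identically under all orders."""
    consistent = sum(len(set(p)) == 1 for p in predictions)
    return consistent / len(predictions)


def overconfidence_rate(predictions: List[List[str]], gold: List[str]) -> float:
    """Assumed definition: share of questions where all orders agree
    on an answer that does not match the gold answer."""
    overconfident = sum(
        len(set(p)) == 1 and p[0] != g for p, g in zip(predictions, gold)
    )
    return overconfident / len(predictions)


if __name__ == "__main__":
    predictions = [
        ["A", "A", "A"],  # consistent and correct
        ["B", "B", "B"],  # consistent but wrong: pseudo-stability
        ["C", "D", "C"],  # order-sensitive
        ["E", "E", "E"],  # consistent and correct
    ]
    gold = ["A", "X", "C", "E"]
    print(consistency_rate(predictions))           # 0.75
    print(overconfidence_rate(predictions, gold))  # 0.25
```

The second question raises the consistency rate even though the answer is wrong, which is exactly the case a consistency metric alone cannot distinguish from genuine robustness.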

Section 05

Experimental Evidence: Performance of DGAO

According to the reported results on RAG, mathematical reasoning, and classification tasks, DGAO:

Significantly reduces order sensitivity while maintaining high accuracy;
Outperforms existing methods on order fairness;
Generalizes well, adapting to different domains and tasks;
Improves overall model performance, balancing accuracy and stability.

Section 06

Significance and Outlook: Reinforcement Learning Empowers Model Fairness Research

Research Significance

DGAO opens up a new direction for using reinforcement learning to improve the robustness and fairness of LLMs.

Future Outlook

As LLMs are increasingly applied in critical scenarios, order fairness will become more important. DGAO provides a scalable solution and new ideas for model training.

Open Source Information

The project code has been open-sourced: https://github.com/Hyalinesky/DGAO

Section 07

Conclusion: Focus on the Fairness and Consistency of LLMs

The order sensitivity of LLMs has long been overlooked, yet it directly affects model reliability and fairness. DGAO addresses it by using reinforcement learning to optimize accuracy and stability together. The work is a reminder that, alongside model capability, the fairness and consistency of model behavior deserve attention.
