Zing Forum

Reading

LaRA: A New Method for Detecting Data Contamination in RL-Finetuned Large Models

LaRA proposes a framework based on hierarchical representation analysis, using three complementary metrics—perturbation sensitivity, direction collapse, and local rigidity—to effectively detect data contamination in RL-finetuned large language models (LLMs).

数据污染检测强化学习大语言模型表示学习模型评估RL后训练
Published 2026-05-28 21:13Recent activity 2026-05-29 14:24Estimated read 7 min
LaRA: A New Method for Detecting Data Contamination in RL-Finetuned Large Models
1

Section 01

LaRA: A New Method for Detecting Data Contamination in RL-Finetuned Large Models (Introduction)

LaRA is a framework based on hierarchical representation analysis. It uses three complementary metrics—perturbation sensitivity, direction collapse, and local rigidity—to effectively detect data contamination in RL-finetuned LLMs. This method breaks through the limitations of traditional approaches that rely on output layer signals, delving into the model's internal representation space to analyze changes in geometric properties, thus providing a new tool for data quality detection in AI models.

2

Section 02

Background and Problem

Reinforcement Learning (RL) finetuning is an important method to improve the reasoning ability of large language models, but the problem of detecting data contamination in RL finetuning has long been overlooked. Data contamination refers to the mixing of training data into the test set, leading to inflated evaluation results. Traditional detection methods rely on output layer signals (such as token likelihood), which have limited effectiveness for RL-finetuned models because RL shapes behavior through trajectory-level rewards rather than token likelihood.

3

Section 03

Core Ideas of the LaRA Framework and Three Key Metrics

The core idea of the LaRA framework is to delve into the model's internal representation space and analyze changes in the geometric properties of each hidden layer (memorizing contaminated data leads to geometric changes in the representation space that propagate across layers). The three complementary metrics are: 1. Perturbation Sensitivity: The degree of change in internal representations after minor input perturbations; contaminated data makes the model overly sensitive to specific inputs. 2. Direction Collapse: Contamination causes representations of related inputs to compress into similar directions, reducing diversity. 3. Local Rigidity: The stiffness of local neighborhoods in the representation space; contamination makes certain regions lack flexibility.

4

Section 04

Design of the LaRA Detection Protocol

Steps of the LaRA detection protocol: 1. Apply controlled perturbations to candidate samples and extract representation vectors from each hidden layer. 2. Calculate the values of the three metrics for each layer. 3. Aggregate results across layers and metrics to form a contamination score. Advantages of hierarchical aggregation: Contamination signals are weak in shallow layers and amplified in deep layers; integrating multi-layer information improves detection sensitivity.

5

Section 05

Experimental Validation and Results

The research team conducted experiments on RL-finetuned reasoning models. The results show that the LaRA detection protocol significantly outperforms traditional output-layer baseline methods. LaRA maintains high detection accuracy across different types of data contamination scenarios, verifying the superiority of representation layer analysis—even if RL changes the output behavior, the geometric properties of internal representations still retain traces of contamination.

6

Section 06

Practical Significance and Application Prospects

LaRA can serve as a data quality detection tool for AI research institutions and enterprises: screening training data before training, and verifying whether the test set is contaminated during evaluation. It also provides a new perspective for understanding the impact of RL training on the model's internal representations—by analyzing changes in geometric properties across layers, we can gain deeper insights into the RL training mechanism.

7

Section 07

Summary and Outlook

LaRA breaks through the limitations of traditional methods and innovatively uses hierarchical representation analysis to identify contamination. The three complementary metrics reflect an understanding of the geometric properties of the representation space, and the hierarchical aggregation strategy leverages the layer-wise propagation characteristics of contamination signals. In the future, it is expected to be extended to more models and training paradigms, becoming one of the standard tools for AI model quality assurance.