Zing Forum


Reproduction of the MIN-K% Prob Method: Detecting Membership Inference Attacks in Large Language Model Pre-training Data

Complete reproduction and extended analysis of the ICLR 2024 paper, verifying the effectiveness of MIN-K% Prob in membership inference attacks and finding that model size and text length have significant impacts on detection quality.

Tags: Membership Inference Attacks · Large Language Models · MIN-K% Prob · Data Privacy · WikiMIA · Pre-training Data Detection · Model Security · ICLR 2024
Published 2026-04-22 05:43 · Recent activity 2026-04-22 05:50 · Estimated read 7 min

Section 01

[Introduction] Reproduction of the MIN-K% Prob Method: Detecting Membership Inference Attacks in Large Language Model Pre-training Data

This article presents a complete reproduction and extended analysis of the ICLR 2024 'MIN-K% Prob' paper, verifying the method's effectiveness for membership inference attacks and finding that model size and text length have significant impacts on detection quality. Focusing on the privacy and security of large language model pre-training data, it determines membership through black-box analysis of the model's probability distribution, providing a practical tool for model auditing and privacy protection.


Section 02

Background: Research Significance of Membership Inference Attacks and Limitations of Traditional Methods

With the widespread deployment of large language models, Membership Inference Attacks (MIA) have become a key privacy concern: given a text and black-box access to a model, determine whether the text was part of the model's pre-training data. Traditional MIA methods require a reference model or access to the training corpus, limiting their practical applicability. The MIN-K% Prob method proposed in the ICLR 2024 paper removes this requirement by determining membership solely from the model's probabilities on "difficult tokens".


Section 03

Core Mechanism of the MIN-K% Prob Method

MIN-K% Prob is based on the observation that texts seen during pre-training (members) and unseen texts (non-members) produce different probability distributions: non-member texts tend to contain "outlier tokens" with abnormally low probabilities, while member texts do not. The algorithm proceeds in three steps:

1. Obtain the conditional log probability of each token in the text.
2. Select the k% of tokens with the lowest probabilities.
3. Average the log-likelihoods of these tokens as the detection score: scores closer to 0 indicate a likely member, while more negative scores indicate a likely non-member.
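The scoring step can be sketched in plain Python, assuming the per-token log-probabilities have already been extracted from the model; the helper name and toy values below are illustrative, not taken from the paper's code:

```python
def min_k_prob_score(token_logprobs, k=0.2):
    """MIN-K% Prob: average log-probability of the k% lowest-probability tokens.

    token_logprobs: per-token conditional log-probabilities from the model.
    Scores nearer 0 suggest a member; more negative suggests a non-member.
    """
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]  # the k% "difficult" tokens
    return sum(lowest) / n

# Toy illustration: a "member"-like text with no outlier tokens vs. a
# "non-member"-like text containing a few very low-probability tokens.
member_lps = [-0.5, -0.8, -1.0, -0.6, -0.9, -0.7, -0.4, -0.8, -1.1, -0.6]
nonmember_lps = [-0.5, -0.8, -9.0, -0.6, -7.5, -0.7, -0.4, -8.2, -1.1, -0.6]

print(min_k_prob_score(member_lps))     # closer to 0
print(min_k_prob_score(nonmember_lps))  # much more negative
```

With k=20%, the two lowest of ten tokens are averaged, so the outlier tokens in the non-member text dominate its score.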


Section 04

Experimental Design and Reproduction Environment

The reproduction uses the free T4 GPU environment of Google Colab to complete five progressive experiments. The models come from EleutherAI's Pythia family (70M to 2.8B parameters), and the evaluation benchmark is WikiMIA: member data is drawn from Wikipedia articles written before 2017 (seen by Pythia during pre-training), while non-member data comes from event articles written after 2023 (unseen by the model); this temporal split keeps the evaluation fair and reliable.


Section 05

Key Experimental Findings: Effectiveness and Comparison with Baselines

  1. A fine-tuning comparison verifies the basic effectiveness of MIN-K%: the model can distinguish articles it was fine-tuned on from unseen texts.
  2. Against the baseline methods (perplexity and zlib entropy), MIN-K% consistently outperforms on the hand-constructed datasets.
  3. On the WikiMIA benchmark, Pythia-2.8B achieves an AUC of 0.5956, lower than the paper's 0.67 but consistent in trend; the gap is attributable to the small sample size and limited compute, which still supports reproducibility.
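For context, the two baselines and the AUC metric can be sketched with the standard library alone; the helper names are my own, and the score values at the bottom are hypothetical, not the reproduction's numbers:

```python
import math
import zlib

def perplexity_score(token_logprobs):
    """PPL baseline: negated perplexity, so higher = more member-like."""
    return -math.exp(-sum(token_logprobs) / len(token_logprobs))

def zlib_score(text, token_logprobs):
    """Zlib baseline: model log-likelihood normalized by compressed length."""
    return sum(token_logprobs) / len(zlib.compress(text.encode("utf-8")))

def auc(member_scores, nonmember_scores):
    """AUC via pairwise comparison: P(member score > non-member score)."""
    wins = sum(
        1.0 if m > n else 0.5 if m == n else 0.0
        for m in member_scores
        for n in nonmember_scores
    )
    return wins / (len(member_scores) * len(nonmember_scores))

# Hypothetical detection scores for illustration (not real WikiMIA numbers).
members = [-1.0, -1.2, -0.9, -1.4]
nonmembers = [-2.1, -1.1, -2.5, -1.8]
print(auc(members, nonmembers))
```

The pairwise formulation of AUC avoids any dependency on scikit-learn and makes the metric's meaning explicit: the probability that a random member scores higher than a random non-member.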

Section 06

Hyperparameter Tuning and Analysis of Model Size Effects

Hyperparameter scanning found that the optimal k for the smaller models is 10% (versus 20% for the large models in the paper), so the choice of k is sensitive to model size. Multi-model experiments further confirm that detection quality is positively correlated with both model size and text length: larger models memorize more of their training data, and longer texts provide more signal, strengthening detection.
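The k-scan described above can be sketched as follows; the candidate k values, toy log-probabilities, and helper names are illustrative assumptions, not the experiment's actual data:

```python
def min_k_score(logprobs, k):
    """Average log-probability of the k% lowest-probability tokens."""
    n = max(1, int(len(logprobs) * k))
    return sum(sorted(logprobs)[:n]) / n

def auc(ms, ns):
    """Pairwise AUC: P(member score > non-member score)."""
    wins = sum(1.0 if m > n else 0.5 if m == n else 0.0 for m in ms for n in ns)
    return wins / (len(ms) * len(ns))

# Toy per-text log-probabilities: non-members carry one outlier token but are
# otherwise high-probability, so averaging over ALL tokens (k=1.0) dilutes
# the signal that a small k isolates.
members = [[-1.5, -1.4, -1.6, -1.5, -1.4], [-1.3, -1.5, -1.4, -1.6, -1.2]]
nonmembers = [[-0.3, -0.4, -6.0, -0.3, -0.4], [-0.4, -0.2, -7.0, -0.3, -0.5]]

# Sweep k on a validation split and keep the value maximizing AUC.
best_k, best_auc = max(
    ((k, auc([min_k_score(t, k) for t in members],
             [min_k_score(t, k) for t in nonmembers]))
     for k in (0.1, 0.2, 0.5, 1.0)),
    key=lambda kv: kv[1],
)
print(best_k, best_auc)
```

On this toy data a small k separates the two groups perfectly, while k=1.0 (plain average log-likelihood) does not, which mirrors why the outlier-focused score beats the perplexity baseline.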


Section 07

Advantages and Limitations of the MIN-K% Prob Method

Advantages: black-box friendly. No reference model or access to the training data is needed; the method works from the model's probability outputs alone, giving it high practical value. Limitations: 1. detection accuracy still has room for improvement (AUC around 0.6); 2. the optimal k depends on model size, adding deployment complexity; 3. applicability to instruction-tuned or RLHF models needs further verification.


Section 08

Privacy and Security Implications and Open Source Contributions

Privacy implications: the method gives developers an auditing tool for detecting sensitive data in training corpora, while also exposing how readily models memorize their training data. On the compliance side, membership inference detection tools are likely to become part of model compliance workflows, requiring a balance between performance and privacy (e.g., differentially private training, data deduplication). Open source contributions: the project is released under the MIT license with complete notebook implementations, code, visualizations, and documentation, facilitating academic reproduction and community improvement and serving as a starting point for MIA researchers.