Zing Forum

Research on Privacy Leakage in Large Language Models: Analysis of Security Threats from Inference Stealing and Output Drift

This article examines privacy leakage in Large Language Models (LLMs). It analyzes inference stealing attacks and output drift, and outlines the security challenges LLMs face in practical deployment along with the strategies available to protect them.

Tags: LLM Security, Privacy Leakage, Inference Stealing, Output Drift, AI Security, Data Protection, Machine Learning Attacks
Published 2026-05-03 05:07 | Recent activity 2026-05-03 09:29 | Estimated read 7 min

Section 01

[Introduction] Research on Privacy Leakage in LLMs: Analysis of Security Threats from Inference Stealing and Output Drift

This article focuses on privacy leakage in Large Language Models (LLMs) and analyzes two phenomena in depth: inference stealing attacks and output drift. Drawing on the llm-privacy-leakage research project developed by AdamOwolabi, it discusses the security challenges and protection strategies of LLMs in practical deployment, covering background, core concepts, experimental design, potential impacts, and protective measures.

Section 02

Research Background and Motivation

Large language models are trained on massive amounts of data that may contain sensitive information. Although developers clean the data as thoroughly as they can, models may still memorize and leak private content. More concerning, attackers can induce a model to output sensitive information through carefully designed queries, using only API access and no access to model parameters or training data. This attack method is called 'inference stealing'.

Section 03

Analysis of Core Concepts: Inference Stealing and Output Drift

Inference Stealing

Inference stealing is an attack method targeting LLM inference services. By constructing queries and analyzing the outputs, an attacker infers sensitive information such as training data and architectural details. Common forms include:

  • Data extraction attacks: Inducing the model to repeat sensitive paragraphs
  • Membership inference attacks: Determining whether a sample was used for training
  • Attribute inference attacks: Inferring statistical characteristics of training data
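
As an illustration of the membership inference idea, here is a minimal sketch of a loss-threshold test, a commonly described baseline. It assumes the attacker can obtain per-token probabilities for a candidate sample, which many APIs do not expose; the function names, the sample probabilities, and the threshold are all hypothetical and are not taken from the llm-privacy-leakage project.

```python
import math

def avg_negative_log_likelihood(token_probs):
    """Mean negative log-likelihood the model assigns to a sample's tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def is_likely_member(token_probs, threshold):
    """Guess 'training member' when the sample's loss falls below the threshold.

    Memorized text tends to receive unusually high token probabilities,
    hence an unusually low loss. The threshold is an assumption that would
    have to be calibrated on samples known to be outside the training set.
    """
    return avg_negative_log_likelihood(token_probs) < threshold

# Hypothetical per-token probabilities for two candidate samples.
seen_sample = [0.9, 0.8, 0.95, 0.85]    # model is confident on every token
unseen_sample = [0.2, 0.1, 0.3, 0.25]   # model is unsure on every token

print(is_likely_member(seen_sample, threshold=1.0))
print(is_likely_member(unseen_sample, threshold=1.0))
```

A low loss alone is weak evidence, since common phrases also score well, so practical attacks usually compare against a reference model or a calibration set.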

Output Drift

Output drift refers to the phenomenon where model outputs change over time. The causes include:

  • Model updates
  • Prompt contamination
  • Adversarial adaptation
  • Context window accumulation

Output drift may cause safety guardrails to fail and increase leakage risks.
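
One simple way to make drift measurable is to replay a fixed set of prompts and compare each new response with a stored baseline. The sketch below uses character-level similarity from Python's standard `difflib`; the function names, the 0.3 threshold, and the sample responses are illustrative assumptions, and a real monitor would more likely compare embeddings or refusal rates.

```python
from difflib import SequenceMatcher

def drift_score(baseline: str, current: str) -> float:
    """Return 1 - similarity ratio: 0.0 for identical text, 1.0 for no overlap."""
    return 1.0 - SequenceMatcher(None, baseline, current).ratio()

def drifted_prompts(baseline: dict, current: dict, threshold: float = 0.3) -> list:
    """List prompts whose current response differs from baseline by more than threshold."""
    return [
        prompt
        for prompt in baseline
        if prompt in current
        and drift_score(baseline[prompt], current[prompt]) > threshold
    ]

# Hypothetical responses recorded at two points in time.
baseline_run = {
    "capital": "Paris is the capital of France.",
    "address": "I cannot share personal data.",
}
later_run = {
    "capital": "Paris is the capital of France.",
    "address": "Sure, here is the home address you asked for.",
}

for prompt in baseline_run:
    print(prompt, round(drift_score(baseline_run[prompt], later_run[prompt]), 2))
```

A refusal that turns into compliance, as in the hypothetical "address" prompt, is the kind of change that matters most for leakage, which is why tracking safety-relevant prompts separately is usually worthwhile.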

Section 04

Technical Implementation and Experimental Design

The llm-privacy-leakage project adopts a systematic experimental approach:

  1. Baseline measurement: Establishing a model output baseline in a controlled environment
  2. Attack simulation: Implementing inference stealing attacks and recording success rates
  3. Drift monitoring: Long-term tracking of output change trends
  4. Comparative analysis: Comparing the performance of different model architectures and scales
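
The attack simulation step can be pictured as a small harness that sends probe prompts and counts how often a known secret appears in the response. This is only a sketch under stated assumptions, not the project's actual code: `query_model` stands for any callable that wraps a model API, and `fake_model` with its planted string is a hypothetical stand-in so the example runs offline.

```python
from typing import Callable, Iterable, Tuple

def extraction_success_rate(
    query_model: Callable[[str], str],
    cases: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of (prompt, secret) cases where the response contains the secret."""
    cases = list(cases)
    if not cases:
        return 0.0
    hits = sum(1 for prompt, secret in cases if secret in query_model(prompt))
    return hits / len(cases)

def fake_model(prompt: str) -> str:
    """Hypothetical model that has 'memorized' one planted string."""
    memorized = {"The API key is": "The API key is sk-test-1234"}
    return memorized.get(prompt, "I can't help with that.")

probe_cases = [
    ("The API key is", "sk-test-1234"),   # planted secret, leaked by fake_model
    ("My SSN is", "000-00-0000"),         # not memorized, so it is refused
]

print(extraction_success_rate(fake_model, probe_cases))  # 0.5
```

Running the same harness against several models, or against the same model at different times, gives the numbers needed for the comparative analysis and drift monitoring steps.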

Key findings:

  • Commercial LLMs that have undergone alignment may still leak fragments of training data
  • Model outputs are highly sensitive to minor changes in prompts
  • Output drift exists, affecting security and consistency

Section 05

Potential Impacts of Privacy Leakage

Individual Users

  • Increased risk of identity theft
  • Violation of personal privacy
  • Disclosure of sensitive conversation content

Enterprises

  • Leakage of trade secrets
  • Exposure of customer data
  • Compliance risks (GDPR, CCPA, etc.)

Model Developers

  • Legal liability and reputation risks
  • Need to invest more resources in security research
  • May restrict API access or increase costs

Section 06

Existing Protection Strategies and Limitations

Data Level

  • Differential privacy training: Adding noise to reduce sample memorization, but the privacy-utility trade-off is difficult to optimize
  • Data cleaning: Removing sensitive information, but over-cleaning may reduce performance
  • Data synthesis: Replacing real sensitive data with synthetic data
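
To make "adding noise" concrete, the sketch below shows the gradient aggregation step usually associated with DP-SGD: clip each per-example gradient to a maximum L2 norm, sum the clipped gradients, add Gaussian noise scaled to that norm, and average. It is written in plain Python for readability and omits privacy accounting entirely; the function names and parameter values are assumptions, and nothing here is confirmed to match what the project uses.

```python
import math
import random

def clip_gradient(grad, max_norm):
    """Scale a gradient vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng=None):
    """Clip each example's gradient, sum, add Gaussian noise, then average.

    The noise standard deviation is noise_multiplier * max_norm, so a single
    example can shift the sum by at most max_norm before noise is added.
    """
    rng = rng or random.Random()
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    noisy = [s + rng.gauss(0.0, noise_multiplier * max_norm) for s in summed]
    return [n / len(clipped) for n in noisy]

# Hypothetical per-example gradients for a two-parameter model.
grads = [[3.0, 4.0], [0.3, 0.4]]
print(private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1,
                            rng=random.Random(0)))
```

The privacy-utility trade-off mentioned above shows up directly in the two knobs: a smaller `max_norm` or a larger `noise_multiplier` limits what any one sample can contribute, but also distorts the gradient the optimizer sees.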

Inference Level

  • Output filtering: Blocking sensitive responses, but may be bypassed by adversarial attacks
  • Rate limiting: Increasing attack costs, but affecting user experience
  • Query auditing: Recording abnormal patterns, but increasing system complexity
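
The first two inference-level defenses can be sketched in a few lines. The example below pairs a regex-based output filter with a sliding-window rate limiter; the patterns, the placeholder strings, and the class name are illustrative assumptions, and the two regexes cover only e-mail addresses and US-style SSNs, so they are far from a complete PII detector.

```python
import re
import time
from collections import defaultdict, deque

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact(text: str) -> str:
    """Replace e-mail addresses and SSN-shaped numbers with placeholders."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return SSN_RE.sub("[REDACTED_SSN]", text)

class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client_id -> timestamps of accepted requests

    def allow(self, client_id: str, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[client_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True

print(redact("Contact alice@example.com, SSN 123-45-6789"))
limiter = SlidingWindowLimiter(max_requests=2, window_seconds=10.0)
print([limiter.allow("client-a", now=t) for t in (0.0, 1.0, 2.0, 10.0)])
```

Pattern-based filters illustrate the bypass problem noted above: an attacker who asks the model to spell out digits or insert separators will slip past both regexes, which is why filtering is usually combined with the other layers.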

Alignment Training

Models are trained through RLHF to refuse privacy-related requests, but 'jailbreak' prompts can still bypass these safety guardrails.

Section 07

Recommendations and Future Research Directions

Recommendations

  • Developers: Adopt the zero-trust principle, implement multi-layer protection, establish an audit and monitoring system, and keep sensitive data in on-premises or private-cloud deployments
  • Users: Avoid inputting sensitive information, understand privacy policies, and maintain critical thinking about AI content

Future Research Directions

  1. Establish standardized privacy leakage risk assessment indicators
  2. Develop tools for real-time monitoring of output drift and abnormal queries
  3. Design adaptive protection mechanisms
  4. Systematically compare the privacy characteristics of different models