# Privacy LLM Inference: A Privacy-Preserving Large Model Inference Scheme Based on Mask Obfuscation

> A PyTorch prototype project that explores privacy-preserving large model inference using masking and padding techniques, verifies the correctness of obfuscated execution of Transformer models in a simulated Trusted Execution Environment (TEE), and provides technical references for the integration of privacy computing and AI inference.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-03T01:41:43.000Z
- Last activity: 2026-06-03T01:56:53.937Z
- Popularity: 163.8
- Keywords: privacy computing, TEE, large models, Transformer, mask obfuscation, GPT-2, PyTorch, secure inference, KV Cache, attention mechanism
- Page link: https://www.zingnex.cn/en/forum/thread/privacy-llm-inference
- Canonical: https://www.zingnex.cn/forum/thread/privacy-llm-inference
- Markdown source: floors_fallback

---

## [Introduction] Privacy LLM Inference: A Privacy-Preserving Large Model Inference Scheme Based on Mask Obfuscation

Privacy LLM Inference is a PyTorch prototype that explores privacy-preserving large model inference using masking and padding techniques in a simulated Trusted Execution Environment (TEE). It verifies that Transformer models still compute correct results when executed in obfuscated form, and it serves as a technical reference for combining privacy computing with AI inference. The problem it targets is protecting input data and model parameters during large model inference on untrusted GPUs.

## Background and Core Concepts: Dual-Domain Execution Model and Security Boundaries

### Dual-Domain Execution Model
- **Trusted Domain (SimulatedTEE)**: Can access plaintext input, generate masks and padding, manage LoRA adapters, generate compensation tensors, and deobfuscate outputs.
- **Untrusted Domain (UntrustedGPUExecutor)**: Only sees obfuscated input, transformed weights and adapters, compensation tensors, and obfuscated output; it is not meant to be able to recover the plaintext.

### Security Boundary Notes
The current version is a prototype for algebraic correctness verification, and does not provide real security isolation, side-channel protection, memory isolation, authentication, or production-grade TEE guarantees. The interface design is prepared for subsequent integration with real TEEs.

## Technical Implementation Evolution: From Basic Linear Layers to Full GPT-2 Verification

The project iterates in phases:
1. **Stage1**: Verify obfuscation of a basic linear layer;
2. **Stage1-LoRA**: Extend the obfuscation mechanism to LoRA adapters;
3. **Stage2**: Implement a complete Transformer block (including attention and residual connections);
4. **Stage3**: Support Prefill/Decode and the KV Cache;
5. **Stage4.x**: Integrate HuggingFace and verify each module of GPT-2;
6. **Stage5.0**: Experimental verification (attention probes, workload analyzer).

## Key Technical Details: Mask Padding Mechanism and Attention/KV Cache Management

### Mask and Padding Mechanism
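- **Mask Mode**: X_tilde = X·N_in, W_tilde = N_in⁻¹·W·N_out, Y_tilde = X_tilde·W_tilde, Y = Y_tilde·N_out⁻¹, where N_in and N_out are invertible mask matrices;
- **Padding Mode**: Introduce a compensation tensor T, with X_tilde = (X−T)·N_in and Y_tilde = X_tilde·W_tilde + C_T, where C_T = T·W·N_out. Since X_tilde·W_tilde = (X−T)·W·N_out, adding C_T gives Y_tilde = X·W·N_out, and Y = Y_tilde·N_out⁻¹ as in mask mode.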
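
The two modes above can be checked numerically with a few lines of PyTorch. The following is a minimal sketch, not the project's code: the helper `random_invertible`, the tensor shapes, and the use of orthogonal matrices as masks (chosen here only because their inverses are numerically stable) are all assumptions made for illustration.

```python
import torch

torch.manual_seed(0)
dtype = torch.float64  # double precision keeps round-off well below the tolerance


def random_invertible(n: int) -> torch.Tensor:
    """Hypothetical helper: sample an orthogonal (hence invertible) mask via QR."""
    q, _ = torch.linalg.qr(torch.randn(n, n, dtype=dtype))
    return q


batch, d_in, d_out = 4, 8, 6
X = torch.randn(batch, d_in, dtype=dtype)   # plaintext input
W = torch.randn(d_in, d_out, dtype=dtype)   # plaintext weight
Y_plain = X @ W                             # reference result

# Trusted side: sample masks and transform the weight once.
N_in = random_invertible(d_in)
N_out = random_invertible(d_out)
W_tilde = torch.linalg.inv(N_in) @ W @ N_out

# Mask mode: X_tilde = X·N_in, Y = Y_tilde·N_out^-1.
X_tilde = X @ N_in
Y_tilde = X_tilde @ W_tilde                 # the only step run on the untrusted side
Y_mask = Y_tilde @ torch.linalg.inv(N_out)

# Padding mode: X_tilde = (X - T)·N_in, Y_tilde = X_tilde·W_tilde + C_T.
T = torch.randn(batch, d_in, dtype=dtype)   # compensation tensor (random here)
X_tilde_pad = (X - T) @ N_in
C_T = T @ W @ N_out                         # computed by the trusted side
Y_tilde_pad = X_tilde_pad @ W_tilde + C_T
Y_pad = Y_tilde_pad @ torch.linalg.inv(N_out)

mask_ok = torch.allclose(Y_mask, Y_plain)
pad_ok = torch.allclose(Y_pad, Y_plain)
print("mask mode matches plaintext:", mask_ok)
print("padding mode matches plaintext:", pad_ok)
```

In this sketch the untrusted side only ever multiplies `X_tilde` by `W_tilde` and adds `C_T`; every step that touches `X`, `W`, `T`, or the masks stays on the trusted side.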

### Attention Mask Propagation
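With Q_tilde = Q·N_Q and K_tilde = K·N_K, the masks are constrained so that N_Q·N_Kᵀ = I. Then Q_tilde·K_tildeᵀ = Q·N_Q·N_Kᵀ·Kᵀ = Q·Kᵀ, so the attention scores computed from obfuscated queries and keys equal the plaintext scores.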
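
A short numerical check of this constraint is sketched below. It assumes a single head with no batch dimension, and picks N_Q as a diagonally shifted random matrix purely for illustration; how the project actually samples N_Q and N_K is not stated in this write-up.

```python
import math

import torch

torch.manual_seed(0)
dtype = torch.float64
seq_len, d_head = 5, 8

Q = torch.randn(seq_len, d_head, dtype=dtype)
K = torch.randn(seq_len, d_head, dtype=dtype)

# Assumed choice of N_Q: a random matrix shifted along the diagonal so it is
# comfortably invertible for this seed.
N_Q = torch.randn(d_head, d_head, dtype=dtype) + d_head * torch.eye(d_head, dtype=dtype)
# Solve N_Q @ N_K.T = I for N_K, i.e. N_K = (N_Q^-1)^T.
N_K = torch.linalg.inv(N_Q).T

Q_tilde = Q @ N_Q
K_tilde = K @ N_K

scale = 1.0 / math.sqrt(d_head)
scores_plain = (Q @ K.T) * scale
scores_tilde = (Q_tilde @ K_tilde.T) * scale

constraint_ok = torch.allclose(N_Q @ N_K.T, torch.eye(d_head, dtype=dtype))
scores_ok = torch.allclose(scores_plain, scores_tilde)
probs_ok = torch.allclose(
    torch.softmax(scores_plain, dim=-1),
    torch.softmax(scores_tilde, dim=-1),
)
print(constraint_ok, scores_ok, probs_ok)
```

Because the scores agree before the softmax, the attention probabilities agree as well, which is what the final comparison confirms.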

### KV Cache Management
Each head maintains its own N_K and N_V. Prefill samples the masks and Decode reuses them, so the invariants K_tilde = K·N_K and V_tilde = V·N_V hold for every cached position.
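
The cache behaviour described above might look roughly like the sketch below. The class `MaskedKVCache`, its method names, the `(heads, seq, d_head)` layout, and the orthogonal masks are hypothetical choices made for this example; they are not taken from the project's code.

```python
import torch

torch.manual_seed(0)
dtype = torch.float64


class MaskedKVCache:
    """Hypothetical cache holding K_tilde = K·N_K and V_tilde = V·N_V per head."""

    def __init__(self, n_heads: int, d_head: int):
        self.n_heads, self.d_head = n_heads, d_head
        self.N_K = self.N_V = None
        self.K_tilde = self.V_tilde = None

    def _sample_masks(self) -> torch.Tensor:
        # One orthogonal (hence invertible) mask per head, via batched QR.
        shape = (self.n_heads, self.d_head, self.d_head)
        q, _ = torch.linalg.qr(torch.randn(shape, dtype=dtype))
        return q

    def prefill(self, K: torch.Tensor, V: torch.Tensor) -> None:
        # Masks are sampled exactly once, when the prompt is processed.
        self.N_K = self._sample_masks()
        self.N_V = self._sample_masks()
        self.K_tilde = K @ self.N_K   # batched matmul over the head dimension
        self.V_tilde = V @ self.N_V

    def decode(self, k_new: torch.Tensor, v_new: torch.Tensor) -> None:
        # New tokens reuse the prefill masks and are appended along the sequence axis.
        self.K_tilde = torch.cat([self.K_tilde, k_new @ self.N_K], dim=1)
        self.V_tilde = torch.cat([self.V_tilde, v_new @ self.N_V], dim=1)


n_heads, d_head, prompt_len = 2, 4, 3
K = torch.randn(n_heads, prompt_len + 1, d_head, dtype=dtype)  # prompt + 1 new token
V = torch.randn(n_heads, prompt_len + 1, d_head, dtype=dtype)

cache = MaskedKVCache(n_heads, d_head)
cache.prefill(K[:, :prompt_len], V[:, :prompt_len])
cache.decode(K[:, prompt_len:], V[:, prompt_len:])

# The invariant should hold over the whole cache, prompt and decoded token alike.
invariant_ok = (
    torch.allclose(cache.K_tilde, K @ cache.N_K)
    and torch.allclose(cache.V_tilde, V @ cache.N_V)
)
print("cache invariant holds:", invariant_ok)
```

Reusing the masks in `decode` is what keeps old and new cache entries in the same obfuscated basis; resampling them per step would require re-masking everything already cached.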

## Experiments and Verification: Correctness and Performance Analysis

### Experiment Scripts
- `run_experiment_summary.py`: Re-runs the verification for each stage and generates summary results (JSON/CSV/MD);
- `run_attention_experiments.py`: Sweeps parameters such as batch_size and seq_len to verify attention invariance.

### Workload Analysis
The workload analyzer compares the TEE/GPU cost models of five execution strategies: `plain_hf_gpu`, `tslp_trusted_nonlinear_baseline`, `ours_current`, `ours_ideal_gpu_nonlinear`, and `amulet_style_reference`.

## Current Limitations and Disclaimer

### Engineering Simplifications
The prototype adopts simplifications such as trusted LayerNorm and trusted GELU and does not implement full obfuscation; it prioritizes end-to-end correctness verification.

### Security Statement
The prototype does not provide real TEE isolation, side-channel protection, memory isolation, or authentication mechanisms, and it is not a production-ready solution.

### Research Nature
The prototype is intended to verify algebraic correctness and to explore the feasibility of a collaborative TEE+GPU architecture, as a reference for later production solutions.

## Applicable Scenarios and Value

The project has reference value for the following fields:
- Privacy computing research: Explore privacy-preserving schemes for TEE and GPU collaboration;
- Large model secure deployment: Secure inference in untrusted environments;
- Federated learning: Reference for the inference side of distributed privacy-preserving training;
- Enterprise AI deployment: Scenarios where model parameters and user data need protection;
- Academic writing: Supplies experimental data and technical details as supporting material.

## Summary: Significance of the Research Prototype

Privacy LLM Inference builds a mask-obfuscation scheme for privacy-preserving large model inference and verifies it stage by stage, from a single linear layer up to the modules of GPT-2. It remains a research prototype, but its explicit algebraic design and staged experimental verification make it a useful technical reference for work that combines privacy computing with large models.