Zing Forum

Privacy LLM Inference: A Privacy-Preserving Large Model Inference Scheme Based on Mask Obfuscation

A PyTorch prototype project that explores privacy-preserving large model inference using masking and padding techniques, verifies the correctness of obfuscated execution of Transformer models in a simulated Trusted Execution Environment (TEE), and provides technical references for the integration of privacy computing and AI inference.

Privacy computing, TEE, large models, Transformer, mask obfuscation, GPT-2, PyTorch, secure inference, KV Cache, attention mechanism
Published 2026-06-03 09:41 | Recent activity 2026-06-03 09:56 | Estimated read 7 min

Section 01

[Introduction] Privacy LLM Inference: A Privacy-Preserving Large Model Inference Scheme Based on Mask Obfuscation

Privacy LLM Inference is a PyTorch prototype. It explores privacy-preserving large model inference with masking and padding techniques, using a simulated Trusted Execution Environment (TEE) to verify that Transformer models still compute correct results when executed in obfuscated form. The target problem is protecting input data and model parameters during large model inference on untrusted GPUs. The result is intended as a technical reference for combining privacy computing with AI inference.

Section 02

Background and Core Concepts: Dual-Domain Execution Model and Security Boundaries

Dual-Domain Execution Model

  • Trusted Domain (SimulatedTEE): Can access plaintext input, generate masks and padding, manage LoRA adapters, generate compensation tensors, and deobfuscate output.
  • Untrusted Domain (UntrustedGPUExecutor): Handles only obfuscated input, transformed weights and adapters, compensation tensors, and obfuscated output, and cannot recover plaintext.
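
The split between the two domains can be sketched for a single linear layer as follows. This is a minimal NumPy illustration rather than the project's PyTorch code: only the class names SimulatedTEE and UntrustedGPUExecutor come from the project, while the method names, the `random_mask` helper, and the mask sampling rule are assumptions made for the example.

```python
import numpy as np


def random_mask(rng, n):
    # Gaussian matrix plus a diagonal shift. The shift only keeps the mask
    # well-conditioned in this sketch; it is an assumption, not the
    # project's sampling rule.
    return rng.standard_normal((n, n)) + 3.0 * np.sqrt(n) * np.eye(n)


class SimulatedTEE:
    """Trusted side: sees plaintext and owns the masks (hypothetical API)."""

    def __init__(self, d_in, d_out, seed=0):
        rng = np.random.default_rng(seed)
        self.n_in = random_mask(rng, d_in)
        self.n_out = random_mask(rng, d_out)

    def transform_weight(self, w):
        # W_tilde = N_in^{-1} · W · N_out
        return np.linalg.solve(self.n_in, w) @ self.n_out

    def obfuscate(self, x):
        # X_tilde = X · N_in
        return x @ self.n_in

    def deobfuscate(self, y_tilde):
        # Y = Y_tilde · N_out^{-1}
        return y_tilde @ np.linalg.inv(self.n_out)


class UntrustedGPUExecutor:
    """Untrusted side: holds transformed weights, sees only masked tensors."""

    def __init__(self, w_tilde):
        self.w_tilde = w_tilde

    def linear(self, x_tilde):
        return x_tilde @ self.w_tilde


rng = np.random.default_rng(1)
x = rng.standard_normal((4, 8))    # plaintext input, visible to the TEE only
w = rng.standard_normal((8, 16))   # plaintext weight, visible to the TEE only

tee = SimulatedTEE(d_in=8, d_out=16)
gpu = UntrustedGPUExecutor(tee.transform_weight(w))
y = tee.deobfuscate(gpu.linear(tee.obfuscate(x)))
```

In this sketch the executor object never receives `x`, `w`, or either mask, which mirrors the interface boundary described above. As the security notes say, that is an algebraic property only, not an isolation guarantee.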

Security Boundary Notes

The current version is a prototype for verifying algebraic correctness. It does not provide real security isolation, side-channel protection, memory isolation, authentication, or production-grade TEE guarantees. The interfaces are designed so that a real TEE can be integrated later.

Section 03

Technical Implementation Evolution: From Basic Linear Layers to Full GPT-2 Verification

The project was developed in stages:

  1. Stage1: Basic linear layer obfuscation verification;
  2. Stage1-LoRA: Extend the obfuscation mechanism to LoRA adapters;
  3. Stage2: Implement a complete Transformer block (including attention, residual connections, etc.);
  4. Stage3: Support Prefill/Decode and KV Cache;
  5. Stage4.x: Integrate HuggingFace and verify each module of GPT-2;
  6. Stage5.0: Experimental verification (attention probes, workload analyzer).
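
For the Stage1-LoRA step, one way the linear-layer masking could carry over to a LoRA adapter is sketched below in NumPy. The construction is an assumption made for illustration: the source does not spell out how adapters are transformed, the extra rank-space mask `n_r` is hypothetical, and the usual LoRA scaling factor is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out, rank = 4, 8, 6, 2


def random_mask(n):
    # Diagonal shift keeps the mask well-conditioned (assumption of this sketch).
    return rng.standard_normal((n, n)) + 3.0 * np.sqrt(n) * np.eye(n)


x = rng.standard_normal((batch, d_in))
w = rng.standard_normal((d_in, d_out))   # frozen base weight
a = rng.standard_normal((d_in, rank))    # LoRA down-projection
b = rng.standard_normal((rank, d_out))   # LoRA up-projection

n_in, n_out, n_r = random_mask(d_in), random_mask(d_out), random_mask(rank)
n_in_inv = np.linalg.inv(n_in)

# Trusted side: transform the base weight and both adapter factors.
w_tilde = n_in_inv @ w @ n_out
a_tilde = n_in_inv @ a @ n_r                # hypothetical adapter transform
b_tilde = np.linalg.inv(n_r) @ b @ n_out    # n_r cancels between A and B

# Untrusted side: works on masked input and transformed tensors only.
x_tilde = x @ n_in
y_tilde = x_tilde @ w_tilde + (x_tilde @ a_tilde) @ b_tilde

# Trusted side: remove the output mask.
y = y_tilde @ np.linalg.inv(n_out)
```

Because the base path and the adapter path both end in the same output mask, a single deobfuscation recovers X·W + X·A·B.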

Section 04

Key Technical Details: Mask Padding Mechanism and Attention/KV Cache Management

Mask and Padding Mechanism

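  • Mask Mode: X_tilde = X·N_in, W_tilde = N_in⁻¹·W·N_out, Y_tilde = X_tilde·W_tilde, Y = Y_tilde·N_out⁻¹;
  • Padding Mode: Introduce a compensation tensor T, with X_tilde = (X - T)·N_in and Y_tilde = X_tilde·W_tilde + C_T, where C_T = T·W·N_out. The output is still recovered as Y = Y_tilde·N_out⁻¹.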
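
Both modes can be checked numerically in a few lines. The following NumPy sketch follows the formulas above term by term; the shapes, the shape of T, and the mask sampling rule are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out = 4, 8, 6


def random_mask(n):
    # Diagonal shift keeps the mask well-conditioned (assumption of this sketch).
    return rng.standard_normal((n, n)) + 3.0 * np.sqrt(n) * np.eye(n)


x = rng.standard_normal((batch, d_in))
w = rng.standard_normal((d_in, d_out))
n_in, n_out = random_mask(d_in), random_mask(d_out)
n_out_inv = np.linalg.inv(n_out)

# Trusted side: W_tilde = N_in^{-1} · W · N_out
w_tilde = np.linalg.inv(n_in) @ w @ n_out

# Mask mode
x_tilde = x @ n_in                     # X_tilde = X · N_in
y_tilde = x_tilde @ w_tilde            # computed on the untrusted side
y_mask = y_tilde @ n_out_inv           # Y = Y_tilde · N_out^{-1}

# Padding mode
t = rng.standard_normal((batch, d_in))       # T, assumed to have the shape of X
c_t = t @ w @ n_out                          # C_T = T · W · N_out
x_tilde_pad = (x - t) @ n_in                 # X_tilde = (X - T) · N_in
y_tilde_pad = x_tilde_pad @ w_tilde + c_t    # Y_tilde = X_tilde · W_tilde + C_T
y_pad = y_tilde_pad @ n_out_inv
```

The padding term cancels because (X - T)·W·N_out + T·W·N_out = X·W·N_out, so both modes yield the same masked output and the same plaintext result.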

Attention Mask Propagation

Constrain N_Q·N_Kᵀ = I. Then Q_tilde·K_tildeᵀ = Q·N_Q·N_Kᵀ·Kᵀ = Q·Kᵀ, so the attention scores computed from the masked Q and K equal the plaintext scores.
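
A quick numerical check of this constraint, as a NumPy sketch. Choosing N_K = (N_Q⁻¹)ᵀ is one way to satisfy N_Q·N_Kᵀ = I; how the project samples the pair is not stated, so that choice and the shapes below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_head = 5, 8

q = rng.standard_normal((seq_len, d_head))
k = rng.standard_normal((seq_len, d_head))

# Sample N_Q (the diagonal shift keeps it well-conditioned in this sketch),
# then derive N_K so that N_Q · N_K^T = I.
n_q = rng.standard_normal((d_head, d_head)) + 3.0 * np.sqrt(d_head) * np.eye(d_head)
n_k = np.linalg.inv(n_q).T

q_tilde = q @ n_q
k_tilde = k @ n_k

scores_plain = q @ k.T
scores_masked = q_tilde @ k_tilde.T   # equals Q · N_Q · N_K^T · K^T
```

Note that Q_tilde and K_tilde individually differ from Q and K; only their product is invariant.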

KV Cache Management

Each head maintains its own N_K and N_V. Prefill samples the masks and Decode reuses them, so every cached entry satisfies K_tilde = K·N_K and V_tilde = V·N_V under the same masks.
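
The cache behaviour described above can be sketched as follows. `MaskedKVCache` and its methods are hypothetical names for this NumPy example, and the sketch does not take a position on which domain runs the softmax; it only checks that reusing the per-head masks across Prefill and Decode keeps the result equal to plaintext attention.

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, d_head = 2, 4


def random_mask(n):
    # Diagonal shift keeps the mask well-conditioned (assumption of this sketch).
    return rng.standard_normal((n, n)) + 3.0 * np.sqrt(n) * np.eye(n)


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


class MaskedKVCache:
    """Hypothetical per-head cache of K_tilde = K·N_K and V_tilde = V·N_V."""

    def __init__(self):
        # Masks are sampled once, when the cache is created for Prefill.
        self.n_k = [random_mask(d_head) for _ in range(n_heads)]
        self.n_v = [random_mask(d_head) for _ in range(n_heads)]
        self.k_tilde = [np.zeros((0, d_head)) for _ in range(n_heads)]
        self.v_tilde = [np.zeros((0, d_head)) for _ in range(n_heads)]

    def append(self, head, k, v):
        # Prefill and Decode both go through here and reuse the same masks.
        self.k_tilde[head] = np.vstack([self.k_tilde[head], k @ self.n_k[head]])
        self.v_tilde[head] = np.vstack([self.v_tilde[head], v @ self.n_v[head]])

    def attend(self, head, q):
        # N_Q = (N_K^{-1})^T satisfies N_Q · N_K^T = I.
        n_q = np.linalg.inv(self.n_k[head]).T
        scores = (q @ n_q) @ self.k_tilde[head].T / np.sqrt(d_head)
        out_tilde = softmax(scores) @ self.v_tilde[head]
        return out_tilde @ np.linalg.inv(self.n_v[head])   # remove N_V


k_all = rng.standard_normal((4, d_head))   # 3 prefill tokens + 1 decode token
v_all = rng.standard_normal((4, d_head))
q_new = rng.standard_normal((1, d_head))   # query of the decode token

cache = MaskedKVCache()
cache.append(0, k_all[:3], v_all[:3])      # Prefill
cache.append(0, k_all[3:], v_all[3:])      # Decode step
out = cache.attend(0, q_new)
expected = softmax(q_new @ k_all.T / np.sqrt(d_head)) @ v_all
```

If Decode sampled fresh masks instead, the older cache rows would be masked differently from the new ones and the score invariance would no longer hold for the whole cache.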

Section 05

Experiments and Verification: Correctness and Performance Analysis

Experiment Scripts

  • run_experiment_summary.py: Re-runs the verification for each stage and generates summary results (JSON/CSV/MD);
  • run_attention_experiments.py: Sweeps parameters such as batch_size and seq_len to verify attention invariance.
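
The kind of sweep that an attention invariance experiment performs might look like the NumPy sketch below. It is not the interface of run_attention_experiments.py: the function name, the parameter grid, and the result fields are all hypothetical.

```python
import itertools

import numpy as np


def max_score_error(batch_size, seq_len, d_head=16, seed=0):
    """Largest absolute gap between plaintext and masked attention scores."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((batch_size, seq_len, d_head))
    k = rng.standard_normal((batch_size, seq_len, d_head))

    # N_Q with a diagonal shift for conditioning, N_K chosen so N_Q · N_K^T = I.
    n_q = rng.standard_normal((d_head, d_head)) + 3.0 * np.sqrt(d_head) * np.eye(d_head)
    n_k = np.linalg.inv(n_q).T

    plain = q @ k.transpose(0, 2, 1)
    masked = (q @ n_q) @ (k @ n_k).transpose(0, 2, 1)
    return float(np.abs(plain - masked).max())


# Hypothetical grid; one result row per (batch_size, seq_len) combination.
results = [
    {"batch_size": b, "seq_len": s, "max_abs_error": max_score_error(b, s)}
    for b, s in itertools.product([1, 2, 4], [8, 32, 128])
]
```

Each row records only a floating-point round-off gap, which is what "invariance" means in practice for a numerical check like this.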

Workload Analysis

The workload analyzer compares TEE/GPU cost models for five execution strategies: plain_hf_gpu, tslp_trusted_nonlinear_baseline, ours_current, ours_ideal_gpu_nonlinear, and amulet_style_reference.

Section 06

Current Limitations and Disclaimer

Engineering Simplifications

The prototype adopts simplifications such as trusted LayerNorm and trusted GELU, meaning these operations run in the trusted domain. It does not implement full obfuscation and instead prioritizes end-to-end correctness verification.
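
A short NumPy sketch shows why a nonlinearity such as GELU is the awkward case: an elementwise function does not commute with a dense mask, so applying it directly to masked activations gives the wrong result. The "unmask, apply, re-mask" flow below is an assumption about what a trusted GELU step might look like, not the project's code.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


rng = np.random.default_rng(0)
d = 8


def random_mask(n):
    # Diagonal shift keeps the mask well-conditioned (assumption of this sketch).
    return rng.standard_normal((n, n)) + 3.0 * np.sqrt(n) * np.eye(n)


h = rng.standard_normal((4, d))     # plaintext activations
n = random_mask(d)
n_inv = np.linalg.inv(n)
h_tilde = h @ n                     # what the untrusted side holds

# Naive: GELU on masked activations, then unmask. This is not GELU(h).
naive = gelu(h_tilde) @ n_inv

# Trusted step (assumed flow): unmask in the TEE, apply GELU, re-mask.
n_next = random_mask(d)
out_tilde = gelu(h_tilde @ n_inv) @ n_next
recovered = out_tilde @ np.linalg.inv(n_next)
```

Linear layers avoid this problem because the mask can be folded into the weights, which is why the simplification applies to the nonlinear operations only.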

Security Statement

The project does not provide real TEE isolation, side-channel protection, memory isolation, or authentication mechanisms, and it is not a production-ready solution.

Research Nature

The project is meant for verifying algebraic correctness and exploring the feasibility of a collaborative TEE+GPU architecture, as a reference for future production solutions.

Section 07

Applicable Scenarios and Value

The project has reference value for the following fields:

  • Privacy computing research: Explore privacy-preserving schemes for TEE and GPU collaboration;
  • Large model secure deployment: Secure inference in untrusted environments;
  • Federated learning: Reference for the inference side of distributed privacy-preserving training;
  • Enterprise AI deployment: Scenarios where model parameters and user data need protection;
  • Academic writing: Supplies experimental data and technical details that can support papers.

Section 08

Summary: Significance of the Research Prototype

Privacy LLM Inference builds a privacy-preserving large model inference scheme based on mask obfuscation and verifies it stage by stage, from single linear layers up to GPT-2. Although it is a research prototype without real security guarantees, its explicit algebraic design and staged verification process make it a useful technical reference for work that combines privacy computing with large models.