Privacy LLM Inference: A Privacy-Preserving Large Model Inference Scheme Based on Mask Obfuscation

A PyTorch prototype project that explores privacy-preserving large model inference using masking and padding techniques, verifies the correctness of obfuscated execution of Transformer models in a simulated Trusted Execution Environment (TEE), and provides technical references for the integration of privacy computing and AI inference.

Privacy computing, TEE, large models, Transformer, mask obfuscation, GPT-2, PyTorch, secure inference, KV Cache, attention mechanism

Published 2026-06-03 09:41 | Recent activity 2026-06-03 09:56 | Estimated read 7 min

Section 01

[Introduction] Privacy LLM Inference: A Privacy-Preserving Large Model Inference Scheme Based on Mask Obfuscation

Privacy LLM Inference is a PyTorch prototype that explores privacy-preserving large model inference using masking and padding inside a simulated Trusted Execution Environment (TEE). It verifies that Transformer models still produce correct results when executed in obfuscated form, and it serves as a technical reference for combining privacy computing with AI inference. The goal is to protect input data and model parameters during large model inference on untrusted GPUs.

Section 02

Background and Core Concepts: Dual-Domain Execution Model and Security Boundaries

Dual-Domain Execution Model

Trusted Domain (SimulatedTEE): Can access plaintext input, generate masks and padding, manage LoRA adapters, generate compensation tensors, and deobfuscate output.
Untrusted Domain (UntrustedGPUExecutor): Processes only obfuscated input, transformed weights and adapters, compensation tensors, and obfuscated output; by design it should not be able to recover plaintext.
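The split between the two domains can be sketched for a single linear layer. This is a minimal illustration, not the project's code: the class names come from the text above, but every method name (`transform_weight`, `obfuscate_input`, `deobfuscate_output`, `linear`) is hypothetical, NumPy stands in for PyTorch tensors, and the mask sampler is an assumption. It uses the mask-mode relations given under Key Technical Details.

```python
import numpy as np


class SimulatedTEE:
    """Trusted domain: holds the secret masks (method names are hypothetical)."""

    def __init__(self, d_in, d_out, seed=0):
        self._rng = np.random.default_rng(seed)
        self._N_in = self._sample_mask(d_in)
        self._N_out = self._sample_mask(d_out)

    def _sample_mask(self, n):
        # Assumed sampler: orthogonal basis times positive column scales,
        # which is always invertible and well conditioned.
        basis, _ = np.linalg.qr(self._rng.standard_normal((n, n)))
        return basis * self._rng.uniform(0.5, 2.0, size=n)

    def transform_weight(self, W):
        # W_tilde = N_in^-1 . W . N_out
        return np.linalg.inv(self._N_in) @ W @ self._N_out

    def obfuscate_input(self, X):
        # X_tilde = X . N_in
        return X @ self._N_in

    def deobfuscate_output(self, Y_tilde):
        # Y = Y_tilde . N_out^-1
        return Y_tilde @ np.linalg.inv(self._N_out)


class UntrustedGPUExecutor:
    """Untrusted domain: receives only transformed weights and obfuscated inputs."""

    def __init__(self, W_tilde):
        self.W_tilde = W_tilde

    def linear(self, X_tilde):
        return X_tilde @ self.W_tilde


rng = np.random.default_rng(42)
X = rng.standard_normal((3, 8))   # plaintext input, visible to the TEE only
W = rng.standard_normal((8, 5))   # plaintext weight, visible to the TEE only

tee = SimulatedTEE(d_in=8, d_out=5)
gpu = UntrustedGPUExecutor(tee.transform_weight(W))
Y = tee.deobfuscate_output(gpu.linear(tee.obfuscate_input(X)))

roundtrip_ok = np.allclose(Y, X @ W)
print("obfuscated execution matches plaintext:", roundtrip_ok)
```

As the security notes below state, a Python class boundary like this only models the information flow; it provides no isolation.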

Security Boundary Notes

The current version is a prototype for algebraic correctness verification, and does not provide real security isolation, side-channel protection, memory isolation, authentication, or production-grade TEE guarantees. The interface design is prepared for subsequent integration with real TEEs.

Section 03

Technical Implementation Evolution: From Basic Linear Layers to Full GPT-2 Verification

The project iterates in phases:

Stage1: Basic linear layer obfuscation verification;
Stage1-LoRA: Extend the obfuscation mechanism to LoRA adapters;
Stage2: Implement a complete Transformer Block (including attention, residual connections, etc.);
Stage3: Support Prefill/Decode and KV Cache;
Stage4.x: Integrate HuggingFace and verify each module of GPT-2;
Stage5.0: Experimental verification (attention probes, workload analyzer).

Section 04

Key Technical Details: Mask Padding Mechanism and Attention/KV Cache Management

Mask and Padding Mechanism

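Mask Mode: X_tilde = X·N_in, W_tilde = N_in⁻¹·W·N_out, Y_tilde = X_tilde·W_tilde, Y = Y_tilde·N_out⁻¹;
Padding Mode: Introduce a compensation tensor T, with X_tilde = (X-T)·N_in and Y_tilde = X_tilde·W_tilde + C_T, where C_T = T·W·N_out. The compensation term restores the contribution of T, so Y_tilde = X·W·N_out and Y = Y_tilde·N_out⁻¹ as in Mask Mode.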
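Both modes can be checked numerically. The sketch below is an illustration under assumptions, not the project's implementation: NumPy stands in for PyTorch, the helper `random_mask` and all shapes are made up for the example, and the final deobfuscation step in padding mode (multiplying by N_out⁻¹) is derived from the algebra rather than quoted from the source.

```python
import numpy as np

rng = np.random.default_rng(0)


def random_mask(n, rng):
    """Hypothetical mask sampler: orthogonal basis times positive column scales."""
    basis, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return basis * rng.uniform(0.5, 2.0, size=n)


batch, d_in, d_out = 4, 8, 6
X = rng.standard_normal((batch, d_in))
W = rng.standard_normal((d_in, d_out))
N_in, N_out = random_mask(d_in, rng), random_mask(d_out, rng)
N_out_inv = np.linalg.inv(N_out)

# Weight transformation is shared by both modes (done once on the trusted side).
W_tilde = np.linalg.inv(N_in) @ W @ N_out

# Mask mode: X_tilde = X . N_in, Y = Y_tilde . N_out^-1
X_tilde = X @ N_in
Y_tilde = X_tilde @ W_tilde            # the only step run on the untrusted side
Y_mask = Y_tilde @ N_out_inv

# Padding mode: subtract T before masking, add C_T = T . W . N_out afterwards.
T = rng.standard_normal((batch, d_in))
X_tilde_pad = (X - T) @ N_in
C_T = T @ W @ N_out                    # compensation tensor from the trusted side
Y_tilde_pad = X_tilde_pad @ W_tilde + C_T
Y_pad = Y_tilde_pad @ N_out_inv

Y_ref = X @ W
mask_ok = np.allclose(Y_mask, Y_ref)
pad_ok = np.allclose(Y_pad, Y_ref)
print("mask mode matches plaintext:", mask_ok)
print("padding mode matches plaintext:", pad_ok)
```

In padding mode the untrusted side sees (X-T)·N_in rather than X·N_in, so the additive offset T hides the input in addition to the multiplicative mask.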

Attention Mask Propagation

Constrain N_Q·N_Kᵀ = I so that Q_tilde·K_tildeᵀ = Q·N_Q·N_Kᵀ·Kᵀ = Q·Kᵀ. The attention scores computed from the obfuscated tensors therefore equal the plaintext scores.
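One way to satisfy the constraint is to sample N_Q and set N_K to the transposed inverse of N_Q. The sketch below checks this for a single head; it assumes Q_tilde = Q·N_Q and K_tilde = K·N_K, uses NumPy in place of PyTorch, and the sampling recipe is an assumption rather than the project's method.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_head = 5, 8
Q = rng.standard_normal((seq_len, d_head))
K = rng.standard_normal((seq_len, d_head))

# Assumed sampler: orthogonal basis times positive column scales (always invertible).
basis, _ = np.linalg.qr(rng.standard_normal((d_head, d_head)))
N_Q = basis * rng.uniform(0.5, 2.0, size=d_head)

# N_Q . N_K^T = I  implies  N_K = (N_Q^-1)^T
N_K = np.linalg.inv(N_Q).T

Q_tilde = Q @ N_Q
K_tilde = K @ N_K

scores_plain = Q @ K.T
scores_obf = Q_tilde @ K_tilde.T

constraint_ok = np.allclose(N_Q @ N_K.T, np.eye(d_head))
scores_ok = np.allclose(scores_obf, scores_plain)
print("constraint holds:", constraint_ok)
print("obfuscated scores equal plaintext scores:", scores_ok)
```

Because the scores match exactly, softmax and causal masking can be applied to them unchanged.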

KV Cache Management

Each head maintains its own N_K and N_V. Prefill samples the masks and Decode reuses them, so every cache entry satisfies the invariants K_tilde = K·N_K and V_tilde = V·N_V.
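The cache invariant can be illustrated with a toy prefill followed by two decode steps. This is a sketch under assumptions: the tensor layout (heads, tokens, head dimension), the helper names `sample_mask` and `obfuscate`, and the use of NumPy instead of PyTorch are all choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n_heads, d_head = 2, 4


def sample_mask(n, rng):
    """Hypothetical sampler: orthogonal basis times positive column scales."""
    basis, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return basis * rng.uniform(0.5, 2.0, size=n)


def obfuscate(x, masks):
    """Apply each head's own mask; x has shape (n_heads, tokens, d_head)."""
    return np.stack([x[h] @ masks[h] for h in range(len(masks))])


# Prefill: masks are sampled once per head and kept on the trusted side.
N_K = [sample_mask(d_head, rng) for _ in range(n_heads)]
N_V = [sample_mask(d_head, rng) for _ in range(n_heads)]

K_all = rng.standard_normal((n_heads, 3, d_head))   # plaintext keys (reference only)
V_all = rng.standard_normal((n_heads, 3, d_head))   # plaintext values (reference only)
cache_K = obfuscate(K_all, N_K)                     # what the untrusted side stores
cache_V = obfuscate(V_all, N_V)

# Decode: one new token per step, obfuscated with the SAME masks as prefill.
for _ in range(2):
    k_new = rng.standard_normal((n_heads, 1, d_head))
    v_new = rng.standard_normal((n_heads, 1, d_head))
    cache_K = np.concatenate([cache_K, obfuscate(k_new, N_K)], axis=1)
    cache_V = np.concatenate([cache_V, obfuscate(v_new, N_V)], axis=1)
    K_all = np.concatenate([K_all, k_new], axis=1)
    V_all = np.concatenate([V_all, v_new], axis=1)

# Invariant: the whole cache equals the plaintext K/V under the per-head masks.
cache_consistent = bool(
    np.allclose(cache_K, obfuscate(K_all, N_K))
    and np.allclose(cache_V, obfuscate(V_all, N_V))
)
print("cache invariant holds after decode:", cache_consistent)
```

If Decode resampled the masks, old and new cache rows would live under different masks and the invariant would no longer hold for the cache as a whole.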

Section 05

Experiments and Verification: Correctness and Performance Analysis

Experiment Scripts

run_experiment_summary.py: Re-runs the verification for each stage and generates summary results (JSON/CSV/MD);
run_attention_experiments.py: Sweeps parameters such as batch_size and seq_len to verify attention invariance.

Workload Analysis

Compare TEE/GPU cost models of five execution strategies: plain_hf_gpu, tslp_trusted_nonlinear_baseline, ours_current, ours_ideal_gpu_nonlinear, amulet_style_reference.

Section 06

Current Limitations and Disclaimer

Engineering Simplifications

The prototype adopts simplifications such as trusted LayerNorm and trusted GELU, meaning these nonlinear operations are handled in the trusted domain. It does not implement full obfuscation and prioritizes end-to-end correctness verification.
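A short numerical check suggests why nonlinear operations are a natural candidate for the trusted side: an elementwise function such as GELU does not commute with a dense mask, so it cannot simply be applied to X·N. The sketch below is illustrative only. It uses the tanh approximation of GELU, NumPy in place of PyTorch, and an assumed "unmask, apply, remask" flow that is not confirmed by the source.

```python
import numpy as np


def gelu(x):
    """Tanh approximation of GELU, applied elementwise."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


rng = np.random.default_rng(3)
X = rng.standard_normal((4, 6))

# Assumed mask sampler: orthogonal basis times positive column scales.
basis, _ = np.linalg.qr(rng.standard_normal((6, 6)))
N = basis * rng.uniform(0.5, 2.0, size=6)

target = gelu(X) @ N        # obfuscated form of the correct result
naive = gelu(X @ N)         # applying GELU directly to the obfuscated tensor

# Expected to be False for a generic dense mask: GELU mixes badly with N.
commutes = bool(np.allclose(naive, target))

# Assumed trusted fallback: unmask, apply GELU on plaintext, mask again.
X_tilde = X @ N
trusted = gelu(X_tilde @ np.linalg.inv(N)) @ N
trusted_ok = bool(np.allclose(trusted, target))

print("GELU commutes with the mask:", commutes)
print("trusted fallback matches:", trusted_ok)
```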

Security Statement

The prototype does not provide real TEE isolation, side-channel protection, memory isolation, or authentication mechanisms, and it is not a production-ready solution.

Research Nature

The project is used to verify algebraic correctness and to explore the feasibility of a TEE+GPU collaborative architecture, providing a reference for production solutions.

Section 07

Applicable Scenarios and Value

The project has reference value for the following fields:

Privacy computing research: Explore privacy-preserving schemes for TEE and GPU collaboration;
Large model secure deployment: Secure inference in untrusted environments;
Federated learning: Reference for the inference side of distributed privacy-preserving training;
Enterprise AI deployment: Scenarios where model parameters and user data need protection;
Academic writing: Experimental data and technical details that can support papers.

Section 08

Summary: Significance of the Research Prototype

Privacy LLM Inference builds a mask-obfuscation scheme for privacy-preserving large model inference and verifies it stage by stage, from single linear layers to the modules of GPT-2. It remains a research prototype, but its algebraic design and its verification scripts offer a technical reference for work that combines privacy computing with large models.
