
SharedRequest: A Batch-Level Privacy-Preserving Inference Framework with 5x Cost Reduction

SharedRequest reduces query costs by up to 5x while protecting user prompt privacy through batch-level privacy preservation and semantic instruction grouping, with no changes to model architectures.

Tags: Privacy Protection, Model Inference, Differential Privacy, Batch Processing, LLM Security

Published 2026-06-03 23:23 · Recent activity 2026-06-04 13:20 · Estimated read 7 min


Section 01

SharedRequest: Core Guide to the Batch-Level Privacy-Preserving Inference Framework

SharedRequest is a batch-level privacy-preserving inference framework. It cuts query costs by up to 5x while protecting user prompt privacy, using semantic instruction grouping and batch-level privacy mechanisms without modifying model architectures. Its core idea is to shift privacy protection from the single-prompt level to the batch level, balancing privacy, utility, and efficiency. It applies to a range of LLMs, including closed-source APIs and open-source models.

Section 02

Problem Background: The Dilemma of Privacy-Preserving Inference

As public LLMs such as ChatGPT see widespread use, the risk of leaking private information through user prompts has grown. Existing privacy-preserving inference methods each carry a significant cost: differential privacy adds noise and sacrifices output utility; homomorphic encryption and secure multi-party computation incur heavy computational overhead; model-specific solutions require architecture modifications and do not generalize. The open challenge is to protect privacy while remaining efficient and general and without degrading output quality.

Section 03

Core Methods of SharedRequest: Batch-Level Privacy Preservation and Semantic Grouping

Core Idea

Shift privacy protection from the single-prompt level to the batch level, amortize costs by grouping semantically equivalent instructions, and obfuscate sensitive information by mixing in noise variants.

Technical Mechanisms

Semantic Instruction Grouping: Identify semantically similar queries and group them together, sharing instruction templates;
Noise Mixing Obfuscation: Generate multiple noise variants and mix them with the original prompt to protect the real content;
Batch Amortized Inference: Process the shared instruction part in batches, efficiently deliver personalized content, and linearly amortize costs.
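
The three mechanisms above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the framework's actual implementation: the function names (group_by_instruction, mix_with_decoys, build_batch_prompt) are hypothetical, whitespace and case normalization stands in for real semantic matching, and the decoy texts are supplied by the caller rather than generated.

```python
import random
from collections import defaultdict

def group_by_instruction(requests):
    """Group (instruction, payload) pairs under a normalized instruction key.
    Assumption: lower-casing and whitespace collapsing stands in for semantic matching."""
    groups = defaultdict(list)
    for instruction, payload in requests:
        key = " ".join(instruction.lower().split())
        groups[key].append(payload)
    return dict(groups)

def mix_with_decoys(payloads, decoys, rng):
    """Shuffle real payloads together with decoys.
    Returns the mixed items and a map from each real payload's index to its position."""
    tagged = [(p, i) for i, p in enumerate(payloads)] + [(d, None) for d in decoys]
    rng.shuffle(tagged)
    items = [text for text, _ in tagged]
    position_of = {src: pos for pos, (_, src) in enumerate(tagged) if src is not None}
    return items, position_of

def build_batch_prompt(instruction, items):
    """Send the shared instruction once, followed by every numbered item."""
    return "\n".join([instruction] + [f"[{i}] {item}" for i, item in enumerate(items)])

requests = [
    ("Summarize the text.", "Patient A reports chest pain."),
    ("summarize  the text.", "Quarterly revenue rose."),
    ("Translate to French.", "Good morning."),
]
groups = group_by_instruction(requests)
payloads = groups["summarize the text."]
items, position_of = mix_with_decoys(payloads, ["The weather was mild."], random.Random(0))
prompt = build_batch_prompt("Summarize the text.", items)
```

Only the caller holds position_of, so an observer of the batched prompt sees real and decoy items side by side without knowing which is which.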

Model Agnosticism

No need to access model parameters or modify architectures; it runs as a black-box API wrapper layer, can be seamlessly integrated into existing workflows, and is applicable to closed-source APIs, open-source hosting services, and privately deployed models.
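
One way to picture the black-box wrapper layer is a class that only ever touches the model through a text-in, text-out callable. The sketch below is an assumption for illustration: BatchWrapper, its run method, the "[i] text" numbering convention, and fake_backend are all hypothetical names, and fake_backend merely upper-cases its input in place of a real LLM call.

```python
from typing import Callable, Dict, List

class BatchWrapper:
    """Hypothetical wrapper: the model is reached only through `send`."""

    def __init__(self, send: Callable[[str], str]) -> None:
        self.send = send  # any text-in, text-out backend

    def run(self, instruction: str, items: List[str], keep: List[int]) -> List[str]:
        # One request carries the shared instruction plus every numbered item.
        prompt = "\n".join([instruction] + [f"[{i}] {t}" for i, t in enumerate(items)])
        reply = self.send(prompt)
        answers: Dict[int, str] = {}
        for line in reply.splitlines():
            if line.startswith("[") and "] " in line:
                idx, text = line[1:].split("] ", 1)
                if idx.isdigit():
                    answers[int(idx)] = text
        # Discard answers for decoy items; return only the requested positions.
        return [answers[i] for i in keep]

def fake_backend(prompt: str) -> str:
    """Stand-in for a real LLM call: upper-cases each numbered item."""
    out = []
    for line in prompt.splitlines()[1:]:
        idx, text = line[1:].split("] ", 1)
        out.append(f"[{idx}] {text.upper()}")
    return "\n".join(out)

wrapper = BatchWrapper(fake_backend)
results = wrapper.run("Rewrite in upper case.", ["real one", "decoy", "real two"], keep=[0, 2])
```

Because the wrapper depends only on the callable's signature, swapping fake_backend for a client of a hosted API or a local model would not change the wrapper itself.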

Section 04

Experimental Results: Win-Win Verification of Privacy and Efficiency

Utility Improvement

Compared to traditional differential privacy baselines, output quality is improved by over 20%, and semantic coherence is close to the unprotected baseline.

Cost Reduction

Query costs are reduced by up to 5x (significant in large-batch scenarios), latency is optimized (reducing network round trips), and throughput is improved.
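
To see why savings grow with batch size, consider a simple input-token cost model. This is an illustrative assumption, not the paper's cost accounting: it charges the shared instruction once per batch and every item (real or decoy) once, ignores output tokens, and uses made-up token counts.

```python
def per_query_cost(instruction_tokens, payload_tokens, batch_size, decoys_per_batch=0):
    """Input tokens charged per real query when the instruction is sent once per batch."""
    total = instruction_tokens + (batch_size + decoys_per_batch) * payload_tokens
    return total / batch_size

# Illustrative numbers only: a 400-token instruction and 100-token payloads.
baseline = per_query_cost(400, 100, batch_size=1)                      # one request per query
batched = per_query_cost(400, 100, batch_size=16, decoys_per_batch=4)  # shared instruction plus decoys
ratio = baseline / batched
```

Under this model the ratio rises as the instruction grows relative to the payload and as the batch gets larger, while each added decoy pulls it back down; that is the privacy-versus-cost trade-off in miniature.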

Privacy Strength

The noise mixing mechanism effectively defends against external eavesdroppers, supports privacy-utility trade-offs, and conforms to the differential privacy theoretical framework.

Section 05

Application Scenarios and Deployment Considerations

Applicable Scenarios

Enterprise-level API proxies (privacy-protected access for internal employees);
Privacy-sensitive industries such as healthcare, finance, and law;
High-concurrency public interfaces;
General solutions for multi-cloud deployment.

Deployment Recommendations

Tune batch size (balance latency and cost);
Optimize domain-specific semantic grouping strategies;
Calibrate noise intensity (match privacy requirements);
Establish privacy protection effect monitoring and auditing mechanisms.
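
The batch-size recommendation can be explored offline with a small simulation. The sketch below is a hypothetical helper, not part of SharedRequest: plan_batches closes a batch when it reaches max_size or when a new arrival finds the oldest queued request has waited longer than max_wait seconds. A production system would use a timer rather than waiting for the next arrival.

```python
def plan_batches(arrival_times, max_size, max_wait):
    """Split sorted arrival timestamps (seconds) into batches.
    A batch is closed when it is full or its oldest request has waited too long."""
    batches, current = [], []
    for t in arrival_times:
        if current and (len(current) >= max_size or t - current[0] > max_wait):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches

arrivals = [0.0, 0.1, 0.2, 0.3, 1.5, 1.6, 4.0]
batches = plan_batches(arrivals, max_size=3, max_wait=1.0)
sizes = [len(b) for b in batches]
```

Replaying real traffic traces through such a function with different max_size and max_wait values gives a rough picture of how many requests share each batch and how long the earliest request in a batch waits.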

Section 06

Limitations and Future Research Directions

Current Limitations

Batch processing may introduce latency that affects real-time applications;
The accuracy of semantic grouping for complex queries needs improvement;
Defenses against advanced adversarial attacks that target specific patterns are still needed.

Future Directions

Adaptive batch strategy (dynamically adjust size);
Hierarchical privacy protection (differentiated processing of sensitive content);
Integration with federated learning;
Hardware acceleration to improve batch processing efficiency.

Section 07

Conclusion: The Value and Significance of SharedRequest

SharedRequest is an important advance in privacy-preserving LLM inference: batch-level privacy protection lets it balance privacy, utility, and efficiency. Because it is model-agnostic, efficient, and practical, it offers a technical path worth considering for organizations that need to deploy LLMs at scale while meeting privacy compliance requirements.
