Zing Forum


Yomotsusaka: A Privacy Data Firewall Solution for Agent Workflows

Introducing the Yomotsusaka project, a privacy data firewall designed for agent workflows. It uses open-source large language models for batch preprocessing to desensitize private documents into searchable lists and controlled keys.

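Tags: privacy protection, data desensitization, agent workflows, local LLM, open-source models, document processing, PII protection, data security
Published 2026-05-22 23:45 | Recent activity 2026-05-22 23:52 | Estimated read 8 min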

Section 01

Introduction: Yomotsusaka - Privacy Data Firewall for Agent Workflows

Yomotsusaka is a privacy data firewall project designed specifically for agent workflows. Its core idea is to use locally run open-source large language models to batch-preprocess private documents, desensitizing the originals into "searchable lists" and "controlled keys". This protects sensitive data while preserving the documents' value for agent retrieval and analysis, so that privacy does not have to be traded away to use AI capabilities.


Section 02

Privacy Challenges in the Agent Era

As LLM-driven agent systems become widespread, protecting sensitive data while still using AI capabilities has become a key issue. Uploading original documents directly to cloud LLMs carries risks such as data leakage, compliance pressure (e.g., GDPR, HIPAA), and blurred trust boundaries. Traditional desensitization methods either remove too much information (rendering documents useless) or retain risky information, making it difficult to balance privacy and usability.


Section 03

Yomotsusaka Project Overview and Design Philosophy

Yomotsusaka (黄泉坂) is a privacy data firewall for agent workflows, built around the idea of using local open-source LLMs to preprocess private documents. Its design philosophy includes: local-first (sensitive data preprocessing is executed locally), layered desensitization (different strategies based on sensitivity), verifiability (a transparent and auditable desensitization process), and a "best-effort" strategy that balances practicality and security.


Section 04

Core Mechanisms: Document Desensitization and Controlled Key System

Document Desensitization Process

  1. Entity Recognition and Classification: Locally-run open-source LLMs identify PII (names, ID numbers, etc.), organization-sensitive information, and domain-specific content;
  2. Entity Replacement and Reference: Sensitive entities are replaced with desensitized identifiers (e.g., [PERSON_1]), establishing a controlled key mapping of "original value → desensitized identifier → access control policy";
  3. Searchable List Generation: Desensitized documents are converted into structured lists containing semantic summaries, key topics, relationship graphs, and timelines, preserving retrieval and analysis value.
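
The replacement step (step 2) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual code: the regex patterns stand in for the local LLM's entity recognition, and the names `KeyStore`, `placeholder_for`, and `desensitize` are hypothetical. The access control policy part of the mapping is left out here.

```python
import re
from dataclasses import dataclass, field

# Assumption: two regexes stand in for the local LLM's entity recognition.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b\d{3}-\d{4}\b"),
}


@dataclass
class KeyStore:
    """Hypothetical local mapping between original values and placeholders."""
    forward: dict = field(default_factory=dict)   # (label, original) -> placeholder
    reverse: dict = field(default_factory=dict)   # placeholder -> original
    counters: dict = field(default_factory=dict)  # label -> last index used

    def placeholder_for(self, label: str, value: str) -> str:
        key = (label, value)
        if key not in self.forward:
            index = self.counters.get(label, 0) + 1
            self.counters[label] = index
            token = f"[{label}_{index}]"
            self.forward[key] = token
            self.reverse[token] = value
        # The same original value always maps to the same placeholder,
        # so references stay consistent across the document.
        return self.forward[key]


def desensitize(text: str, store: KeyStore) -> str:
    """Replace every detected entity with its placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(
            lambda m, label=label: store.placeholder_for(label, m.group(0)),
            text,
        )
    return text


store = KeyStore()
print(desensitize("Mail alice@example.com or call 555-0100.", store))
```

Only the desensitized text would leave the machine; `store.reverse` stays local and is what the controlled key system protects.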

Controlled Key System

  • Key Hierarchy: Keys are classified as public, internal, confidential, or top-secret based on sensitivity;
  • Access Control: Combines role-based (RBAC) and attribute-based (ABAC) access control, so only authorized parties can resolve a key;
  • Audit Logs: Every key access operation is recorded;
  • Key Rotation: Mappings are updated regularly to limit the impact of a leak.
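
To make the four properties above concrete, here is a rough sketch of how they could fit together. Everything in it is an assumption for illustration: the `KeyVault` class, the role names in `ROLE_CLEARANCE`, and the use of a department attribute for the ABAC check are hypothetical and not taken from the project.

```python
import secrets
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import IntEnum


class Level(IntEnum):
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    TOP_SECRET = 3


# RBAC part: highest level each (hypothetical) role may resolve.
ROLE_CLEARANCE = {
    "analyst": Level.INTERNAL,
    "compliance": Level.CONFIDENTIAL,
    "admin": Level.TOP_SECRET,
}


@dataclass
class KeyEntry:
    original: str
    level: Level
    department: str  # attribute used for the ABAC check


@dataclass
class KeyVault:
    entries: dict = field(default_factory=dict)    # token -> KeyEntry
    audit_log: list = field(default_factory=list)

    def resolve(self, token: str, user: str, role: str, department: str):
        """Return the original value if allowed, else None; always log."""
        entry = self.entries.get(token)
        allowed = (
            entry is not None
            and ROLE_CLEARANCE.get(role, Level.PUBLIC) >= entry.level  # RBAC
            and (entry.level == Level.PUBLIC
                 or entry.department == department)                    # ABAC
        )
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "token": token,
            "allowed": allowed,
        })
        return entry.original if allowed else None

    def rotate(self) -> dict:
        """Re-issue every token; returns old -> new for rewriting documents."""
        renames = {old: f"[KEY_{secrets.token_hex(4)}]" for old in self.entries}
        self.entries = {renames[old]: e for old, e in self.entries.items()}
        return renames
```

Denied requests are logged as well as granted ones, since failed lookups are often the more interesting audit signal.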

Section 05

Application Scenarios: Multi-Domain Privacy Protection Practices

Yomotsusaka can be applied in multiple domains:

  • Enterprise Knowledge Base Retrieval: Internal documents are processed and stored in vector databases; employees can query them in natural language with a reduced risk of leaking sensitive information;
  • Medical Document Analysis: Once medical records are desensitized, AI can assist doctors with diagnostic reference and drug interaction checks;
  • Legal Document Review: Contracts and case materials are processed so that AI can aid in clause analysis and risk assessment;
  • Financial Compliance Review: Transaction records and customer data are processed so that AI can perform abnormal transaction detection and anti-money laundering analysis.

Section 06

Key Technical Implementation Points

Local Model Selection

Use open-source models for local inference; common choices: Llama series (general-purpose), Mistral series (excellent performance), Phi series (small and efficient). Selection criteria: model capability, inference efficiency, open-source license.

Batch Processing Architecture

  • Document Chunking: Split large documents into model-processable fragments;
  • Parallel Processing: Multi-core CPU/GPU parallel processing of multiple documents;
  • Incremental Update: Support incremental processing of new documents without reprocessing the entire library.
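
The three points above can be sketched together as follows. This is an illustration under assumptions rather than the project's pipeline: `process_chunk` stands in for the local model call, a thread pool stands in for multi-core or GPU parallelism, and change detection by content hash is one possible way to implement incremental updates. The function names and default sizes are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def chunk_text(text: str, max_chars: int = 2000, overlap: int = 200) -> list:
    """Split text into overlapping fragments of at most max_chars characters."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    if not text:
        return []
    chunks, start, step = [], 0, max_chars - overlap
    while True:
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break
        start += step
    return chunks


def process_library(docs: dict, seen: dict, process_chunk) -> dict:
    """Process only new or changed documents.

    docs: doc_id -> text
    seen: doc_id -> content hash from earlier runs (updated in place)
    process_chunk: callable applied to each fragment (model call stand-in)
    """
    hashes = {
        doc_id: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for doc_id, text in docs.items()
    }
    changed = {d: t for d, t in docs.items() if seen.get(d) != hashes[d]}

    def handle(item):
        doc_id, text = item
        return doc_id, [process_chunk(c) for c in chunk_text(text)]

    # Documents are processed in parallel; chunks within one document in order.
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = dict(pool.map(handle, changed.items()))

    for doc_id in changed:
        seen[doc_id] = hashes[doc_id]
    return results
```

Persisting `seen` between runs (for example as a JSON file) would let a later run skip every document whose content has not changed.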

Agent Integration

Yomotsusaka can integrate with mainstream frameworks: LangChain (as a document loading or post-processing step), LlamaIndex (as a node converter), and custom agents (via API calls).
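
For the custom-agent case, a thin wrapper around the agent call is one way to apply the firewall. The sketch below is framework-agnostic and assumes the agent is any callable that takes a prompt string and returns a string; `guarded_call` and its parameters are hypothetical names, not an API of Yomotsusaka, LangChain, or LlamaIndex.

```python
import re
from typing import Callable

# Matches placeholders of the form [PERSON_1], [EMAIL_2], ...
TOKEN = re.compile(r"\[[A-Z]+_\d+\]")


def guarded_call(
    agent: Callable[[str], str],
    desensitized_prompt: str,
    mapping: dict,      # placeholder -> original value, kept locally
    authorized: bool,   # whether the caller may see original values
) -> str:
    """Send only desensitized text to the agent; restore values if allowed."""
    # Refuse to forward a prompt that still contains a known raw value.
    if any(value in desensitized_prompt for value in mapping.values()):
        raise ValueError("prompt still contains raw sensitive values")

    reply = agent(desensitized_prompt)
    if not authorized:
        return reply
    # Swap placeholders back locally, after the agent has answered.
    return TOKEN.sub(lambda m: mapping.get(m.group(0), m.group(0)), reply)
```

The agent only ever sees placeholders; restoring original values happens on the local side and only for callers allowed to see them.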


Section 07

Limitations and Future Development Directions

Limitations

  • "Best-effort" strategy: Cannot guarantee absolute privacy; attackers may recover information via side channels/statistical inference;
  • Model capability limitations: Local open-source models may have lower entity recognition accuracy than cloud models;
  • Performance overhead: Local inference requires sufficient computing resources; large-scale processing is time-consuming;
  • Complex key management: Increases architectural complexity; needs proper management.

Future Directions

  • Integrate differential privacy to add formal mathematical guarantees;
  • Combine federated learning to support distributed privacy training;
  • Use TEE (Trusted Execution Environment) to improve security;
  • Promote standardized interfaces for privacy-protected document processing.