# Yomotsusaka: A Privacy Data Firewall Solution for Agent Workflows

> Introducing the Yomotsusaka project, a privacy data firewall designed for agent workflows. It uses open-source large language models for batch preprocessing to desensitize private documents into searchable lists and controlled keys.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-22T15:45:34.000Z
- Last activity: 2026-05-22T15:52:09.953Z
- Popularity: 150.9
- Keywords: privacy protection, data desensitization, agent workflows, local LLM, open-source models, document processing, PII protection, data security
- Page link: https://www.zingnex.cn/en/forum/thread/yomotsusaka
- Canonical: https://www.zingnex.cn/forum/thread/yomotsusaka
- Markdown source: floors_fallback

---

## Introduction: Yomotsusaka - Privacy Data Firewall for Agent Workflows

Yomotsusaka is a privacy data firewall project designed specifically for agent workflows. Its core idea is to run open-source large language models locally to batch-preprocess private documents, desensitizing the originals into "searchable lists" and "controlled keys". Sensitive data stays protected while the documents remain useful for agent retrieval and analysis, so privacy does not have to be traded away to use AI capabilities.

## Privacy Challenges in the Agent Era

As LLM-driven agent systems become widespread, protecting sensitive data while still using AI capabilities has become a key issue. Uploading original documents directly to cloud LLMs carries risks such as data leakage, compliance pressure (e.g., GDPR, HIPAA), and blurred trust boundaries. Traditional desensitization methods either remove too much information (leaving the documents useless) or retain risky information, so they struggle to balance privacy and usability.

## Yomotsusaka Project Overview and Design Philosophy

Yomotsusaka (黄泉坂) is a privacy data firewall for agent workflows whose core concept is to preprocess private documents with local open-source LLMs. Its design philosophy rests on four points: local-first (sensitive data is preprocessed locally), layered desensitization (different strategies for different sensitivity levels), verifiability (a transparent and auditable desensitization process), and a "best-effort" strategy that balances practicality and security.

## Core Mechanisms: Document Desensitization and Controlled Key System

### Document Desensitization Process
1. **Entity Recognition and Classification**: Locally run open-source LLMs identify PII (names, ID numbers, etc.), sensitive organizational information, and domain-specific sensitive content;
2. **Entity Replacement and Reference**: Sensitive entities are replaced with desensitized identifiers (e.g., `[PERSON_1]`), and a controlled key mapping of "original value → desensitized identifier → access control policy" is recorded;
3. **Searchable List Generation**: Desensitized documents are converted into structured lists containing semantic summaries, key topics, relationship graphs, and timelines, preserving their retrieval and analysis value.
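
The replacement and list-generation steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the regex patterns stand in for the local LLM recognizer, and the names `KeyMap`, `desensitize`, and `to_list_entry` are hypothetical.

```python
import re
from dataclasses import dataclass, field

# Assumption: simple regexes stand in for the local LLM entity recognizer.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b\d{3}-\d{4}\b"),
}


@dataclass
class KeyMap:
    """Hypothetical mapping: original value -> placeholder -> access level."""
    forward: dict = field(default_factory=dict)   # original value -> placeholder
    policy: dict = field(default_factory=dict)    # placeholder -> access level
    counters: dict = field(default_factory=dict)  # label -> last index used

    def placeholder_for(self, label, value, level="confidential"):
        # Reuse the same placeholder when the same value appears again,
        # so references stay consistent across the document.
        if value not in self.forward:
            n = self.counters.get(label, 0) + 1
            self.counters[label] = n
            token = f"[{label}_{n}]"
            self.forward[value] = token
            self.policy[token] = level
        return self.forward[value]


def desensitize(text, key_map):
    """Replace every recognized entity with its placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(
            lambda m, label=label: key_map.placeholder_for(label, m.group(0)),
            text,
        )
    return text


def to_list_entry(doc_id, clean_text):
    """Build a minimal searchable entry that lists the placeholders used."""
    tokens = sorted(set(re.findall(r"\[[A-Z]+_\d+\]", clean_text)))
    return {"doc_id": doc_id, "text": clean_text, "entities": tokens}


key_map = KeyMap()
clean = desensitize(
    "Contact alice@example.com or call 555-0199. Later, alice@example.com replied.",
    key_map,
)
entry = to_list_entry("doc-1", clean)
```

In a real pipeline the entry would also carry the summaries, topics, relationships, and timelines described in step 3; they are omitted here because producing them requires the model itself.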

### Controlled Key System
- Key Hierarchy: Keys are classified as public, internal, confidential, or top-secret according to sensitivity;
- Access Control: Role-based (RBAC) and attribute-based (ABAC) access control are combined, so only authorized parties can resolve a key;
- Audit Logs: Every key access operation is recorded;
- Key Rotation: Mappings are updated regularly to limit the impact of a leak.
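
One way to picture how these four pieces fit together is the sketch below. It is an assumption-laden simplification: `KeyVault` and `Level` are hypothetical names, and a single clearance comparison stands in for the combined RBAC+ABAC policy described above.

```python
import secrets
import time
from enum import IntEnum


class Level(IntEnum):
    """The four sensitivity levels, ordered so they can be compared."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    TOP_SECRET = 3


class KeyVault:
    """Hypothetical store for placeholder -> (original value, level)."""

    def __init__(self):
        self._store = {}
        self.audit_log = []

    def put(self, token, value, level):
        self._store[token] = (value, level)

    def resolve(self, token, user, clearance):
        """Return the original value if the caller's clearance is sufficient."""
        value, level = self._store[token]
        allowed = clearance >= level
        # Log every attempt, including denied ones.
        self.audit_log.append(
            {"ts": time.time(), "user": user, "token": token, "allowed": allowed}
        )
        if not allowed:
            raise PermissionError(f"{user} may not resolve {token}")
        return value

    def rotate(self):
        """Issue fresh placeholders and return the old -> new mapping,
        so that stored documents can be rewritten by the caller."""
        renamed, new_store = {}, {}
        for old, entry in self._store.items():
            label = old.strip("[]").rsplit("_", 1)[0]
            new = f"[{label}_{secrets.token_hex(4)}]"
            renamed[old] = new
            new_store[new] = entry
        self._store = new_store
        return renamed


vault = KeyVault()
vault.put("[PERSON_1]", "Alice Example", Level.CONFIDENTIAL)
name = vault.resolve("[PERSON_1]", user="auditor", clearance=Level.TOP_SECRET)
```

After `rotate()`, old placeholders no longer resolve, which is what limits the usefulness of a leaked mapping.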

## Application Scenarios: Multi-Domain Privacy Protection Practices

Yomotsusaka can be applied in multiple domains:
- **Enterprise Knowledge Base Retrieval**: Process internal documents and store them in vector databases; employees can query via natural language without sensitive information leakage risks;
- **Medical Document Analysis**: Once medical records are desensitized, AI can support doctors with diagnostic reference material and drug interaction checks;
- **Legal Document Review**: Process contracts/case materials; AI aids in clause analysis and risk assessment;
- **Financial Compliance Review**: Process transaction records/customer data; AI performs abnormal transaction detection and anti-money laundering analysis.

## Key Technical Implementation Points

### Local Model Selection
Local inference uses open-source models. Common choices are the Llama series (general-purpose), the Mistral series (strong performance), and the Phi series (small and efficient). The selection criteria are model capability, inference efficiency, and open-source license terms.

### Batch Processing Architecture
- Document Chunking: Split large documents into model-processable fragments;
- Parallel Processing: Process multiple documents in parallel across CPU cores or GPUs;
- Incremental Update: Support incremental processing of new documents without reprocessing the entire library.
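
The three points above might look like this in outline. This is only a sketch under stated assumptions: `chunk_text`, `process_doc`, and `process_library` are hypothetical names, `process_chunk` stands for whatever function calls the local model, and content hashing is one possible way to detect documents that need reprocessing.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def chunk_text(text, max_chars=2000, overlap=200):
    """Split text into overlapping fragments small enough for the model."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    step = max_chars - overlap
    stop = max(len(text) - overlap, 1)
    return [text[i:i + max_chars] for i in range(0, stop, step)]


def process_doc(text, process_chunk):
    """Run the per-chunk function over every fragment of one document."""
    return [process_chunk(chunk) for chunk in chunk_text(text)]


def process_library(docs, seen_hashes, process_chunk, workers=4):
    """Process only new or changed documents, several at a time.

    docs maps doc_id -> text; seen_hashes maps doc_id -> digest of the
    version processed last time and is updated in place.
    """
    pending = {}
    for doc_id, text in docs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if seen_hashes.get(doc_id) != digest:
            pending[doc_id] = (text, digest)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            doc_id: pool.submit(process_doc, text, process_chunk)
            for doc_id, (text, _) in pending.items()
        }
        results = {doc_id: fut.result() for doc_id, fut in futures.items()}

    for doc_id, (_, digest) in pending.items():
        seen_hashes[doc_id] = digest
    return results
```

Threads are adequate when `process_chunk` mostly waits on a local inference server; for in-process CPU-bound inference, a process pool or batched GPU calls may be a better fit.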

### Agent Integration
Yomotsusaka can be integrated with mainstream frameworks: LangChain (as a document loading or post-processing step), LlamaIndex (as a node converter), and custom agents (through API calls).
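
For the custom-agent case, the integration can be as thin as a wrapper that desensitizes documents before the agent sees them. The sketch below is framework-agnostic and deliberately avoids any LangChain or LlamaIndex API; `guarded_call`, `rehydrate`, and the callables passed to them are hypothetical names chosen for illustration.

```python
import re
from typing import Callable, List, Optional

PLACEHOLDER = re.compile(r"\[[A-Z]+_\w+\]")


def guarded_call(
    prompt: str,
    documents: List[str],
    sanitize: Callable[[str], str],
    agent: Callable[[str, List[str]], str],
) -> str:
    """Desensitize every document, then hand only the clean text to the agent."""
    clean_docs = [sanitize(doc) for doc in documents]
    return agent(prompt, clean_docs)


def rehydrate(answer: str, resolve: Callable[[str], Optional[str]]) -> str:
    """Swap placeholders back for an authorized reader.

    resolve returns the original value, or None when the placeholder is
    unknown or access is denied; such placeholders are left in place.
    """
    return PLACEHOLDER.sub(lambda m: resolve(m.group(0)) or m.group(0), answer)


# Stand-ins for the real sanitizer and agent, for illustration only.
def fake_sanitize(doc: str) -> str:
    return doc.replace("Alice", "[PERSON_1]")


def fake_agent(prompt: str, docs: List[str]) -> str:
    return f"{prompt}: {docs[0]}"


answer = guarded_call("Summarize", ["Alice signed the contract"], fake_sanitize, fake_agent)
restored = rehydrate(answer, {"[PERSON_1]": "Alice"}.get)
```

Keeping re-identification in a separate step means the agent and any cloud model behind it only ever handle placeholders.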

## Limitations and Future Development Directions

### Limitations
- "Best-effort" strategy: Cannot guarantee absolute privacy; attackers may recover information via side channels/statistical inference;
- Model capability limitations: Local open-source models may have lower entity recognition accuracy than cloud models;
- Performance overhead: Local inference requires sufficient computing resources; large-scale processing is time-consuming;
- Complex key management: The controlled key system adds architectural complexity and must itself be secured and maintained.

### Future Directions
- Integrate differential privacy to provide formal, mathematical privacy guarantees;
- Combine with federated learning to support privacy-preserving distributed training;
- Use TEE (Trusted Execution Environment) to improve security;
- Promote standardized interfaces for privacy-protected document processing.
