Yomotsusaka: A Privacy Data Firewall Solution for Agent Workflows

Introducing the Yomotsusaka project, a privacy data firewall designed for agent workflows. It uses open-source large language models for batch preprocessing to desensitize private documents into searchable lists and controlled keys.

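Tags: privacy protection, data desensitization, agent workflows, local LLM, open-source models, document processing, PII protection, data security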

Published 2026-05-22 23:45 | Recent activity 2026-05-22 23:52 | Estimated read 8 min

Section 01

Introduction: Yomotsusaka - Privacy Data Firewall for Agent Workflows

Yomotsusaka is a privacy data firewall project designed for agent workflows. Its core idea is to use locally run open-source large language models to batch-preprocess private documents, turning the originals into "searchable lists" and "controlled keys". This keeps sensitive data private while preserving the documents' value for agent retrieval and analysis, balancing privacy with the use of AI capabilities.

Section 02

Privacy Challenges in the Agent Era

As LLM-driven agent systems spread, protecting sensitive data while still using AI capabilities has become a key issue. Uploading original documents directly to cloud LLMs carries risks such as data leakage, compliance pressure (e.g., GDPR, HIPAA), and blurred trust boundaries. Traditional desensitization methods either remove too much information, leaving documents useless for analysis, or retain risky information, so they struggle to balance privacy and usability.

Section 03

Yomotsusaka Project Overview and Design Philosophy

Yomotsusaka (黄泉坂) is a privacy data firewall for agent workflows, built around using local open-source LLMs to preprocess private documents. Its design philosophy rests on four points: local-first (sensitive data preprocessing is executed locally), layered desensitization (different strategies based on sensitivity), verifiability (a transparent and auditable desensitization process), and a "best-effort" strategy that balances practicality and security.

Section 04

Core Mechanisms: Document Desensitization and Controlled Key System

Document Desensitization Process

Entity Recognition and Classification: Locally run open-source LLMs identify PII (names, ID numbers, etc.), organization-sensitive information, and domain-specific content;
Entity Replacement and Reference: Sensitive entities are replaced with desensitized identifiers (e.g., [PERSON_1]), establishing a controlled key mapping of "original value → desensitized identifier → access control policy";
Searchable List Generation: Desensitized documents are converted into structured lists containing semantic summaries, key topics, relationship graphs, and timelines, preserving retrieval and analysis value.
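
The replacement step above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual implementation: the `KeyStore` class and the `desensitize` function are hypothetical names, and the entity list is supplied by hand as a stand-in for the output of the local LLM.

```python
from dataclasses import dataclass, field


@dataclass
class KeyStore:
    """Hypothetical mapping: original value -> identifier -> policy level."""
    by_value: dict = field(default_factory=dict)   # original value -> identifier
    policy: dict = field(default_factory=dict)     # identifier -> sensitivity level
    counters: dict = field(default_factory=dict)   # label -> last index used

    def identifier_for(self, value: str, label: str, level: str) -> str:
        # Reuse the identifier when the same value appears again.
        if value in self.by_value:
            return self.by_value[value]
        index = self.counters.get(label, 0) + 1
        self.counters[label] = index
        identifier = f"[{label}_{index}]"
        self.by_value[value] = identifier
        self.policy[identifier] = level
        return identifier


def desensitize(text: str, entities: list, store: KeyStore) -> str:
    """Replace each detected entity with its desensitized identifier.

    `entities` is a list of (value, label, level) tuples. In a real
    pipeline it would come from the local LLM; here it is an assumption.
    """
    # Replace longer values first so a short value cannot split a longer one.
    for value, label, level in sorted(entities, key=lambda e: -len(e[0])):
        identifier = store.identifier_for(value, label, level)
        text = text.replace(value, identifier)
    return text


store = KeyStore()
entities = [
    ("Alice Tan", "PERSON", "confidential"),
    ("Acme Corp", "ORG", "internal"),
]
redacted = desensitize(
    "Alice Tan joined Acme Corp. Alice Tan leads R&D.", entities, store
)
print(redacted)
```

Because the same value always maps to the same identifier, relationships between entities survive in the desensitized text, which is what keeps it useful for retrieval.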

Controlled Key System

Key Hierarchy: Classified into public/internal/confidential/top-secret levels based on sensitivity;
Access Control: Combines RBAC and ABAC so that only authorized parties can resolve a key;
Audit Logs: Records all key access operations;
Key Rotation: Updates mappings regularly to limit the impact of a leak.
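
As a rough sketch of how these four pieces could fit together, the Python below uses a single ordered clearance check as a simplified stand-in for the combined RBAC+ABAC policy. The `ControlledKeys` class, its method names, and the `[KEY_…]` identifier format used on rotation are all hypothetical and are not taken from the project.

```python
import secrets
from datetime import datetime, timezone

# Sensitivity levels from the key hierarchy, ordered from lowest to highest.
LEVELS = ["public", "internal", "confidential", "top-secret"]


class ControlledKeys:
    """Hypothetical key store with level checks, audit log, and rotation."""

    def __init__(self):
        self._entries = {}    # identifier -> (original value, level)
        self.audit_log = []   # one record per access attempt

    def register(self, identifier, value, level):
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level}")
        self._entries[identifier] = (value, level)

    def resolve(self, identifier, user, clearance):
        """Return the original value if the clearance is high enough."""
        value, level = self._entries[identifier]
        allowed = LEVELS.index(clearance) >= LEVELS.index(level)
        # Log every attempt, including denied ones.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "identifier": identifier,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"{user} may not resolve {identifier}")
        return value

    def rotate(self):
        """Issue fresh random identifiers; return an old -> new map."""
        renamed, fresh = {}, {}
        for old, (value, level) in self._entries.items():
            new = f"[KEY_{secrets.token_hex(4)}]"
            renamed[old] = new
            fresh[new] = (value, level)
        self._entries = fresh
        return renamed


keys = ControlledKeys()
keys.register("[PERSON_1]", "Alice Tan", "confidential")
print(keys.resolve("[PERSON_1]", user="auditor", clearance="top-secret"))
```

After a rotation, the returned old-to-new map would have to be applied to the stored desensitized documents as well; that step is omitted here.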

Section 05

Application Scenarios: Multi-Domain Privacy Protection Practices

Yomotsusaka can be applied in multiple domains:

Enterprise Knowledge Base Retrieval: Process internal documents and store them in vector databases; employees can query via natural language without sensitive information leakage risks;
Medical Document Analysis: After desensitizing medical records, AI assists doctors in diagnostic references and drug interaction checks;
Legal Document Review: Process contracts/case materials; AI aids in clause analysis and risk assessment;
Financial Compliance Review: Process transaction records/customer data; AI performs abnormal transaction detection and anti-money laundering analysis.

Section 06

Key Technical Implementation Points

Local Model Selection

Local inference uses open-source models. Common choices are the Llama series (general-purpose), the Mistral series (excellent performance), and the Phi series (small and efficient). Selection criteria are model capability, inference efficiency, and the terms of the open-source license.

Batch Processing Architecture

Document Chunking: Split large documents into model-processable fragments;
Parallel Processing: Process multiple documents in parallel across CPU cores or GPUs;
Incremental Update: Support incremental processing of new documents without reprocessing the entire library.
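
The three batch-processing ideas above can be illustrated with a short Python sketch. It rests on several assumptions: `chunk_text` and `process_library` are hypothetical names, chunk sizes are counted in characters rather than tokens, the `process` callable stands in for a call to the local model, and a thread pool is used on the assumption that inference runs in a separate server or in native code.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def chunk_text(text, max_chars=1000, overlap=100):
    """Split text into overlapping fragments of at most max_chars characters."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last fragment already reaches the end of the text
        start += max_chars - overlap
    return chunks


def process_library(docs, seen, process, workers=4):
    """Process only new or changed documents, several at a time.

    docs:    document name -> text
    seen:    document name -> SHA-256 of the last processed version (updated in place)
    process: callable applied to each chunk (placeholder for the local model)
    """
    pending = {}
    for name, text in docs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if seen.get(name) != digest:   # unseen or modified since the last run
            pending[name] = (text, digest)

    def run(item):
        name, (text, digest) = item
        return name, [process(chunk) for chunk in chunk_text(text)], digest

    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for name, output, digest in pool.map(run, pending.items()):
            results[name] = output
            seen[name] = digest
    return results


seen = {}
docs = {"a.txt": "hello", "b.txt": "world"}
first_run = process_library(docs, seen, str.upper)
docs["b.txt"] = "world!"
second_run = process_library(docs, seen, str.upper)  # only b.txt is reprocessed
print(first_run, second_run)
```

Hashing the content rather than relying on file timestamps means a document is reprocessed only when its text actually changes.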

Agent Integration

Yomotsusaka can integrate with mainstream frameworks: LangChain (as a document loading or post-processing step), LlamaIndex (as a node converter), and custom agents (via API calls).

Section 07

Limitations and Future Development Directions

Limitations

"Best-effort" strategy: Cannot guarantee absolute privacy; attackers may recover information via side channels/statistical inference;
Model capability limitations: Local open-source models may have lower entity recognition accuracy than cloud models;
Performance overhead: Local inference requires sufficient computing resources; large-scale processing is time-consuming;
Complex key management: The controlled key system increases architectural complexity and must itself be managed carefully.

Future Directions

Integrate differential privacy technology to enhance mathematical guarantees;
Combine federated learning to support distributed privacy training;
Use TEE (Trusted Execution Environment) to improve security;
Promote standardized interfaces for privacy-protected document processing.
