# Large Model Privacy Protection Dataset: Open Resource for PII Detection and Prompt Enhancement

> This is a privacy-aware prompt enhancement dataset designed specifically for LLM applications, containing 10,000 annotated samples, 75% of which are synthetically generated. It supports PII identification, classification, and anonymization, providing a training and evaluation benchmark for building privacy-preserving AI systems.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-18T04:42:35.000Z
- Last activity: 2026-04-18T04:56:17.288Z
- Popularity: 157.8
- Keywords: PII detection, privacy protection, prompt enhancement, synthetic data, LLM security, data anonymization, responsible AI
- Page link: https://www.zingnex.cn/en/forum/thread/pii
- Canonical: https://www.zingnex.cn/forum/thread/pii
- Markdown source: floors_fallback

---

## Introduction: Privacy Challenges in the Era of Large Models

The widespread application of Large Language Models (LLMs) has brought unprecedented convenience, but it has also raised serious privacy concerns. When interacting with AI systems, users often inadvertently include Personally Identifiable Information (PII) in their prompts: sensitive data such as names, addresses, phone numbers, and ID numbers. Once such PII is memorized by the model or exposed during inference, it can lead to serious privacy leaks.

How to effectively identify and protect user privacy while maintaining model usability has become a core issue in responsible AI development. One response from the open-source community is to build high-quality, reusable datasets that provide benchmarks for researching, developing, and evaluating privacy protection techniques.

## Dataset Overview

This dataset is specifically designed for PII detection and privacy-aware prompt enhancement in LLM applications, with the following core features:

### Scale and Composition

- **Total Sample Size**: 10,000 prompt samples
- **Synthetic Data Ratio**: 75% of samples are synthetically generated, ensuring data diversity and privacy security
- **Category Distribution**: 5,000 samples require anonymization (containing PII), 5,000 samples do not (clean data)
- **Subdivision per Category**: Each category contains 2,000 classification samples, of which 1,000 are used for anonymization tasks and 1,000 as clean reference prompts

### Data Format

The dataset is provided in CSV and Excel formats for use in different scenarios. Each record includes the following fields:

| Field Name | Description |
|--------|------|
| Original | Original user prompt |
| Need Anonymization | Whether anonymization is needed (YES/NO) |
| Detect PII Values | JSON-formatted PII detection results, including type and specific value |
| Improved Prompt | Improved prompt after removing sensitive information, preserving original meaning |
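Assuming the four column names above, a record can be loaded with the standard library and its PII annotations parsed from the JSON field. The sample row below is illustrative; the exact JSON schema (a list of `type`/`value` objects) is an assumption based on the field descriptions, so check the actual file before relying on it:

```python
import csv
import io
import json

# A hypothetical record in the dataset's four-column CSV layout.
# Doubled quotes ("") are standard CSV escaping for quotes inside a field.
sample_csv = io.StringIO(
    '"Original","Need Anonymization","Detect PII Values","Improved Prompt"\n'
    '"My name is John Smith, call me at 555-0123","YES",'
    '"[{""type"": ""NAME"", ""value"": ""John Smith""}, '
    '{""type"": ""PHONE"", ""value"": ""555-0123""}]",'
    '"My name is [NAME], call me at [PHONE]"\n'
)

for row in csv.DictReader(sample_csv):
    needs_anonymization = row["Need Anonymization"] == "YES"
    # The PII annotations are stored as a JSON string inside the CSV cell.
    pii_values = json.loads(row["Detect PII Values"]) if needs_anonymization else []
    print(needs_anonymization, [p["type"] for p in pii_values])
```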

## Privacy Protection Driven by Synthetic Data

A notable feature of the dataset is the extensive use of synthetic data (accounting for 75% of the total). This design choice has multiple advantages:

**Avoid Real Privacy Leakage**

Using synthetic data entirely avoids the privacy risks of working with real user data, so researchers can share and publish the dataset without exposing anyone's actual information.

**Support Fair and Privacy-Preserving AI Research**

As a key driver of fair and privacy-preserving AI research, synthetic data enables researchers to develop and validate privacy protection technologies without accessing sensitive real data.

**Ensure Data Diversity**

Through carefully designed synthesis strategies, the dataset covers a wide range of PII types and scenarios, supporting better generalization in models trained on it.

## Dual Task Support

The structure of the dataset supports two core tasks:

**Binary Classification Task (PII vs Non-PII)**

Through the "Need Anonymization" field, PII detection models can be directly trained to determine whether the input prompt contains sensitive information that needs to be processed.

**Multi-Category Anonymization Analysis**

Through the JSON annotations in the "Detect PII Values" field, it supports fine-grained PII type identification (such as age, gender, address, phone number, etc.), providing supervision signals for multi-category classification and sequence labeling tasks.
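A minimal sketch of how the two supervision signals could be derived from a single record, using the column names above (the record contents and JSON schema are illustrative assumptions, not taken from the actual files):

```python
import json

# Hypothetical record mirroring the dataset's fields.
record = {
    "Original": "I am 25 years old and live at 42 Baker Street.",
    "Need Anonymization": "YES",
    "Detect PII Values": (
        '[{"type": "AGE", "value": "25"},'
        ' {"type": "ADDRESS", "value": "42 Baker Street"}]'
    ),
}

# Task 1: binary label for the PII vs non-PII classification task.
binary_label = 1 if record["Need Anonymization"] == "YES" else 0

# Task 2: multi-category labels for fine-grained PII type identification.
pii_types = sorted({item["type"] for item in json.loads(record["Detect PII Values"])})

print(binary_label, pii_types)  # 1 ['ADDRESS', 'AGE']
```

The `value` strings could additionally serve as span annotations for sequence labeling, by locating each value inside the `Original` prompt.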

## Examples of Anonymization Techniques

The anonymization techniques used in the dataset include:

- **Generalization**: Replace specific values with broader categories, e.g., replacing "25 years old" with "20-30 years old"
- **Pseudonymization**: Replace real identifiers with pseudonyms, maintaining data structure while removing identifiability
- **Masking**: Replace sensitive information with placeholders (e.g., [NAME], [PHONE])
- **Combination Strategy**: Flexibly combine the above techniques according to PII type and context
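As a toy illustration of the masking and generalization techniques above, the sketch below uses simple regexes as stand-ins for a real PII detector (which would typically be a trained model, not pattern matching). The patterns and placeholder tokens are assumptions following the `[NAME]`/`[PHONE]` convention shown in the table:

```python
import re

# Illustrative masking rules: pattern -> placeholder token.
# Real PII detection would use a trained model, not regexes.
MASKS = [
    (re.compile(r"\b\d{3}-\d{4}\b"), "[PHONE]"),
    (re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b"), "[NAME]"),
]

def mask(prompt: str) -> str:
    """Masking: replace detected sensitive spans with placeholder tokens."""
    for pattern, token in MASKS:
        prompt = pattern.sub(token, prompt)
    return prompt

def generalize_age(prompt: str) -> str:
    """Generalization: map an exact age to a decade-wide range."""
    def bucket(m: re.Match) -> str:
        low = int(m.group(1)) // 10 * 10
        return f"{low}-{low + 10} years old"
    return re.sub(r"\b(\d{1,3}) years old\b", bucket, prompt)

print(mask("Please contact John Smith at 555-0123"))
print(generalize_age("I am 25 years old"))
```

Pseudonymization would follow the same shape as `mask`, but substitute a consistent fake value (e.g. a name drawn from a fixed pool) instead of a placeholder token, preserving the prompt's structure.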
