# PII Data Desensitization Practice: Comparison of Fine-tuned BERT and Zero-shot LLM Dual-track Solutions

> This article introduces a complete Personally Identifiable Information (PII) detection and desensitization system. By comparing two technical approaches—a fine-tuned BERT model and zero-shot LLM prompt engineering—it demonstrates how to achieve high-precision automatic recognition and desensitization of names and email addresses in real-world scenarios.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-17T12:40:38.000Z
- Last activity: 2026-04-17T12:48:20.907Z
- Popularity: 159.9
- Keywords: PII, data desensitization, BERT, named entity recognition, LLM, zero-shot learning, privacy protection, NLP
- Page link: https://www.zingnex.cn/en/forum/thread/pii-bertllm
- Canonical: https://www.zingnex.cn/forum/thread/pii-bertllm
- Markdown source: floors_fallback

---

## Introduction: Comparing Dual-track PII Desensitization Solutions in Practice

This article walks through a complete PII detection and desensitization system, comparing two technical approaches: a fine-tuned BERT model and zero-shot LLM prompt engineering. It shows how to achieve high-precision recognition and desensitization of names and email addresses in real-world scenarios, and serves as an engineering reference for PII desensitization.

## Background and Problem Definition

Personally Identifiable Information (PII) is data that can identify an individual, such as names, email addresses, and phone numbers. It must be automatically desensitized in scenarios like log analysis, customer service records, and dataset publishing. Traditional rule-based methods perform poorly on name recognition, and manual review cannot scale to large volumes of data, which makes deep learning solutions the mainstream choice.
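The asymmetry between rule-friendly and rule-hostile PII can be shown with a minimal regex baseline. This is an illustrative sketch, not the project's code: emails follow a rigid surface pattern a regex can catch, while names have no such pattern, which is exactly why the rule-based approach falls short.

```python
import re

# A common (non-exhaustive) email pattern; real-world validators are more complex.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")

def redact_emails(text: str) -> str:
    """Replace every email address with a fixed placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)

sample = "Contact Jane Doe at jane.doe@example.com for details."
print(redact_emails(sample))
# The name "Jane Doe" survives untouched: no regex can enumerate all names,
# so name detection needs a learned model.
```

This is why the dual-track designs below both keep a regex component for emails while delegating names to a neural model.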

## Detailed Explanation of Dual-track Technical Solutions

### Fine-tuned BERT Model
Fine-tuned from bert-base-uncased on the WikiNeural dataset, with synthetic email data augmentation (training samples expanded from 28,516 to 37,205). Five label categories are defined (O/B-PER/I-PER/B-EMAIL/I-EMAIL). Training configuration: 3 epochs, learning rate 2e-5, batch size 8, weight decay 0.01.
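The five-label BIO scheme above can be made concrete with a small encoding sketch. The example sentence and helper are hypothetical, not taken from the project's code; they only show how word-level tags map to the label IDs the token classifier is trained on.

```python
# The article's 5-label BIO scheme: Outside, Begin/Inside person, Begin/Inside email.
LABELS = ["O", "B-PER", "I-PER", "B-EMAIL", "I-EMAIL"]
LABEL2ID = {label: i for i, label in enumerate(LABELS)}

# Hypothetical annotated sentence, one tag per word.
words = ["Email", "Ada", "Lovelace", "at", "ada@example.org", "today"]
tags  = ["O",     "B-PER", "I-PER", "O",  "B-EMAIL",         "O"]

label_ids = [LABEL2ID[t] for t in tags]
print(label_ids)  # → [0, 1, 2, 0, 3, 0]
```

Multi-word names use B-PER for the first word and I-PER for continuations, which lets the decoder recover exact entity boundaries at inference time.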

### Zero-shot LLM Prompt Engineering
Uses the Qwen2.5-1.5B-Instruct model, obtaining structured JSON output through few-shot prompting to reduce hallucination. Post-processing includes hallucination filtering, email repair, and a regex fallback.
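The post-processing chain described above can be sketched as follows. This is a hedged reconstruction, not the project's actual code: the JSON schema (`{"names": [...], "emails": [...]}`) is an assumption, and the real pipeline also performs email repair, which is omitted here.

```python
import json
import re

EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")

def postprocess(raw_llm_output: str, source_text: str) -> dict:
    """Parse LLM JSON, filter hallucinated entities, regex-fallback for emails."""
    try:
        ents = json.loads(raw_llm_output)
    except json.JSONDecodeError:
        ents = {}
    # Hallucination filter: keep only spans that literally occur in the text.
    names = [n for n in ents.get("names", []) if n in source_text]
    emails = [e for e in ents.get("emails", []) if e in source_text]
    # Regex fallback: add any email the model missed.
    for match in EMAIL_RE.findall(source_text):
        if match not in emails:
            emails.append(match)
    return {"names": names, "emails": emails}

text = "Ping Bob Smith or mail ops@example.com."
raw = '{"names": ["Bob Smith", "Alice"], "emails": []}'
print(postprocess(raw, text))
# "Alice" is dropped (hallucination), the missed email is recovered by regex.
```

The substring check is a deliberately strict hallucination test; it explains both the high email recall (the regex fallback is exhaustive) and some lost name recall (any paraphrased span is discarded).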

## Core Technical Innovations

1. **Hybrid Inference Pipeline**: The BERT solution uses a layered strategy of regex + neural network, balancing the determinism of rules and the generalization ability of the model;
2. **Intelligent Tokenization Handling**: Solves the problem of BERT subword tokenization breaking entity boundaries, ensuring alignment between labels and tokens;
3. **Robustness Enhancement**: The BERT side has confidence filtering and label correction; the LLM side has hallucination detection and text replacement mechanisms.
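Point 2, the subword-alignment problem, deserves a concrete sketch. BERT's WordPiece tokenizer can split a word like "Lovelace" into several pieces, so word-level labels must be propagated to the first sub-token and the continuation pieces masked with -100 (the index Hugging Face's loss ignores). The hard-coded `word_ids` below simulate tokenizer output; real code would obtain them from `tokenizer(..., is_split_into_words=True).word_ids()`.

```python
def align_labels(word_labels, word_ids, label2id):
    """Map word-level BIO labels onto sub-tokens, masking continuations with -100."""
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:                 # special tokens like [CLS] / [SEP]
            aligned.append(-100)
        elif wid != prev:               # first sub-token of a word: keep its label
            aligned.append(label2id[word_labels[wid]])
        else:                           # continuation sub-token: ignore in the loss
            aligned.append(-100)
        prev = wid
    return aligned

label2id = {"O": 0, "B-PER": 1, "I-PER": 2}
word_labels = ["O", "B-PER", "I-PER"]   # "Email Ada Lovelace"
word_ids = [None, 0, 1, 2, 2, None]     # "Love" and "##lace" share word index 2
print(align_labels(word_labels, word_ids, label2id))
# → [-100, 0, 1, 2, -100, -100]
```

Without this alignment, a split name would receive labels on the wrong pieces and the reported entity boundaries would drift, which is exactly the failure mode the article's pipeline guards against.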

## Comparative Analysis of Experimental Results

### Fine-tuned BERT Performance
- Token-level accuracy 99.53%; entity-level precision 96.98%, recall 97.31%, F1 97.15%; false positive rate 0.25%, miss rate 1.36%.

### Zero-shot LLM Performance
| Metric | Name (Strict) | Name (Partial) | Email |
|---|---|---|---|
| Precision | 82.93% | 86.99% | 83.93% |
| Recall | 51.78% | 52.71% | 100% |
| F1 | 63.75% | 65.64% | 91.26% |

### Comprehensive Comparison
| Dimension | Fine-tuned BERT | Zero-shot LLM |
|---|---|---|
| Name F1 | 97.15% | 65.64% |
| Email F1 | >99% | 91.26% |
| Requires Training | Yes (7 mins) | No |
| Inference Speed | Fast (~15 samples/sec) | Slow (~1 sample/sec) |
| Adaptability | Needs retraining | High |
| Hallucination Risk | None | Mitigated |

## Error Pattern Analysis

### Fine-tuned BERT Errors
1. False positives on common words (e.g., "No" misclassified as a name);
2. Sensitivity to tokenization;
3. Missed entities with unseen naming patterns.

### Zero-shot LLM Errors
1. Low recall on names;
2. Inaccurate entity boundaries;
3. Confusion between email components;
4. Over-identification of non-name entities.

## Key Engineering Practice Points and Future Optimization

### Engineering Practice
- Data preparation: data augmentation via the command line (`python main.py augment --email-ratio 0.5`);
- Training and evaluation: automated workflow (`python main.py train` / `python main.py evaluate`);
- Production inference: command-line invocation (`python main.py predict`).
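The subcommand interface above could be wired with `argparse` roughly as follows. This is a hypothetical sketch of how `main.py` might dispatch its `augment`/`train`/`evaluate`/`predict` commands; the project's actual argument names and structure may differ.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="main.py", description="PII desensitization CLI")
    sub = parser.add_subparsers(dest="command", required=True)

    aug = sub.add_parser("augment", help="augment training data with synthetic emails")
    aug.add_argument("--email-ratio", type=float, default=0.5,
                     help="fraction of samples receiving injected emails")

    sub.add_parser("train", help="fine-tune the BERT tagger")
    sub.add_parser("evaluate", help="report token- and entity-level metrics")

    pred = sub.add_parser("predict", help="desensitize input text")
    pred.add_argument("--text", help="text to desensitize")
    return parser

# Mirrors the documented invocation: python main.py augment --email-ratio 0.5
args = build_parser().parse_args(["augment", "--email-ratio", "0.5"])
print(args.command, args.email_ratio)  # → augment 0.5
```

Keeping all four stages behind one entry point makes the data-prep → train → evaluate → predict workflow reproducible from a single script.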

### Future Directions
Future work includes a hybrid BERT + LLM system, constrained decoding, a model upgrade to DeBERTa-v3, probability calibration, more diverse synthetic email generation, and active learning.

## Summary of Practical Application Value

The project provides a complete technical-selection and implementation reference for PII desensitization: choose the fine-tuned BERT model when precision matters, and the zero-shot LLM for rapid validation without training. The code repository is clearly structured and works well as a hands-on tutorial for NER and desensitization techniques.
