Zing Forum

PII Data Desensitization Practice: Comparison of Fine-tuned BERT and Zero-shot LLM Dual-track Solutions

This article introduces a complete Personally Identifiable Information (PII) detection and desensitization system. By comparing two technical approaches, a fine-tuned BERT model and zero-shot LLM prompt engineering, it demonstrates how to achieve high-precision automatic recognition and desensitization of names and email addresses in real-world scenarios.

Tags: PII Desensitization · BERT Named Entity Recognition · LLM Zero-shot Learning · Privacy Protection · NLP
Published 2026-04-17 20:40 · Recent activity 2026-04-17 20:48 · Estimated read: 6 min

Section 01

Introduction: Practice of Comparing Dual-track PII Data Desensitization Solutions

This article introduces a complete PII detection and desensitization system, comparing two technical approaches: a fine-tuned BERT model and zero-shot LLM prompt engineering. It shows how to achieve high-precision recognition and desensitization of names and email addresses in real-world scenarios, providing an engineering reference for PII desensitization.

Section 02

Background and Problem Definition

Personally Identifiable Information (PII) is data that can identify an individual, such as names, email addresses, and phone numbers. It must be automatically desensitized in scenarios such as log analysis, customer service records, and dataset publishing. Traditional rule-based methods perform poorly on name recognition, and manual review cannot scale to large volumes of data, which makes deep learning solutions the mainstream choice.
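To make the limitation concrete, here is a minimal sketch (the sample text and patterns are illustrative, not from the project): a simple regex finds email addresses reliably, but a capitalization heuristic for names both over- and under-matches.

```python
import re

# A pragmatic (not RFC-complete) email pattern works well for emails...
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

text = "Contact Alice Johnson at alice.j@example.com or Bob at bob@test.org."

print(EMAIL_RE.findall(text))  # ['alice.j@example.com', 'bob@test.org']

# ...but there is no comparable regex for person names: a capitalized-word
# heuristic over-matches ordinary words and would miss unusual names.
NAME_HEURISTIC = re.compile(r"\b[A-Z][a-z]+\b")
print(NAME_HEURISTIC.findall(text))
# ['Contact', 'Alice', 'Johnson', 'Bob'] -- 'Contact' is a false positive
```

This asymmetry is why the article pairs deterministic rules (for emails) with a learned model (for names).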

Section 03

Detailed Explanation of Dual-track Technical Solutions

Fine-tuned BERT Model

Fine-tuned from bert-base-uncased on the WikiNeural dataset, with synthetic email data augmentation (samples expanded from 28,516 to 37,205). Five label categories are defined (O/B-PER/I-PER/B-EMAIL/I-EMAIL). Training configuration: 3 epochs, learning rate 2e-5, batch size 8, weight decay 0.01.
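The five-category BIO label scheme can be sketched as follows (the label names come from the article; the id mapping and sample sentence are illustrative):

```python
# BIO label scheme for token classification, per the article.
LABELS = ["O", "B-PER", "I-PER", "B-EMAIL", "I-EMAIL"]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}

# Word-level annotation for an illustrative sentence:
words = ["Email", "John", "Smith", "at", "js@example.com"]
tags  = ["O", "B-PER", "I-PER", "O", "B-EMAIL"]
ids   = [label2id[t] for t in tags]
print(ids)  # [0, 1, 2, 0, 3]
```

These integer ids are what the token-classification head predicts; the hyperparameters above (3 epochs, lr 2e-5, batch size 8, weight decay 0.01) would be passed to the trainer configuration.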

Zero-shot LLM Prompt Engineering

The Qwen2.5-1.5B-Instruct model is used, producing structured JSON output through few-shot prompting to curb hallucination. Post-processing includes hallucination filtering, email repair, and a regex fallback.
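The post-processing steps can be sketched like this (the function name and JSON schema are assumptions for illustration; the project's actual interfaces may differ):

```python
import json
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def postprocess(llm_output: str, source_text: str) -> dict:
    """Filter hallucinated entities and fall back to regex for emails."""
    try:
        entities = json.loads(llm_output)
    except json.JSONDecodeError:
        entities = {"names": [], "emails": []}

    # Hallucination filtering: keep only spans that literally occur in the input.
    names = [n for n in entities.get("names", []) if n in source_text]
    emails = [e for e in entities.get("emails", []) if e in source_text]

    # Regex fallback: recover any emails the model missed.
    for e in EMAIL_RE.findall(source_text):
        if e not in emails:
            emails.append(e)
    return {"names": names, "emails": emails}

text = "Please reach Jane Doe at jane@corp.io."
raw = '{"names": ["Jane Doe", "John Roe"], "emails": []}'
print(postprocess(raw, text))
# {'names': ['Jane Doe'], 'emails': ['jane@corp.io']}
```

Here "John Roe" is dropped because it never appears in the source text, and the missed email is recovered by the regex fallback.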

Section 04

Core Technical Innovations

  1. Hybrid Inference Pipeline: the BERT solution uses a layered regex + neural network strategy, balancing the determinism of rules with the generalization ability of the model;
  2. Intelligent Tokenization Handling: solves the problem of BERT subword tokenization breaking entity boundaries, ensuring labels stay aligned with tokens;
  3. Robustness Enhancement: the BERT side adds confidence filtering and label correction; the LLM side adds hallucination detection and text replacement mechanisms.
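Point 2, subword label alignment, can be sketched with a word_ids mapping of the kind fast tokenizers expose (the helper below is illustrative, not the project's code):

```python
def align_labels(word_labels, word_ids, label2id, ignore_index=-100):
    """Expand word-level BIO labels to subword tokens.

    Only the first subword of each word keeps a real label; continuation
    subwords and special tokens get ignore_index so the loss skips them.
    """
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:               # special tokens like [CLS]/[SEP]
            aligned.append(ignore_index)
        elif wid != prev:             # first subword of a new word
            aligned.append(label2id[word_labels[wid]])
        else:                         # continuation subword of same word
            aligned.append(ignore_index)
        prev = wid
    return aligned

label2id = {"O": 0, "B-PER": 1, "I-PER": 2, "B-EMAIL": 3, "I-EMAIL": 4}
# "Alice Johnson" where "Johnson" splits into "John" + "##son":
word_labels = ["B-PER", "I-PER"]
word_ids = [None, 0, 1, 1, None]      # [CLS], Alice, John, ##son, [SEP]
print(align_labels(word_labels, word_ids, label2id))
# [-100, 1, 2, -100, -100]
```

Masking continuation subwords keeps entity boundaries intact: a B- label is never duplicated across the pieces of a single word.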

Section 05

Comparative Analysis of Experimental Results

Fine-tuned BERT Performance

  • Token-level accuracy 99.53%; entity-level precision 96.98%, recall 97.31%, F1 97.15%; false positive rate 0.25%, miss rate 1.36%.

Zero-shot LLM Performance

Metric      Name (Strict)   Name (Partial)   Email
Precision   82.93%          86.99%           83.93%
Recall      51.78%          52.71%           100%
F1          63.75%          65.64%           91.26%
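For reference, the entity-level figures follow the standard precision/recall/F1 definitions; the counts below are illustrative values chosen to roughly reproduce the strict-name row, not the project's actual tallies:

```python
def prf(tp: int, fp: int, fn: int):
    """Entity-level precision, recall, F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical counts approximating the strict name-matching row:
p, r, f = prf(tp=102, fp=21, fn=95)
print(f"P={p:.2%} R={r:.2%} F1={f:.2%}")  # P=82.93% R=51.78% F1=63.75%
```

The low F1 for names is driven almost entirely by recall: the false-negative count dominates the false-positive count.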

Comprehensive Comparison

Dimension            Fine-tuned BERT          Zero-shot LLM
Name F1              97.15%                   65.64%
Email F1             >99%                     91.26%
Requires training    Yes (~7 min)             No
Inference speed      Fast (~15 samples/sec)   Slow (~1 sample/sec)
Adaptability         Needs retraining         High
Hallucination risk   None                     Mitigated

Section 06

Error Pattern Analysis

Fine-tuned BERT Errors

  1. False positives on common words (e.g., "No" misjudged as a name);
  2. Sensitivity to tokenization;
  3. Misses on unseen naming patterns.

Zero-shot LLM Errors

  1. Low recall for names;
  2. Inaccurate entity boundary recognition;
  3. Confusion of email components;
  4. Over-identification of non-name entities.

Section 07

Key Engineering Practice Points and Future Optimization

Engineering Practice

  • Data preparation: Data augmentation via command line (python main.py augment --email-ratio 0.5);
  • Training evaluation: Automated workflow (python main.py train/evaluate);
  • Production inference: Supports command line invocation (python main.py predict).

Future Directions

Planned directions: a hybrid system, constrained decoding, a model upgrade (DeBERTa-v3), probability calibration, more diverse email generation, and active learning.

Section 08

Summary of Practical Application Value

The project provides a complete technical-selection and implementation reference for PII desensitization: choose fine-tuned BERT when precision matters, and the zero-shot LLM for rapid validation. The code repository is clearly structured and serves as a practical reference for NER and desensitization techniques.