# An Innovative Method for Detecting Personally Identifiable Information (PII) in Traffic Accident Narrative Texts Using Agent Workflows

> This article introduces a large language model (LLM) agent workflow for detecting Personally Identifiable Information (PII) in traffic accident narrative texts. The workflow runs entirely on local infrastructure to preserve privacy and reaches an F1 score of 0.87.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-15T05:03:20.000Z
- Last activity: 2026-04-20T02:16:58.936Z
- Popularity: 77.0
- Keywords: PII detection, agent workflow, traffic accident analysis, privacy protection, large language models, hybrid architecture, data de-identification
- Page link: https://www.zingnex.cn/en/forum/thread/llm-arxiv-2604-15369v1
- Canonical: https://www.zingnex.cn/forum/thread/llm-arxiv-2604-15369v1
- Markdown source: floors_fallback

---

## [Introduction] An Innovative Method for Detecting PII in Traffic Accident Texts Using Agent Workflows

This article proposes an LLM-based agent workflow for detecting Personally Identifiable Information (PII) in traffic accident narrative texts. The solution uses a hybrid architecture that combines a rule engine with LLMs, runs entirely locally to preserve privacy, and reaches an F1 score of 0.87. It addresses two limitations of traditional methods: manual review does not scale to large datasets, and rule-based tools miss context-dependent PII. The result balances data utilization against privacy protection.

## Background and Challenges: The PII Dilemma in Traffic Accident Texts

Traffic accident report narratives contain key contextual information, but that information is interleaved with PII such as names, addresses, and license plates, which restricts how widely the data can be shared and used. Manual detection cannot keep up with large-scale data, and existing rule-based solutions struggle to capture context-dependent PII (for example, "Master Wang" identifies a person only in certain contexts). A method is needed that keeps the data useful while meeting privacy regulation requirements.

## Core Solution: Hybrid Agent Workflow Architecture

**Hybrid Extractor**: PII is split into two classes. Structured types (phone numbers, emails, etc.) are identified quickly by the rule engine Microsoft Presidio. Context-dependent types (names, addresses, etc.) are handled by a domain-adapted, fine-tuned LLM. Each class is routed to the method best suited to it.
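The routing idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: two simplified regular expressions stand in for the Presidio rule engine, `fake_llm_extract` is a hypothetical stub standing in for the fine-tuned LLM, and the names `Span`, `rule_extract`, and `hybrid_extract` are invented for this example.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    label: str
    text: str
    source: str  # "rule" or "llm"


# Simplified stand-in patterns (assumption); a real deployment would rely on
# the rule engine's own recognizers rather than these two regexes.
STRUCTURED_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def rule_extract(text: str) -> List[Span]:
    """Find structured PII with deterministic patterns."""
    spans = []
    for label, pattern in STRUCTURED_PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append(Span(m.start(), m.end(), label, m.group(), "rule"))
    return spans


def hybrid_extract(text: str, llm_extract: Callable[[str], List[Span]]) -> List[Span]:
    """Combine rule hits with LLM hits, preferring rules where they overlap."""
    rule_spans = rule_extract(text)
    merged = list(rule_spans)
    for s in llm_extract(text):
        overlaps = any(s.start < r.end and r.start < s.end for r in rule_spans)
        if not overlaps:
            merged.append(s)
    return sorted(merged, key=lambda s: s.start)


def fake_llm_extract(text: str) -> List[Span]:
    """Hypothetical stub for the LLM: flags one hard-coded name if present."""
    name = "Master Wang"
    idx = text.find(name)
    if idx == -1:
        return []
    return [Span(idx, idx + len(name), "NAME", name, "llm")]
```

Calling `hybrid_extract(narrative, fake_llm_extract)` returns the spans in document order, each tagged with the component that produced it. Preferring the rule hit on overlap is a design choice made for this sketch; the source does not say how conflicts between the two extractors are resolved.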

**Validator**: An agent-based validator filters false positives through an evidence reasoning mechanism. For every item it judges to be PII, the model must cite the specific passage in the text that supports the judgment, which lowers the false positive rate.
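One simple way to enforce the evidence requirement is a grounding check on the validator's output. The sketch below assumes the validator model returns, for each candidate, a verdict plus a quoted evidence string; the `Judgment` structure and both function names are hypothetical, and the verbatim-substring rule is an assumption of this example rather than the mechanism described in the source.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Judgment:
    candidate: str  # text the extractor flagged as PII
    label: str      # proposed PII type, e.g. "NAME"
    is_pii: bool    # verdict returned by the validator model
    evidence: str   # passage the model quotes to justify the verdict


def evidence_is_grounded(narrative: str, j: Judgment) -> bool:
    """Accept evidence only if it is quoted verbatim from the narrative
    and actually contains the candidate it is meant to support."""
    evidence = j.evidence.strip()
    return bool(evidence) and evidence in narrative and j.candidate in evidence


def filter_candidates(narrative: str, judgments: List[Judgment]) -> List[Judgment]:
    """Keep candidates judged to be PII whose evidence passes the check."""
    return [j for j in judgments if j.is_pii and evidence_is_grounded(narrative, j)]
```

A candidate whose cited evidence does not appear in the narrative is dropped, so a verdict backed only by invented context cannot pass through as a detection.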

## Ensemble Learning and Local Deployment

For complex PII such as home addresses and alphanumeric identifiers, the workflow adds an ensemble learning strategy: multiple LLM instances are called in parallel and their outputs are synthesized into a single result. The entire system is deployed on-premises, so no data leaves the local environment, which preserves data sovereignty and privacy.
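The source does not specify how the parallel outputs are synthesized. As one plausible reading, the sketch below uses a vote threshold: an entity is kept only if at least `min_votes` instances report it. The function name `ensemble_extract`, the `Entity` tuple shape, and the voting rule itself are assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Set, Tuple

Entity = Tuple[str, str]  # (label, text), a shape assumed for this sketch


def ensemble_extract(
    text: str,
    extractors: List[Callable[[str], Iterable[Entity]]],
    min_votes: int,
) -> Set[Entity]:
    """Run every extractor in parallel and keep entities that enough agree on."""
    with ThreadPoolExecutor(max_workers=max(1, len(extractors))) as pool:
        # Deduplicate per extractor so each instance casts at most one vote.
        results = list(pool.map(lambda fn: set(fn(text)), extractors))
    votes = Counter(entity for result in results for entity in result)
    return {entity for entity, count in votes.items() if count >= min_votes}
```

With three instances and `min_votes=2`, an address reported by only one instance is discarded while one reported by two or three is kept. Raising the threshold trades recall for precision.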

## Experimental Results: High-Precision Performance Verification

Evaluated on a real-world traffic accident dataset, the workflow achieves a precision of 0.82, a recall of 0.94, an F1 score of 0.87, and an accuracy of 0.96, significantly outperforming baseline methods. Ablation experiments show that the ensemble LLM extraction and validator components are particularly effective at improving the detection of complex PII.
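As a quick consistency check, F1 is the harmonic mean of precision and recall. Recomputing it from the rounded figures above gives about 0.876, which agrees with the reported 0.87 once rounding of the inputs is taken into account. The helper name `f1_score` below is just for this illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported (rounded) values from the evaluation.
f1 = f1_score(0.82, 0.94)
print(f"F1 recomputed from rounded inputs: {f1:.3f}")
```

The high recall paired with lower precision means the system misses few PII items at the cost of some over-flagging, which is usually the safer side to err on for de-identification.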

## Application Value: Balancing Data Utilization and Privacy Protection

The workflow offers a practical path for privacy-preserving processing of traffic accident data. Research institutions can use the data while staying compliant, and governments can balance open data against protection. It also demonstrates the value of combining large language models with domain knowledge to solve complex problems.

## Insights and Outlook: Future Potential of Hybrid Architecture

In privacy-sensitive scenarios, hybrid architectures outperform single-method approaches: pure rule-based systems lack flexibility, while pure model outputs are unpredictable. The same approach could be extended to domains such as medical records and legal documents, where privacy protection and data utilization must likewise be reconciled.
