Traditional NLP vs. Large Language Models in Privacy Policy Classification: Which One Prevails?

A comparative study reveals how traditional machine learning models (TF-IDF + SVM) outperform large language models (LLMs) using zero-shot and few-shot prompting methods in privacy policy multi-label classification tasks, while also delving into the ethical challenges of automated privacy policy analysis.

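Tags: privacy policy, NLP, machine learning, large language models, multi-label classification, SVM, LLM, text classification, data privacy, ethical AI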

Published 2026-05-27 06:44 | Recent activity 2026-05-27 06:47 | Estimated read 7 min

Section 01

[Introduction] Traditional NLP vs. LLMs: Which Is Better for Privacy Policy Classification?

A comparative study finds that, in multi-label privacy policy classification, a traditional machine learning pipeline (TF-IDF + SVM) substantially outperforms a large language model (LLM) prompted in zero-shot and few-shot settings. The study, published on GitHub by Belnadino as the project PRIVACY-POLICY-NLP-CLASSIFIER, also explores the ethical challenges of automated privacy policy analysis.

Section 02

Project Background and Research Motivation

OPP-115 Dataset: The Gold Standard for Privacy Policy Research

This project builds on the OPP-115 corpus created by Wilson et al. in 2016, which contains thousands of manually annotated privacy policy segments covering categories such as first-party data collection, third-party sharing, and data retention. The task is multi-label (a single segment can belong to several categories), and the categories are severely imbalanced. Both properties make the corpus challenging for models.

Research Motivation

Privacy policies are long and complex, and the spread of AI means they must now also address new topics such as the use of user data for model training. The researchers therefore explore NLP techniques for classifying policy text automatically and compare the performance of traditional models with that of LLMs.

Section 03

Methodology: Experimental Design for Traditional NLP and LLMs

Traditional NLP Pipeline

Preprocessing: lowercasing, removal of URLs and special characters, tokenization, stopword removal, lemmatization
Feature Engineering: TF-IDF vectorization (primary), Word2Vec word embeddings, N-gram analysis
Models: SVM (best), Logistic Regression, Random Forest
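
The pipeline above can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the project's actual code: the toy segments, the label names, the `clean` helper, the bigram range, and the `class_weight="balanced"` setting are all assumptions made for the example, and lemmatization is omitted because it would need an extra library such as NLTK or spaCy.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC


def clean(text: str) -> str:
    """Lowercase, drop URLs, and replace non-letter characters with spaces."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


# Toy segments with hypothetical labels; the real corpus is far larger.
segments = [
    "We collect your email address when you register at https://example.com.",
    "We share your information with third-party advertisers.",
    "We retain your data for 12 months and collect usage statistics.",
    "Advertising partners may receive data we collect about you.",
]
labels = [
    {"first_party_collection"},
    {"third_party_sharing"},
    {"data_retention", "first_party_collection"},
    {"third_party_sharing", "first_party_collection"},
]

# Multi-label targets become one 0/1 column per category.
binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(labels)

# One linear SVM per category on top of shared TF-IDF features.
model = make_pipeline(
    TfidfVectorizer(preprocessor=clean, stop_words="english", ngram_range=(1, 2)),
    OneVsRestClassifier(LinearSVC(class_weight="balanced")),
)
model.fit(segments, y)

predicted = model.predict(["We keep your records for two years."])
print(binarizer.inverse_transform(predicted))
```

Training one binary classifier per category (one-vs-rest) is a common way to let an SVM assign several labels to the same segment; whether the project used exactly this wrapper is not stated in the source.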

LLM Experimental Strategy

The Orca Mini v9 1.1B Instruct model was selected and tested with three prompting strategies:

Zero-shot prompting (no examples)
Few-shot prompting (a small number of annotated examples)
Rule-constrained variant (adding explicit classification rules to prompts)
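
To make the three strategies concrete, the sketch below assembles the corresponding prompts as plain strings. The prompt wording, the rule text, the category descriptions, and the helper names `build_prompt` and `parse_labels` are all hypothetical; the source does not publish the prompts that were actually used. Only three categories are shown to keep the example short.

```python
from typing import Optional

# Illustrative subset of categories; the descriptions are paraphrased placeholders.
CATEGORIES = {
    "First Party Collection/Use": "how the service itself collects or uses data",
    "Third Party Sharing/Collection": "data shared with or collected by other parties",
    "Data Retention": "how long user data is stored",
}

# Hypothetical explicit rules for the rule-constrained variant.
RULES = (
    "Rules:\n"
    "1. Answer only with category names from the list, separated by semicolons.\n"
    "2. Assign every category that applies; a segment may have several.\n"
    "3. Do not explain your answer."
)


def build_prompt(
    segment: str,
    examples: Optional[list] = None,
    use_rules: bool = False,
) -> str:
    """Build a zero-shot prompt, or a few-shot one when examples are given."""
    parts = ["Classify the privacy policy segment into one or more categories."]
    category_lines = [f"- {name}: {desc}" for name, desc in CATEGORIES.items()]
    parts.append("Categories:\n" + "\n".join(category_lines))
    if use_rules:
        parts.append(RULES)
    # Each example is a (text, list_of_gold_categories) pair.
    for text, gold in examples or []:
        answer = "; ".join(gold)
        parts.append(f"Segment: {text}\nCategories: {answer}")
    parts.append(f"Segment: {segment}\nCategories:")
    return "\n\n".join(parts)


def parse_labels(reply: str) -> set:
    """Keep only the known category names that appear in the model's reply."""
    lowered = reply.lower()
    return {name for name in CATEGORIES if name.lower() in lowered}


zero_shot = build_prompt("We keep logs for 90 days.", use_rules=True)
few_shot = build_prompt(
    "We keep logs for 90 days.",
    examples=[("We sell data to advertisers.", ["Third Party Sharing/Collection"])],
    use_rules=True,
)
print(few_shot)
```

Parsing the reply against a fixed list of names, rather than trusting free text, keeps the LLM output comparable with the binary label vectors used to score the traditional models.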

Section 04

Experimental Results: Traditional Models Outperform LLMs by a Large Margin

Performance of Traditional Model (SVM)

Metric	Value
Micro F1	0.6865
Macro F1	0.6854
Hamming Loss	0.0893

LLM Performance

Zero-shot (with rules): Micro F1=0.2149, Hamming Loss=0.8217
Few-shot (with rules): Micro F1=0.2050, Hamming Loss=0.5455

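The results show that the traditional model handles multi-label classification and class imbalance far better: the SVM reaches a Micro F1 of 0.6865, more than three times the best LLM score of 0.2149. Because Hamming Loss is the fraction of individual label decisions that are wrong, the LLM's values of 0.5455 and 0.8217 mean that more than half of its label decisions were incorrect, compared with about 9% for the SVM.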
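
For readers unfamiliar with the three metrics reported above, the following sketch computes them from scratch on a small made-up example. The label matrices are invented for illustration and have nothing to do with the study's data; each row is a segment and each column a category.

```python
def _counts(y_true, y_pred, label):
    """True positives, false positives, false negatives for one label column."""
    pairs = [(t[label], p[label]) for t, p in zip(y_true, y_pred)]
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, fn


def _f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(y_true, y_pred):
    """Pool the counts of all labels, then compute a single F1."""
    n_labels = len(y_true[0])
    per_label = [_counts(y_true, y_pred, j) for j in range(n_labels)]
    tp, fp, fn = (sum(col) for col in zip(*per_label))
    return _f1(tp, fp, fn)


def macro_f1(y_true, y_pred):
    """Compute F1 per label, then average; rare labels count as much as common ones."""
    n_labels = len(y_true[0])
    scores = [_f1(*_counts(y_true, y_pred, j)) for j in range(n_labels)]
    return sum(scores) / n_labels


def hamming_loss(y_true, y_pred):
    """Fraction of individual label decisions that are wrong."""
    wrong = sum(
        1
        for t_row, p_row in zip(y_true, y_pred)
        for t, p in zip(t_row, p_row)
        if t != p
    )
    return wrong / (len(y_true) * len(y_true[0]))


y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]

print(round(micro_f1(y_true, y_pred), 4))   # 0.7273
print(round(macro_f1(y_true, y_pred), 4))   # 0.7222
print(hamming_loss(y_true, y_pred))         # 0.25
```

Micro F1 is dominated by frequent categories, while Macro F1 weights every category equally, so reporting both is useful on an imbalanced corpus.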

Section 05

In-depth Analysis: Reasons for Traditional Models' Victory

Task Characteristics and Model Matching

Clear category definitions and distinct keyword features
Dense professional terminology, whose weights TF-IDF captures effectively
Reliance on local keyword combinations, with no need for long-range semantic understanding

SVM + TF-IDF excels at classification in high-dimensional feature spaces, which matches these task requirements.
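
A small worked example may help show why TF-IDF suits keyword-driven categories. The sketch below uses the plain textbook formula, term frequency times log(N / document frequency), on three invented sentences. Library implementations differ in detail; scikit-learn, for instance, applies a smoothed IDF by default.

```python
import math
from collections import Counter

# Invented example sentences, already lowercased and stripped of punctuation.
docs = [
    "we collect your email address and device identifiers",
    "we share your information with third party advertisers",
    "we retain your information for as long as your account is active",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(tokens):
    """Unsmoothed TF-IDF weights for one tokenized document."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }


weights = tfidf(tokenized[1])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} {weight:.4f}")
```

Words that occur in every document, such as "we" and "your", receive a weight of zero, while category-specific terms such as "third", "party", and "advertisers" receive the highest weights. A linear SVM can then separate categories largely on the basis of these distinctive terms.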

Bottlenecks of LLMs

Context length limitation: the model struggles to keep multiple category definitions apart within a single prompt
Sensitivity to class imbalance: the model tends to favor high-frequency categories
Prompt design challenges: simple prompting strategies do not unlock the model's potential

Section 06

Ethical Considerations: Boundaries and Risks of Automated Analysis

Dataset Timeliness

OPP-115 was created in 2016 and does not cover clauses found in the privacy policies of modern AI products (e.g., the use of data for model training), so models trained on it may be outdated.

Consequences of Misclassification

Impaired right to know: users are misinformed when AI training clauses are misclassified
Compliance risks: enterprises relying on the system may violate GDPR/CCPA
Erosion of trust: tool-generated summaries do not match what the policy actually says

Necessity of Human-Machine Collaboration

Automated tools can assist in screening, but final decisions require the participation of legal professionals.

Section 07

Practical Implications and Future Research Directions

Practical Implications

Applicable scenarios for traditional models: Structured classification, resource-constrained environments, high interpretability requirements, severe class imbalance
Value of LLMs: Open-domain understanding, few-shot adaptation, natural language generation, cross-language transfer

Research Limitations and Future Directions

Limitations: Only tested Orca Mini 1.1B, insufficient depth of prompt engineering, limited dataset representativeness
Future: Hybrid architecture (traditional + LLM), continuous learning, multimodal expansion, user studies to evaluate impact

Conclusion: Technology selection should be combined with task characteristics. Automated tools should not replace human judgment, and ethical responsibility should be emphasized.