Zing Forum

Comparative Study of Traditional NLP vs. LLM in Privacy Policy Classification: Which One Prevails?

This article reviews a comparative study that uses the OPP-115 dataset to compare traditional NLP machine learning models (TF-IDF + SVM) with large language models (LLMs) on multi-label classification of privacy policies. It shows where classical methods hold an advantage under class imbalance.

Tags: NLP, privacy policy, machine learning, LLM, multi-label classification, text classification, OPP-115, SVM, TF-IDF, class imbalance
Published 2026-05-27 06:46 | Recent activity 2026-05-27 06:50 | Estimated read 8 min

Section 01

[Introduction] Core Summary of the Comparative Study Between Traditional NLP and LLM in Privacy Policy Classification

The study compares traditional NLP machine learning models (e.g., TF-IDF + SVM) with large language models (LLMs) on multi-label classification of privacy policies. Using the OPP-115 dataset, it focuses on how the models differ under class imbalance and finds a clear advantage for traditional methods on this task. The guiding question is which approach works better for privacy policy classification, a choice that bears on technology selection, resource efficiency, interpretability, and deployment cost.

Section 02

Research Background and Problem Statement

In the digital age, privacy policies are a standard feature of Internet services, but their long and obscure text causes widespread "consent fatigue" among users. Automatically understanding and classifying privacy policies has therefore become a shared concern of academia and industry. The study asks which performs better on multi-label privacy policy classification: traditional NLP methods or LLMs. The answer bears not only on technology selection but also on resource efficiency, interpretability, and real deployment cost.

Section 03

Dataset: OPP-115 Privacy Policy Corpus

The study uses the OPP-115 (Online Privacy Policies, set of 115) benchmark dataset, which contains manually annotated privacy policy texts from 115 websites. Its core annotated categories include:

  • First-party data collection and usage
  • Third-party data sharing and collection
  • Data retention policy
  • Do Not Track
  • Policy change notification

The task is multi-label classification, and the dataset has a severe class imbalance, which makes it challenging for models.
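
To make the multi-label setup and the imbalance concrete, the sketch below encodes each policy segment as a 0/1 indicator vector over the five categories listed above and counts how often each label occurs. The toy annotations and the helper names (`to_indicator`, `label_counts`) are invented for illustration and are not taken from OPP-115 or from the study's code.

```python
from collections import Counter

# The five categories named above; their order fixes the vector layout.
CATEGORIES = [
    "First-party data collection and usage",
    "Third-party data sharing and collection",
    "Data retention policy",
    "Do Not Track",
    "Policy change notification",
]

# Hypothetical annotations: each segment may carry several labels at once.
toy_annotations = [
    {"First-party data collection and usage"},
    {"First-party data collection and usage", "Third-party data sharing and collection"},
    {"First-party data collection and usage"},
    {"Third-party data sharing and collection", "Data retention policy"},
    {"First-party data collection and usage", "Policy change notification"},
    {"Do Not Track"},
]

def to_indicator(labels, categories=CATEGORIES):
    """Turn a set of label names into a 0/1 vector, one slot per category."""
    return [1 if c in labels else 0 for c in categories]

def label_counts(annotations):
    """Count how many segments carry each label."""
    counts = Counter()
    for labels in annotations:
        counts.update(labels)
    return counts

Y = [to_indicator(labels) for labels in toy_annotations]
counts = label_counts(toy_annotations)

# Ratio between the most and least frequent label; 1.0 would mean balanced.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(Y[1], imbalance_ratio)
```

In this toy sample the most frequent label appears four times as often as the rarest ones, which is the kind of skew a classifier has to cope with.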

Section 04

Methodology: Parallel Comparative Experiment Design

Traditional NLP Pipeline

  1. Data Preprocessing: Lowercasing text, removing URLs/emails, cleaning special characters, tokenization, stopword removal, lemmatization
  2. Feature Extraction: TF-IDF vectorization (primary), Word2Vec word embedding, N-gram analysis
  3. Baseline Models: SVM with class weights, logistic regression, random forest
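
The first two steps can be sketched in plain Python as follows. This is a minimal illustration, not the study's code: the stopword list is a tiny placeholder, lemmatization is left out because it needs an external library, Word2Vec and N-grams are not shown, and the TF-IDF weight uses one common variant, tf * (ln(N / df) + 1). A real pipeline would more likely rely on an existing vectorizer from a library.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "at", "for", "of", "or", "the", "to", "us", "we", "with", "your"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")

def preprocess(text):
    """Lowercase, drop URLs and emails, strip special characters, tokenize, remove stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = NON_ALPHA_RE.sub(" ", text)
    return [tok for tok in text.split() if tok not in STOPWORDS]

def tfidf(docs_tokens):
    """Return one {term: weight} dict per document, weight = tf * (ln(N / df) + 1)."""
    n_docs = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))  # document frequency counts each term once per document
    vectors = []
    for tokens in docs_tokens:
        tf = Counter(tokens)
        vectors.append({t: c * (math.log(n_docs / df[t]) + 1.0) for t, c in tf.items()})
    return vectors

# Hypothetical policy snippets, used only to exercise the functions.
docs = [
    "We share your email with partners. Contact us at privacy@example.com",
    "We retain data for 90 days; see https://example.com/retention",
]
tokens = [preprocess(d) for d in docs]
vectors = tfidf(tokens)
print(tokens[0])
print(tokens[1])
```

The resulting sparse vectors are what a baseline model such as a class-weighted SVM would take as input, typically with one binary classifier per category.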

LLM Classification Method

The Orca Mini v9 1.1B Instruct model was selected, and two prompt strategies were tested:

  • Zero-shot prompt: Direct classification without examples
  • Few-shot prompt: Guided by providing annotated examples

The impact of rule constraints (with or without) was also compared.
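
As a rough illustration of how the two prompt strategies and the rule constraint differ, the sketch below assembles a prompt from the same template in each configuration. The wording of the prompt, the `RULES` text, and the function names are assumptions made for this example; the study's actual prompts are not given in the article. No model is called here.

```python
LABELS = [
    "First-party data collection and usage",
    "Third-party data sharing and collection",
    "Data retention policy",
    "Do Not Track",
    "Policy change notification",
]

# Hypothetical rule constraint; the study's real rules are not reproduced here.
RULES = (
    "Rules: answer only with labels from the list, separated by semicolons. "
    "Answer NONE if no label applies."
)

def build_prompt(segment, examples=(), with_rules=True):
    """Build a zero-shot prompt when `examples` is empty, a few-shot prompt otherwise."""
    lines = ["Classify the privacy policy segment into one or more of these labels:"]
    lines += [f"- {label}" for label in LABELS]
    if with_rules:
        lines.append(RULES)
    for text, labels in examples:
        lines.append(f"Segment: {text}")
        lines.append(f"Labels: {'; '.join(labels)}")
    lines.append(f"Segment: {segment}")
    lines.append("Labels:")
    return "\n".join(lines)

def parse_labels(reply):
    """Keep only known labels found in a model reply, discarding anything invented."""
    lowered = reply.lower()
    return [label for label in LABELS if label.lower() in lowered]

segment = "We do not respond to DNT signals."
zero_shot = build_prompt(segment)
few_shot = build_prompt(
    segment,
    examples=[("We keep logs for 30 days.", ["Data retention policy"])],
)
print(few_shot)
print(parse_labels("Do Not Track; Cookie Policy"))
```

Filtering the reply against the fixed label list is one simple guard against a generative model returning categories that do not exist in the annotation scheme.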

Section 05

Experimental Results: Traditional Models Outperform LLM Significantly

Traditional Model Performance

Weighted SVM achieved the best baseline performance:

  • Micro F1: 0.6865
  • Macro F1: 0.6854
  • Hamming Loss: 0.0893 (lower is better)

Under class imbalance, class weighting helps traditional methods avoid ignoring minority classes.
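
The article does not say which weighting scheme was used. One widely used option, shown here as an assumption, is the inverse-frequency heuristic that scikit-learn documents for `class_weight="balanced"`: weight = n_samples / (n_classes * class_count). The function name and the toy label column are illustrative.

```python
def balanced_weights(y):
    """Inverse-frequency weights for one label column: n_samples / (n_classes * class_count)."""
    n = len(y)
    classes = sorted(set(y))
    return {c: n / (len(classes) * y.count(c)) for c in classes}

# Hypothetical label column: 1 marks a rare category present in 2 of 10 segments.
y_rare = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
weights = balanced_weights(y_rare)
print(weights)  # {0: 0.625, 1: 2.5}
```

With these weights, a misclassified positive example of the rare category costs four times as much as a misclassified negative one, so the classifier is pushed away from always predicting "absent".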

LLM Performance

The LLM performed considerably worse:

  • Zero-shot (with rules): Micro F1=0.2149, Hamming Loss=0.8217
  • Few-shot (with rules): Micro F1=0.2050, Hamming Loss=0.5455
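
For readers unfamiliar with the three metrics reported here, the following sketch computes them on 0/1 indicator matrices using their standard definitions. The toy predictions are invented: they mimic a classifier that always outputs the most frequent label, and they are unrelated to the study's data.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted wrongly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp))
    return wrong / total

def _f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for 0/1 indicator matrices."""
    n_labels = len(y_true[0])
    per_label = []
    for j in range(n_labels):
        tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
        per_label.append((tp, fp, fn))
    # Micro: pool the counts over all labels, then compute one F1.
    micro = _f1(*(sum(col) for col in zip(*per_label)))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(_f1(*c) for c in per_label) / n_labels
    return micro, macro

# Toy example with three labels; the prediction always picks the frequent label 0.
y_true = [[1, 0, 0], [1, 1, 0], [1, 0, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0]]

micro, macro = f1_scores(y_true, y_pred)
print(round(micro, 4), round(macro, 4), hamming_loss(y_true, y_pred))
# 0.6667 0.2857 0.25
```

Because macro F1 averages over labels without regard to frequency, it drops sharply when rare labels are missed, while micro F1 is dominated by frequent labels. Read this way, the near-identical micro and macro scores reported for the weighted SVM suggest that rare categories were not handled much worse than frequent ones.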

Key Insights

In this study, classical machine learning (SVM + TF-IDF) clearly outperforms LLM prompting on a structured classification task with class imbalance. Possible reasons include the LLM's lack of domain-specific knowledge, hallucinated outputs arising from its generative nature, a bias toward high-frequency classes under class imbalance, and the limited size of the model tested.

Section 06

Ethical Considerations and Practical Significance

The study explores the ethical dimensions of automated privacy policy analysis:

  1. Data Timeliness: OPP-115 was released in 2016 and does not cover new clauses such as AI training
  2. Misclassification Risk: Automatic systems may misinterpret key clauses, leading users to misjudge privacy risks
  3. Necessity of Human Supervision: Automated tools should assist rather than replace legal professionals' judgments
  4. Bias and Fairness: Biases in training data may be transmitted to the model, underestimating the privacy risks of certain services

Section 07

Implications and Outlook: Technology Selection Needs to Be Pragmatic, Focus on Task Characteristics

Implications

Not every task suits a large model. For structured, domain-specific, class-imbalanced classification, well-designed traditional methods are more cost-effective and more reliable.

Outlook

  • Build updated datasets containing privacy clauses for the AI era
  • Explore fusion strategies between LLM and traditional methods (e.g., LLM data augmentation)
  • Develop interpretable privacy policy analysis tools to help users understand clauses

Against the backdrop of stricter AI regulation, such technologies will become increasingly important. Technology selection should be based on actual data and task characteristics, rather than blindly following trends.