Zing Forum

The Old vs New Debate in Named Entity Recognition: A Comprehensive Comparative Study of Encoder Models and Generative Large Language Models

A bachelor's-thesis-level comparative study of Named Entity Recognition that measures the performance, efficiency, and robustness of a traditional encoder architecture (DeBERTa) against LoRA-fine-tuned generative large language models (Qwen3.5), with the aim of providing empirical evidence for model selection in real-world applications.

Tags: Named Entity Recognition, NER, DeBERTa, Qwen, LoRA, QLoRA, Large Language Models, Encoders, Comparative Study, Natural Language Processing
Published 2026-04-07 08:41 | Recent activity 2026-04-07 15:14 | Estimated read 7 min

Section 01

[Introduction] Comparison of Old and New NER Solutions: Encoder Models vs Generative Large Language Models

This bachelor's-thesis-level study systematically compares a traditional encoder architecture (DeBERTa) with LoRA/QLoRA-fine-tuned generative large language models (Qwen3.5) on Named Entity Recognition (NER), measuring performance, efficiency, and robustness. Its goal is to provide empirical evidence for model selection in real-world applications. The study covers the essential differences between the two technical routes, the experimental design, the key findings, and application recommendations, giving practitioners a reference for weighing the technical trade-offs.

Section 02

Research Background: The Importance of NER Tasks and the Debate Over Technical Routes

NER is a fundamental natural language processing task: it identifies meaningful entities in unstructured text and classifies them by type (e.g., "Berlin" → LOC, "OpenAI" → ORG), supporting downstream applications such as information extraction and knowledge graph construction. Encoder architectures (e.g., BERT, DeBERTa) have long been the mainstream solution for NER, but the rise of generative large language models raises a new question: in resource-constrained environments, should practitioners keep using mature encoders or switch to flexible generative models?

Section 03

Technical Routes and Experimental Design: Essential Differences Between the Two Solutions and Evaluation Framework

Differences in Technical Routes

  • Encoder Solution: Treats NER as token-level classification, assigning each token a BIO tag (e.g., B-LOC marks the first token of a location entity). Advantages: deterministic output, low inference latency, and controllable memory usage. Limitation: it supports only a fixed label set.
  • Generative Solution: Recasts NER as text generation, with the model outputting a list of entities in JSON format. Advantage: flexibility, since entity types can be specified dynamically. Risks: output that fails to parse and hallucinated entities.
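
To make the contrast concrete, here is a minimal Python sketch of the post-processing each route needs. It is an illustration under assumptions, not the study's code: the function names `decode_bio` and `parse_generated` are hypothetical, and the JSON schema (a list of objects with `text` and `type` keys) is assumed, since the source does not specify the exact output format.

```python
import json
from typing import List, Optional, Tuple

Entity = Tuple[str, str]  # (surface text, entity type)


def decode_bio(tokens: List[str], tags: List[str]) -> List[Entity]:
    """Collapse token-level BIO tags into (text, type) entities."""
    entities: List[Entity] = []
    words: List[str] = []
    label: Optional[str] = None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-")
        continues = tag.startswith("I-") and label == tag[2:]
        if words and not continues:
            # The open entity ends on B-, O, or a mismatched I- tag.
            entities.append((" ".join(words), label))
            words, label = [], None
        if starts_new:
            words, label = [token], tag[2:]
        elif continues:
            words.append(token)
    if words:
        entities.append((" ".join(words), label))
    return entities


def parse_generated(text: str) -> Optional[List[Entity]]:
    """Parse a generative model's JSON output; None means a parsing failure."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, list):
        return None
    entities: List[Entity] = []
    for item in data:
        # Assumed schema: {"text": ..., "type": ...}; anything else is a failure.
        if not (isinstance(item, dict)
                and isinstance(item.get("text"), str)
                and isinstance(item.get("type"), str)):
            return None
        entities.append((item["text"], item["type"]))
    return entities


tokens = ["OpenAI", "opened", "an", "office", "in", "New", "York"]
tags = ["B-ORG", "O", "O", "O", "O", "B-LOC", "I-LOC"]
bio_entities = decode_bio(tokens, tags)
ok = parse_generated('[{"text": "Berlin", "type": "LOC"}]')
truncated = parse_generated('[{"text": "Berlin", "type": "LOC"')  # cut-off output
```

The encoder path always yields a well-formed result because decoding is a deterministic pass over tags, whereas the generative path has an explicit failure branch (`None`) that a caller has to handle.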

Experimental Design

  • Datasets: MultiNERD (15 entity types), WNUT-17 (6 entity types).
  • Evaluation Metrics: Task quality (entity-level precision, recall, F1), efficiency (training time, inference latency, memory usage), model scale (trainable vs. total parameters), robustness (JSON parsing failure rate of generative models).
  • Model Configurations: Encoders use DeBERTa-v3-base and DeBERTa-v3-large; generative models use Qwen3.5-4B and Qwen3.5-27B, fine-tuned with QLoRA (4-bit quantization, LoRA rank 16 or 32).
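
The task-quality and robustness metrics listed above can be sketched in a few lines of Python. This is a simplified illustration, not the study's evaluation code: the name `entity_prf` is hypothetical, entities are matched on (text, type) pairs rather than character or token spans, and a prediction of `None` is assumed to stand for an output that failed to parse. Span-based evaluation for encoders is usually stricter than this.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Entity = Tuple[str, str]  # (surface text, entity type)


def entity_prf(gold: List[List[Entity]],
               pred: List[Optional[List[Entity]]]) -> Dict[str, float]:
    """Micro-averaged entity-level precision/recall/F1 plus parse failure rate."""
    tp = n_gold = n_pred = failures = 0
    for g, p in zip(gold, pred):
        n_gold += len(g)
        if p is None:
            # Unparseable output: counts as a failure and contributes no predictions,
            # so all gold entities of this example are missed.
            failures += 1
            continue
        n_pred += len(p)
        # Multiset intersection: each gold entity is matched at most once.
        tp += sum((Counter(g) & Counter(p)).values())
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "parse_failure_rate": failures / len(gold) if gold else 0.0,
    }


gold = [[("Berlin", "LOC"), ("OpenAI", "ORG")], [("Paris", "LOC")]]
pred = [[("Berlin", "LOC"), ("OpenAI", "PER")], None]
scores = entity_prf(gold, pred)
```

In this toy example one of two predicted entities is correct (precision 0.5), one of three gold entities is recovered (recall 1/3), and one of two outputs failed to parse.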

Technical Implementation

The implementation is modular: the data layer handles the different input formats, the training layer uses Trainer for encoders and SFTTrainer for generative models, the inference layer outputs structured results, and the evaluation layer calculates the multi-dimensional metrics.
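
One way to picture such a layering is a small pipeline object whose layers are swappable callables, so that an encoder and a generative model can be plugged into the same evaluation. The sketch below is purely hypothetical: `NerPipeline`, `gazetteer_predict`, and `exact_match` are stand-ins invented for illustration, and the training layer is left out because it depends on the chosen library.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Entity = Tuple[str, str]  # (surface text, entity type)


@dataclass
class NerPipeline:
    """Wires the layers together; each field can be swapped per model family."""
    format_input: Callable[[str], str]      # data layer
    predict: Callable[[str], List[Entity]]  # inference layer (wraps a trained model)
    evaluate: Callable[[List[List[Entity]], List[List[Entity]]],
                       Dict[str, float]]    # evaluation layer

    def run(self, texts: List[str], gold: List[List[Entity]]) -> Dict[str, float]:
        preds = [self.predict(self.format_input(t)) for t in texts]
        return self.evaluate(gold, preds)


# Toy stand-ins so the sketch runs without any model (hypothetical).
GAZETTEER = {"Berlin": "LOC", "OpenAI": "ORG"}


def gazetteer_predict(text: str) -> List[Entity]:
    return [(w, GAZETTEER[w]) for w in text.split() if w in GAZETTEER]


def exact_match(gold: List[List[Entity]],
                preds: List[List[Entity]]) -> Dict[str, float]:
    hits = sum(g == p for g, p in zip(gold, preds))
    return {"exact_match": hits / len(gold) if gold else 0.0}


pipeline = NerPipeline(format_input=str.strip,
                       predict=gazetteer_predict,
                       evaluate=exact_match)
report = pipeline.run(["  OpenAI opened an office in Berlin  "],
                      [[("OpenAI", "ORG"), ("Berlin", "LOC")]])
```

Keeping the evaluation layer independent of the model family is what makes the comparison fair: both routes are scored on the same structured output.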

Section 04

Key Findings: Trade-offs Between Performance, Efficiency, and Robustness

  1. Performance-Efficiency Trade-off: Encoders excel in inference speed and output determinism (no parsing step, hence a zero parsing-failure rate); generative models are flexible but add parsing overhead and the risk of parsing failures.
  2. Value of Parameter-Efficient Fine-Tuning: LoRA and QLoRA significantly reduce memory requirements, making it possible to fine-tune 27B models on consumer GPUs while balancing performance against resource consumption.
  3. Selection of Validation Metrics: Generative NER should be validated on entity-level F1 rather than language-model loss (perplexity); otherwise checkpoints are selected for a target that does not match the task.
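
The memory argument in finding 2 can be illustrated with back-of-the-envelope arithmetic. The sketch below is a rough estimate under stated assumptions, not a measurement from the study: it counts weight storage only (ignoring activations, gradients, optimizer state, and quantization metadata), and the layer dimensions passed to `lora_trainable_params` are hypothetical, not Qwen3.5's actual shapes.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def lora_trainable_params(d_in: int, d_out: int, rank: int, n_matrices: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) to each adapted matrix."""
    return rank * (d_in + d_out) * n_matrices


# A 27B-parameter model: 16-bit weights versus 4-bit quantized weights.
fp16_gb = weight_memory_gb(27e9, 16)  # 54.0
int4_gb = weight_memory_gb(27e9, 4)   # 13.5

# Hypothetical shapes: 64 adapted square matrices of width 4096.
r16 = lora_trainable_params(4096, 4096, rank=16, n_matrices=64)  # 8_388_608
r32 = lora_trainable_params(4096, 4096, rank=32, n_matrices=64)  # 16_777_216
```

Quantizing the frozen base weights to 4 bits cuts their footprint by a factor of four, and the adapter adds only a small number of trainable parameters that grows linearly with the rank.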

Section 05

Application Recommendations and Conclusion: Choose the Right Solution Based on Scenarios

Application Scenario Recommendations

  • High-throughput real-time systems (e.g., search engines): Prioritize encoder solutions (low latency, deterministic output).
  • Flexible extraction needs (custom entity types): Choose generative solutions, but plan for parsing failures and handle them robustly.
  • Resource-constrained environments: QLoRA-fine-tuned small-to-medium LLMs (e.g., 4B parameters) are a reasonable compromise.

Conclusion

This study provides quantitative comparison data and a reproducible framework for the two solutions. The boundary between them may shift as the technology evolves, but practitioners should choose a solution based on their scenario rather than blindly following trends.