Zing Forum


Empirical Study on Few-Shot Learning of Large Language Models in Biomedical Named Entity Recognition

A systematic evaluation of 18 models from 9 architecture families reveals performance patterns of large language models in chemical and disease entity recognition tasks, finding that 8B parameter models achieve the best balance between efficiency and effectiveness.

Biomedical Named Entity Recognition · Large Language Models · Few-Shot Learning · BC5CDR · Chemical Entity Recognition · Disease Recognition · In-Context Learning · Model Efficiency
Published 2026-04-22 05:44 · Recent activity 2026-04-22 05:49 · Estimated read 6 min

Section 01

[Introduction] Core Summary of the Empirical Study on Few-Shot Learning of Large Language Models in Biomedical Named Entity Recognition

This paper systematically evaluates 18 models from 9 architecture families to explore the few-shot learning performance of large language models (LLMs) on Biomedical Named Entity Recognition (BioNER) tasks. Key findings: 8B-parameter models strike the best balance between efficiency and effectiveness; chemical entity recognition consistently outperforms disease entity recognition; and in-context learning saturates, so adding too many examples can degrade performance.


Section 02

Research Background and Challenges

Biomedical Named Entity Recognition (BioNER) is a core NLP task in the medical field, requiring accurate identification of chemical and disease entities. However, it faces issues such as complex morphology, numerous term variants, and ambiguity. Traditional methods rely on large amounts of manually labeled data and domain feature engineering. Few-shot learning with LLMs brings new possibilities, but the performance of LLMs in BioNER and its influencing factors (parameter scale, number of in-context examples, entity type, etc.) need systematic research.
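To make the few-shot setup concrete, here is a minimal sketch of how k in-context examples might be assembled into a single prompt. The template (instruction wording, `TYPE: mention` output format, and the `build_prompt` helper) is a hypothetical illustration, not the paper's actual prompt.

```python
# Hypothetical few-shot BioNER prompt builder; the paper's real
# template is not shown, so format details here are assumptions.

def build_prompt(examples: list[tuple[str, list[str]]], query: str) -> str:
    """Assemble k in-context examples followed by the query sentence."""
    parts = [
        "Extract all chemical and disease entities from the sentence.",
        "List one entity per line as TYPE: mention.",
    ]
    for sentence, entities in examples:
        parts.append(f"Sentence: {sentence}")
        parts.append("Entities:\n" + "\n".join(entities))
    parts.append(f"Sentence: {query}")
    parts.append("Entities:")
    return "\n".join(parts)

# One-shot (k=1) example with a fabricated demonstration pair:
demo = [("Naloxone reversed the hypotension.",
         ["CHEMICAL: Naloxone", "DISEASE: hypotension"])]
print(build_prompt(demo, "Lithium carbonate induced tremor in two patients."))
```

Varying k then simply means passing more or fewer demonstration pairs; the model completes the final `Entities:` line.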


Section 03

Experimental Design and Methods

This study evaluates 18 models (from 9 architecture families, with parameter sizes ranging from 1B to 70B) on the BC5CDR test set (500 articles). The vLLM inference engine and FastAPI middleware are used to ensure reproducibility. Seven in-context learning densities (k ∈ {0,1,2,4,8,16,32}) are designed, with micro-F1 as the main metric to calculate the recognition performance of chemical and disease entities separately.
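Micro-F1, the study's main metric, pools true positives, false positives, and false negatives across all documents before computing F1. A minimal sketch over `(type, mention)` sets (the exact matching criterion the paper uses, e.g. span offsets vs. surface strings, is an assumption here):

```python
# Micro-averaged F1 over per-document sets of (entity_type, mention)
# tuples; counts are pooled across the whole test set before F1 is taken.

def micro_f1(gold: list[set], pred: list[set]) -> float:
    """gold/pred: one set of (type, mention) tuples per document."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # correctly predicted entities
        fp += len(p - g)   # spurious predictions
        fn += len(g - p)   # missed gold entities
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = [{("CHEMICAL", "naloxone"), ("DISEASE", "hypotension")}]
pred = [{("CHEMICAL", "naloxone")}]
print(micro_f1(gold, pred))  # tp=1, fp=0, fn=1 -> P=1.0, R=0.5, F1≈0.667
```

Computing this separately for the chemical and disease subsets gives the per-entity-type scores reported later.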


Section 04

Key Finding: Balance Between Scale and Efficiency

Parameter scale is not the only determining factor. Meta-Llama-3.1-8B-Instruct (8B parameters) achieves an overall F1 score of 0.605, surpassing larger models (e.g., Qwen2.5-14B-Instruct, Yi-1.5-9B-Chat), which underscores the importance of pre-training data quality and instruction tuning. The jump from 8B to 70B parameters yields only a 2-3 point F1 improvement, making the 8B model the Pareto-optimal choice in hardware-constrained environments.


Section 05

Asymmetry in Entity Recognition and In-Context Saturation Effect

Asymmetry: All models perform better on chemical entity recognition than on disease entity recognition (chemical F1 range: 0.14-0.78; disease F1 range: 0.05-0.51). This is because chemical names follow regular naming patterns, while disease mentions require deeper semantic abstraction and disambiguation.

Saturation Effect: Few-shot examples improve performance only up to a threshold; beyond it, performance plateaus or declines (e.g., gemma-1.1-2b-it's F1 drops by 74.6% from k=8 to k=32). Models above 7B parameters show smaller decay (≤6%), with Qwen2.5-14B-Instruct the most stable (Δ = -0.3%).
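The decay figures above are relative F1 changes between two k settings. A one-line sketch of that computation (the numeric inputs below are illustrative placeholders, not values from the paper):

```python
# Relative F1 change when the in-context example count grows from a
# lower k to a higher k; negative values indicate saturation-induced decay.

def relative_change(f1_low_k: float, f1_high_k: float) -> float:
    """Percentage change in F1 going from the lower to the higher k."""
    return (f1_high_k - f1_low_k) / f1_low_k * 100

# Illustrative placeholders: a small model that collapses at high k
# vs. a large model that stays stable.
print(round(relative_change(0.40, 0.10), 1))   # -75.0 -> severe decay
print(round(relative_change(0.60, 0.598), 1))  # -0.3  -> stable
```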


Section 06

Error Pattern Analysis and Technical Implementation

Error Patterns: False negatives are far more common than false positives. Small architectures and high k values amplify omission bias (especially for disease categories), suggesting that decision thresholds need to be adjusted to balance precision and recall.
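The omission bias can be seen directly in how precision and recall respond to the error counts: when false negatives dominate false positives, recall falls well below precision. A small sketch with illustrative counts (not figures from the paper):

```python
# Precision/recall from raw error counts, illustrating the reported
# omission bias: FN >> FP pushes recall far below precision.

def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    return tp / (tp + fp), tp / (tp + fn)

# Illustrative counts: few spurious predictions, many missed mentions.
p, r = precision_recall(tp=400, fp=50, fn=350)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.89 recall=0.53
```

Lowering the effective decision threshold (e.g., prompting the model to list borderline candidates) trades some of that precision headroom for recall, which is the adjustment the error analysis suggests.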

Technical Implementation: The project provides a complete experimental framework (FastAPI middleware, multi-model consensus engine, evaluation pipeline, visualization tools) that supports multi-LLM consensus mechanisms such as voting, weighting, and cascading.


Section 07

Practical Implications and Future Directions

Practical Implications: The 8B model offers the best balance between efficiency and effectiveness; LLMs can be used directly for chemical entity recognition, while disease recognition requires additional domain adaptation; and the number of in-context examples should be tuned per model to avoid overloading the context.

Future Directions: Explore multi-model integration, domain-specific prompt engineering, and post-processing mechanisms combined with knowledge graphs to enhance the practical value of BioNER.