Zing Forum


Empirical Study on Few-Shot Learning of Large Language Models in Biomedical Named Entity Recognition

A systematic evaluation of 18 models from 9 architecture families reveals performance patterns of large language models in chemical and disease entity recognition tasks, finding that 8B parameter models achieve the best balance between efficiency and effectiveness.

Biomedical Named Entity Recognition · Large Language Models · Few-Shot Learning · BC5CDR · Chemical Entity Recognition · Disease Recognition · In-Context Learning · Model Efficiency
Published 2026-04-22 05:44 · Recent activity 2026-04-22 05:49 · Estimated read 6 min

Section 01

[Introduction] Core Summary of the Empirical Study on Few-Shot Learning of Large Language Models in Biomedical Named Entity Recognition

This paper systematically evaluates 18 models from 9 architecture families to explore the few-shot learning performance of large language models (LLMs) on Biomedical Named Entity Recognition (BioNER) tasks. Key findings: 8B-parameter models strike the best balance between efficiency and effectiveness; chemical entity recognition consistently outperforms disease entity recognition; and in-context learning saturates, so adding too many examples can degrade performance.


Section 02

Research Background and Challenges

Biomedical Named Entity Recognition (BioNER) is a core NLP task in the medical field, requiring accurate identification of chemical and disease entities. However, it faces issues such as complex morphology, numerous term variants, and ambiguity. Traditional methods rely on large amounts of manually labeled data and domain feature engineering. Few-shot learning with LLMs brings new possibilities, but the performance of LLMs in BioNER and its influencing factors (parameter scale, number of in-context examples, entity type, etc.) need systematic research.
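To make the few-shot setup concrete, here is a minimal sketch of how k in-context examples might be assembled into a single prompt. The template (instruction wording, `TYPE: mention` output format, and the `build_prompt` helper) is a hypothetical illustration, not the paper's actual prompt.

```python
# Hypothetical few-shot BioNER prompt builder; the paper's real
# template is not shown, so format details here are assumptions.

def build_prompt(examples: list[tuple[str, list[str]]], query: str) -> str:
    """Assemble k in-context examples followed by the query sentence."""
    parts = [
        "Extract all chemical and disease entities from the sentence.",
        "List one entity per line as TYPE: mention.",
    ]
    for sentence, entities in examples:
        parts.append(f"Sentence: {sentence}")
        parts.append("Entities:\n" + "\n".join(entities))
    parts.append(f"Sentence: {query}")
    parts.append("Entities:")
    return "\n".join(parts)

# One-shot (k=1) example with a fabricated demonstration pair:
demo = [("Naloxone reversed the hypotension.",
         ["CHEMICAL: Naloxone", "DISEASE: hypotension"])]
print(build_prompt(demo, "Lithium carbonate induced tremor in two patients."))
```

Varying k then simply means passing more or fewer demonstration pairs; the model completes the final `Entities:` line.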


Section 03

Experimental Design and Methods

This study evaluates 18 models (from 9 architecture families, with parameter sizes ranging from 1B to 70B) on the BC5CDR test set (500 articles). The vLLM inference engine and FastAPI middleware are used to ensure reproducibility. Seven in-context learning densities (k ∈ {0,1,2,4,8,16,32}) are designed, with micro-F1 as the main metric to calculate the recognition performance of chemical and disease entities separately.
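Micro-F1, the study's main metric, pools true positives, false positives, and false negatives across all documents before computing F1. A minimal sketch over `(type, mention)` sets (the exact matching criterion the paper uses, e.g. span offsets vs. surface strings, is an assumption here):

```python
# Micro-averaged F1 over per-document sets of (entity_type, mention)
# tuples; counts are pooled across the whole test set before F1 is taken.

def micro_f1(gold: list[set], pred: list[set]) -> float:
    """gold/pred: one set of (type, mention) tuples per document."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # correctly predicted entities
        fp += len(p - g)   # spurious predictions
        fn += len(g - p)   # missed gold entities
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = [{("CHEMICAL", "naloxone"), ("DISEASE", "hypotension")}]
pred = [{("CHEMICAL", "naloxone")}]
print(micro_f1(gold, pred))  # tp=1, fp=0, fn=1 -> P=1.0, R=0.5, F1≈0.667
```

Computing this separately for the chemical and disease subsets gives the per-entity-type scores reported later.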


Section 04

Key Finding: Balance Between Scale and Efficiency

Parameter scale is not the only determining factor. Meta-Llama-3.1-8B-Instruct (8B parameters) achieves an overall F1 score of 0.605, surpassing larger models (e.g., Qwen2.5-14B-Instruct, Yi-1.5-9B-Chat), which underscores the importance of pre-training data quality and instruction tuning. The jump from 8B to 70B parameters yields only a 2-3 point F1 improvement, making the 8B model the Pareto-optimal choice in hardware-constrained environments.


Section 05

Asymmetry in Entity Recognition and In-Context Saturation Effect

Asymmetry: All models perform better on chemical entity recognition than on disease entity recognition (chemical F1 range: 0.14-0.78; disease F1 range: 0.05-0.51). This is because chemical names follow regular naming patterns, while disease mentions require deeper semantic abstraction and disambiguation.

Saturation Effect: Few-shot examples improve performance only up to a threshold; beyond it, performance plateaus or declines (e.g., gemma-1.1-2b-it's F1 drops by 74.6% from k=8 to k=32). Models above 7B parameters show smaller decay (≤6%), with Qwen2.5-14B-Instruct the most stable (Δ = -0.3%).
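The decay figures above are relative F1 changes between two k settings. A one-line sketch of that computation (the numeric inputs below are illustrative placeholders, not values from the paper):

```python
# Relative F1 change when the in-context example count grows from a
# lower k to a higher k; negative values indicate saturation-induced decay.

def relative_change(f1_low_k: float, f1_high_k: float) -> float:
    """Percentage change in F1 going from the lower to the higher k."""
    return (f1_high_k - f1_low_k) / f1_low_k * 100

# Illustrative placeholders: a small model that collapses at high k
# vs. a large model that stays stable.
print(round(relative_change(0.40, 0.10), 1))   # -75.0 -> severe decay
print(round(relative_change(0.60, 0.598), 1))  # -0.3  -> stable
```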


Section 06

Error Pattern Analysis and Technical Implementation

Error Patterns: False negatives are far more common than false positives. Small architectures and high k values amplify omission bias (especially for disease categories), suggesting that decision thresholds need to be adjusted to balance precision and recall.
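The omission bias can be seen directly in how precision and recall respond to the error counts: when false negatives dominate false positives, recall falls well below precision. A small sketch with illustrative counts (not figures from the paper):

```python
# Precision/recall from raw error counts, illustrating the reported
# omission bias: FN >> FP pushes recall far below precision.

def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    return tp / (tp + fp), tp / (tp + fn)

# Illustrative counts: few spurious predictions, many missed mentions.
p, r = precision_recall(tp=400, fp=50, fn=350)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.89 recall=0.53
```

Lowering the effective decision threshold (e.g., prompting the model to list borderline candidates) trades some of that precision headroom for recall, which is the adjustment the error analysis suggests.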

Technical Implementation: The project provides a complete experimental framework (FastAPI middleware, multi-model consensus engine, evaluation pipeline, visualization tools) that supports multi-LLM consensus mechanisms such as voting, weighting, and cascading.


Section 07

Practical Implications and Future Directions

Practical Implications: The 8B model offers the best balance between efficiency and effectiveness; LLMs can be used directly for chemical entity recognition, while disease recognition requires additional domain adaptation; and the number of in-context examples should be tuned per model to avoid overloading the context.

Future Directions: Explore multi-model integration, domain-specific prompt engineering, and post-processing mechanisms combined with knowledge graphs to enhance the practical value of BioNER.