Zing Forum

RAG Practice of Open-source LLMs in Biomedical Data Retrieval: A Multi-model Comparative Study

A master's thesis research project that builds a RAG system for microbiome sample data, compares the retrieval-augmented generation capabilities of four language models (GPT, Llama, OLMo, Pythia), and conducts multi-dimensional evaluation using the RAGAS framework.

Tags: RAG · LLM · biomedical · microbiome · FAISS · RAGAS · open-source
Published 2026-04-30 17:36 · Recent activity 2026-04-30 17:50 · Estimated read 7 min

Section 01

[Introduction] RAG Practice of Open-source LLMs in Biomedical Data Retrieval: A Multi-model Comparative Study

This study builds a RAG system for microbiome sample data, compares the retrieval-augmented generation capabilities of four language models—GPT (closed-source), Llama, OLMo, and Pythia—and conducts multi-dimensional evaluation using the RAGAS framework. It aims to address the complexity of data querying in the biomedical field, explore the application potential of open-source LLMs in professional scenarios, and provide reusable RAG templates and evaluation methodologies for the field.


Section 02

Research Background and Motivation

In the biomedical field, managing and querying microbiome and biodiversity data poses complex challenges: traditional database queries require professional SQL knowledge; natural-language interfaces can lower that threshold, but applying LLMs directly suffers from hallucinations and stale knowledge. RAG combines external knowledge bases with LLMs to balance naturalness and accuracy. This study explores RAG for microbiome data querying and compares the performance of open-source and closed-source models.


Section 03

System Architecture Design

The RAG system adopts a modular layered architecture:

  • Data Layer: Processes VDP sample metadata and Fujita biodiversity data, links to NCBI taxonomic ontology (1GB), uses FAISS to store species vectors (similarity retrieval), and DuckDB to store structured metadata.
  • Retrieval Layer: Extracts key entities from the user's question, embeds them with the intfloat/multilingual-e5-large model, and retrieves the relevant context.
  • Generation Layer: Compares four models: OpenAI GPT (closed-source), Meta Llama3.2-1B (lightweight open-source), AI2 OLMo (fully open-source), and EleutherAI Pythia-2 (research-oriented).
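The retrieval layer described above can be sketched in a few lines. This is a toy illustration, not the project's code: a hypothetical stand-in embedder replaces intfloat/multilingual-e5-large, and a brute-force inner-product search over unit vectors stands in for a FAISS flat index, which behaves the same way.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 8) -> np.ndarray:
    """Hypothetical stand-in embedder: a deterministic pseudo-random unit vector.
    The real system would call the multilingual-e5-large model instead."""
    seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

# Illustrative species documents; the real data layer holds VDP/Fujita records.
species_docs = [
    "Escherichia coli sample metadata",
    "Bacteroides fragilis abundance record",
    "Lactobacillus reuteri biodiversity entry",
]
index = np.stack([embed(d) for d in species_docs])  # rows = document vectors

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k most similar documents by inner product (cosine, since
    all vectors are unit-normalized), mimicking a FAISS flat-index search."""
    scores = index @ embed(question)
    top = np.argsort(scores)[::-1][:k]
    return [species_docs[i] for i in top]

print(retrieve("Which samples contain Escherichia coli?"))
```

Swapping the brute-force search for an actual FAISS index changes only the storage and lookup calls, not this overall shape.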

Section 04

RAGAS Evaluation Framework

The RAGAS framework evaluates along five dimensions:

  1. Faithfulness: whether the statements in the generated answer are grounded in the retrieved context (detects hallucinations);
  2. Answer Relevance: whether the answer directly addresses the question;
  3. Context Recall: whether the retrieved context covers the necessary information;
  4. Context Precision: the proportion of relevant information in the retrieval results;
  5. Answer Correctness: semantic similarity to manually labeled reference answers (an end-to-end metric).
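RAGAS delegates the relevance judgments behind these metrics to an LLM judge, so its scores are not reproducible in a few lines. As a self-contained illustration of the rank-aware idea behind context precision, here is a toy version that takes the per-chunk relevance verdicts as given:

```python
def context_precision(relevant_flags: list[bool]) -> float:
    """Rank-aware precision over retrieved chunks (illustrative only).

    relevant_flags[k] is True if the chunk at rank k was judged relevant
    (a judgment RAGAS obtains from an LLM). The score averages
    precision@k over the relevant ranks, so relevant chunks that appear
    early in the ranking are rewarded.
    """
    hits, score = 0, 0.0
    for k, rel in enumerate(relevant_flags, start=1):
        if rel:
            hits += 1
            score += hits / k  # precision@k at this relevant rank
    return score / hits if hits else 0.0

# Relevant chunks at ranks 1 and 3 of 3 retrieved:
print(context_precision([True, False, True]))  # (1/1 + 2/3) / 2 ≈ 0.833
```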

Section 05

Implementation Details and Workflow

Workflow:

  1. Data Preparation: Run download_prep.py to download the NCBI ontology and build the FAISS index (run once);
  2. Data Ingestion: Load the datasets via full_pipeline2.ipynb, parse species names, generate embeddings, and store them in DuckDB/FAISS;
  3. Model Inference: Four independent scripts (e.g., RAGgpt.py), one per model; GPT requires an API key, the others run locally (Llama requires Hugging Face authorization);
  4. Evaluation: RAGASeval.py takes the RAG instances, question lists, and reference answers, and outputs an evaluation report.
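The inference step that each of the four model scripts performs can be sketched as the same retrieve-then-generate loop. In this sketch, `retrieve` and `generate` are hypothetical stand-ins (the real ones would query FAISS/DuckDB and call the respective model), and the prompt template is an assumption, not the thesis's actual prompt:

```python
# Hedged sketch of the shared RAG inference loop, with stubbed components.
PROMPT = (
    "Answer using only the context below.\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)

def retrieve(question: str) -> list[str]:
    """Stub: the real version searches the FAISS index / DuckDB tables."""
    return ["VDP sample 12 contains Escherichia coli."]

def generate(prompt: str) -> str:
    """Stub: the real version calls GPT, Llama, OLMo, or Pythia."""
    return "Sample 12."

def answer(question: str) -> dict:
    """One RAG inference: retrieve context, fill the prompt, generate.
    The returned record keeps question/contexts/answer together, the
    shape a downstream RAGAS evaluation consumes."""
    contexts = retrieve(question)
    prompt = PROMPT.format(context="\n".join(contexts), question=question)
    return {"question": question, "contexts": contexts,
            "answer": generate(prompt)}

record = answer("Which sample contains Escherichia coli?")
print(record["answer"])
```

Because only `generate` differs between the four scripts, comparing models reduces to swapping that one function while holding retrieval fixed.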

Section 06

Technical Selection Considerations

The technology choices balance practicality:

  • FAISS rather than a dedicated vector service: reduces deployment complexity and suits academic settings;
  • DuckDB for structured data: lightweight and well integrated with the Python ecosystem;
  • A multilingual embedding model: supports mixed queries of Latin names and regional common names;
  • 10-fold leave-one-tool-out cross-validation: ensures generalization and avoids overfitting to specific tool types.
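The last bullet's splitting scheme, holding out one tool type per fold, can be sketched in pure Python. The function name and the example groups are illustrative, not taken from the thesis:

```python
from collections import defaultdict

def leave_one_group_out(samples: list[tuple[str, str]]):
    """Yield (held_out_group, train, test) splits, one fold per group.

    samples are (group, item) pairs; with 10 tool types this produces
    the 10 folds described above. Minimal sketch of the idea only.
    """
    by_group = defaultdict(list)
    for group, item in samples:
        by_group[group].append(item)
    for held_out in by_group:
        test = by_group[held_out]
        train = [item for group, item in samples if group != held_out]
        yield held_out, train, test

# Hypothetical tool-type groups:
data = [("sql", "q1"), ("sql", "q2"), ("vector", "q3"), ("hybrid", "q4")]
for group, train, test in leave_one_group_out(data):
    print(group, len(train), len(test))
```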

Section 07

Application Value and Insights

Research Value:

  • Provides a reusable RAG implementation template for bioinformatics (clear code, complete documentation);
  • Demonstrates the potential of open-source LLMs: after tuning, their performance in vertical domains is acceptable, with advantages in both data privacy and cost;
  • Offers a RAGAS reference: a quality-assurance methodology for RAG deployments in high-precision fields such as medicine and law.