Zing Forum


Research on RAG Applications of Open-Source Large Language Models in Microbiome Data Retrieval

An academic research project comparing the performance of four open-source or open-access large language models (OpenAI GPT, Meta Llama 3.2-1B, AI2 OLMo, EleutherAI Pythia-2) in RAG pipelines for microbiome/biodiversity sample-data retrieval tasks, with comprehensive evaluation using the RAGAS framework.

Tags: RAG · Large Language Models · Microbiome · Biodiversity · NCBI · FAISS · RAGAS Evaluation · Open-Source Models · Scientific Data Retrieval
Published 2026-04-30 17:36 · Recent activity 2026-04-30 17:58 · Estimated read: 14 min

Section 01

Introduction: Research on RAG Applications of Open-Source Large Language Models in Microbiome Data Retrieval

Core Research Content

This research project focuses on the performance of open-source and open-access large language models in RAG pipelines for microbiome/biodiversity sample-data retrieval tasks. It compares four representative models (OpenAI GPT, Meta Llama 3.2-1B, AI2 OLMo, EleutherAI Pythia-2) and uses the RAGAS framework for comprehensive evaluation.

Research Objectives

The study explores how different large language models perform in RAG tasks, and whether open-source models can meet the knowledge-retrieval needs of specialized fields, providing a reference for practical deployment decisions.

Key Technical Components

A shared retrieval layer (FAISS vector search, DuckDB structured storage, and the intfloat/multilingual-e5-large embedding model) ensures evaluation fairness. Data sources include sample metadata and taxonomic information from the VDP tables and the Fujita dataset.


Section 02

Research Background: Intelligent Needs for Biological Data Retrieval

Microbiome research and biodiversity analysis generate massive structured data, including sample metadata, taxonomic information, and feature tables. These data are usually stored in complex tables and databases. For non-professional users or scenarios requiring quick information access, traditional SQL queries or manual retrieval methods are inefficient.

Retrieval-Augmented Generation (RAG) technology offers a new way to solve this problem. By converting natural language queries into precise retrieval over structured data, RAG systems allow researchers to ask questions in everyday language and get accurate answers based on actual data. However, how do different large language models perform in RAG tasks? Can open-source models handle knowledge retrieval in specialized fields?

This 2025 academic research project aims to answer these questions. It builds a complete RAG pipeline, evaluates four different open-source large language models on microbiome/biodiversity datasets, and uses the RAGAS framework for comprehensive quality assessment.


Section 03

Project Overview: Multi-Model RAG Benchmarking

The core of this research project is a comparable RAG pipeline architecture where all models share the same retrieval layer and only replace the generation component. This design ensures the fairness of evaluation results—any performance difference can be attributed to the model's own characteristics rather than changes in retrieval quality.

Four Models Evaluated

The study selected four representative open-source or open-access models:

  1. OpenAI GPT: A commercial model accessed via API, used as a performance benchmark
  2. Meta Llama 3.2-1B: A lightweight open-source model suitable for resource-constrained environments
  3. AI2 OLMo: A fully open-source academic model with transparent training process
  4. EleutherAI Pythia-2: A member of the research-oriented model family

This diverse model selection covers different orientations such as commercial API, lightweight open-source, and academic transparency, providing comprehensive references for practical deployment decisions.

Shared Retrieval Architecture

All RAG pipelines share the following technical components:

  • Vector search: FAISS for efficient similarity retrieval
  • Structured metadata storage: DuckDB for sample metadata queries
  • Embedding model: intfloat/multilingual-e5-large, supporting multilingual text
  • Data sources: Sample metadata, taxonomic information, and feature tables from VDP tables and the Fujita dataset

This unified infrastructure design ensures the comparability of experiments.
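To make the shared retrieval layer concrete, here is a minimal toy sketch of what it does: embed a query, rank documents by inner product over normalized vectors (the comparison FAISS performs with an inner-product index), then look up metadata for the top hits. NumPy and a plain dict stand in for FAISS and DuckDB; the document texts, vectors, and the `retrieve()` helper are all hypothetical.

```python
# Toy sketch of the shared retrieval layer. NumPy and a dict stand in for
# FAISS and DuckDB so the example is self-contained; all data is made up.
import numpy as np

def normalize(v):
    # e5-style embeddings are compared via cosine similarity,
    # i.e. inner product over L2-normalized vectors.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
doc_vectors = normalize(rng.normal(size=(4, 8)))   # stand-in for docs.faiss
metadata = {                                        # stand-in for DuckDB rows
    0: "sample S1: soil, Bacillus subtilis",
    1: "sample S2: gut, Escherichia coli",
    2: "sample S3: marine, Prochlorococcus",
    3: "sample S4: soil, Streptomyces",
}

def retrieve(query_vector, top_k=2):
    scores = doc_vectors @ normalize(query_vector)  # inner-product search
    top = np.argsort(scores)[::-1][:top_k]          # best-scoring documents
    return [(int(i), metadata[int(i)]) for i in top]

hits = retrieve(rng.normal(size=8))
```

Because every model consumes the same `hits`, any downstream quality difference is attributable to generation, which is the point of the shared-layer design.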


Section 04

Technical Implementation: From Data Preparation to Evaluation Process

The project's codebase clearly organizes each stage of the RAG pipeline:

Data Preparation Phase

The download_prep.py script handles initial data preparation:

  • Download NCBI Taxonomy ontology (about 1GB)
  • Build FAISS indexes for 9 different sentence transformer encoders
  • Cache results as .pkl and .index files to speed up subsequent runs

This step is the foundation of the entire process: the NCBI ontology provides an authoritative taxonomic knowledge base, and building indexes for multiple encoders makes it possible to select the best embedding model in later steps.
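The caching idea behind this step can be sketched as follows: compute each encoder's index once, persist it, and reuse it on later runs. This is a hedged illustration only; `build_index()`, the encoder names, and the `.npy` cache files are hypothetical stand-ins (the real script caches `.pkl` and FAISS `.index` files).

```python
# Hedged sketch of the build-once, cache-to-disk pattern from data prep.
# All names here are illustrative, not the project's actual code.
import os
import tempfile
import numpy as np

CACHE_DIR = tempfile.mkdtemp()
ENCODERS = ["encoder-a", "encoder-b"]  # the project builds indexes for 9

def build_index(encoder_name, texts):
    # Placeholder for "encode texts with this sentence transformer";
    # a real implementation would run the encoder model here.
    rng = np.random.default_rng(sum(map(ord, encoder_name)))
    return rng.normal(size=(len(texts), 8))

def load_or_build(encoder_name, texts):
    path = os.path.join(CACHE_DIR, f"{encoder_name}.npy")
    if os.path.exists(path):                 # cache hit: skip recomputation
        return np.load(path)
    index = build_index(encoder_name, texts)
    np.save(path, index)                     # cache miss: build and persist
    return index

texts = ["sample one", "sample two", "sample three"]
indexes = {name: load_or_build(name, texts) for name in ENCODERS}
```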

Data Ingestion Pipeline

The full_pipeline2.ipynb notebook executes the core data ingestion process:

  • Load and normalize sample metadata from VDP and Fujita tables
  • Use FAISS similarity search to match taxonomic names in samples with the NCBI ontology
  • Generate embedding vectors and store them in DuckDB (document_vectors.db) and FAISS (docs.faiss)
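The taxonomy-matching step above can be illustrated with a toy nearest-neighbor search: embed both the messy sample name and the canonical ontology names, then take the closest ontology entry. Character-bigram count vectors stand in for the real e5 embeddings, and `match_taxon()` is a hypothetical helper, not the notebook's actual code.

```python
# Toy illustration of matching a noisy sample taxon name to a canonical
# NCBI name by nearest-neighbor similarity search. All data is made up.
import numpy as np

def bigram_vector(name, dim=64):
    # Hash character bigrams into a fixed-size count vector, then normalize.
    v = np.zeros(dim)
    s = name.lower()
    for a, b in zip(s, s[1:]):
        v[(ord(a) * 31 + ord(b)) % dim] += 1.0
    return v / (np.linalg.norm(v) or 1.0)

ontology = ["Escherichia coli", "Bacillus subtilis", "Streptomyces griseus"]
ontology_matrix = np.stack([bigram_vector(n) for n in ontology])

def match_taxon(raw_name):
    scores = ontology_matrix @ bigram_vector(raw_name)  # cosine similarities
    return ontology[int(np.argmax(scores))]

match = match_taxon("escherichia  coli.")  # noisy spelling from a sample table
```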

RAG Pipeline Implementation

Each model has an independent RAG implementation script (RAGgpt.py, RAGllama.py, RAGolmo.py, RAGpythia2.py). These scripts share the same database connection and retrieval logic but use different models for answer generation.
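The "shared retrieval, swappable generation" structure of those scripts can be sketched like this. The `retrieve()` stub and the two `generate_*` callables are hypothetical stand-ins for the real FAISS/DuckDB lookup and model calls; only the wiring pattern reflects the design described above.

```python
# Sketch of shared retrieval with a pluggable generator per model.
def retrieve(question, top_k=3):
    # In the real pipeline this queries the shared FAISS index and DuckDB.
    return ["context passage 1", "context passage 2", "context passage 3"][:top_k]

def rag_answer(question, generate):
    contexts = retrieve(question)              # identical for every model
    prompt = "Context:\n" + "\n".join(contexts) + f"\n\nQuestion: {question}"
    return generate(prompt)                    # only this part differs

def generate_llama(prompt):                    # stand-in for a Llama call
    return f"[llama] answer grounded in {prompt.count('passage')} passages"

def generate_pythia(prompt):                   # stand-in for a Pythia call
    return f"[pythia] answer grounded in {prompt.count('passage')} passages"

answers = {name: rag_answer("Which samples contain Bacillus?", gen)
           for name, gen in [("llama", generate_llama),
                             ("pythia", generate_pythia)]}
```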

RAGAS Evaluation Framework

RAGASeval.py provides a unified evaluation tool using five core metrics of the RAGAS framework:

  • Faithfulness: Are the claims in the answer grounded in the retrieved context?
  • Answer Relevancy: Is the answer relevant to the question?
  • Context Recall: Does the retrieved context cover the ground-truth answer?
  • Context Precision: Does the retrieved context avoid irrelevant passages?
  • Answer Correctness: How close is the answer to the ground-truth answer?

These five metrics comprehensively evaluate the performance of the RAG system from two dimensions: retrieval quality and generation quality.
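To make two of these metrics concrete, here are hand-rolled, word-overlap approximations of what faithfulness and context precision measure. RAGAS itself computes these with LLM judgments, so the functions below are illustrations of the idea only, and all inputs are made up.

```python
# Toy, word-overlap versions of two RAGAS-style metrics (illustrative only;
# RAGAS uses LLM-based scoring, not simple set intersection).
def faithfulness_like(answer, contexts):
    # Fraction of answer "claims" (here: sentences) whose words all
    # appear somewhere in the retrieved context.
    context_words = set(" ".join(contexts).lower().split())
    claims = [s.strip() for s in answer.split(".") if s.strip()]
    supported = sum(1 for c in claims
                    if set(c.lower().split()) <= context_words)
    return supported / len(claims) if claims else 0.0

def context_precision_like(contexts, ground_truth):
    # Fraction of retrieved passages sharing any word with the true answer.
    truth_words = set(ground_truth.lower().split())
    relevant = sum(1 for c in contexts
                   if set(c.lower().split()) & truth_words)
    return relevant / len(contexts) if contexts else 0.0

contexts = ["sample s1 contains bacillus subtilis", "unrelated marine record"]
score_f = faithfulness_like(
    "sample s1 contains bacillus subtilis. it is marine.", contexts)
score_p = context_precision_like(contexts, "bacillus subtilis in sample s1")
```

Here the second answer sentence ("it is marine.") is unsupported and the second retrieved passage is irrelevant, so both toy scores come out at 0.5.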


Section 05

Configuration and Customizability

The project provides a flexible configuration mechanism; key parameters can be adjusted during initialization:

  • DuckDB path: document_vectors.db
  • FAISS index path: docs.faiss
  • Embedding model: intfloat/multilingual-e5-large
  • Retrieval Top-K: Default 50
  • OpenAI API key: Environment variable OPENAI_API_KEY

This design allows researchers to easily try different embedding models, adjust retrieval parameters, or adapt to different datasets.
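The parameters listed above might be grouped as a small config object; a minimal sketch follows. The field names mirror the document's defaults, but `RAGConfig` itself is a hypothetical wrapper, not the project's actual class.

```python
# Minimal sketch of the configurable parameters as a dataclass
# (hypothetical wrapper; field defaults taken from the document).
import os
from dataclasses import dataclass

@dataclass
class RAGConfig:
    duckdb_path: str = "document_vectors.db"
    faiss_index_path: str = "docs.faiss"
    embedding_model: str = "intfloat/multilingual-e5-large"
    top_k: int = 50

    @classmethod
    def from_env(cls, **overrides):
        # The OpenAI key is read from the environment, as in the project setup.
        cfg = cls(**overrides)
        cfg.openai_api_key = os.environ.get("OPENAI_API_KEY", "")
        return cfg

config = RAGConfig.from_env(top_k=10)  # e.g. a smaller retrieval window
```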


Section 06

Research Value and Application Scenarios

This research is a useful reference for several audiences:

Academic Research

For scholars engaged in bioinformatics, microbiome research, or biodiversity analysis, this project demonstrates how to apply modern NLP technology to professional field data retrieval. The use of the RAGAS evaluation framework also provides methodological references for similar studies.

Enterprise Applications

For enterprises that need to process large amounts of structured scientific data (such as pharmaceutical companies, agricultural technology companies), this research compares the performance gap between commercial APIs and open-source models, providing data support for technology selection decisions.

Model Developers

For open-source model developers, this research provides benchmark test results in a specific vertical field (biological data retrieval), helping to identify the model's strengths and improvement directions.

Educational Use

The Jupyter notebook format and clear code structure of the project make it an excellent teaching resource for learning RAG system construction and evaluation.


Section 07

Limitations and Future Research Directions

As an academic research project, it has some limitations:

  • Dataset size: Uses relatively small VDP and Fujita datasets; performance on larger-scale data needs to be verified
  • Domain specificity: Focuses on microbiome/biodiversity data; migration to other scientific fields requires additional work
  • Model scope: Only four models are evaluated; a wider comparison (including newer model versions) would provide more valuable insights

Future research directions may include:

  • Expanding to more diverse biomedical datasets
  • Introducing more advanced retrieval technologies (such as multi-vector retrieval, query rewriting)
  • Exploring fine-tuning strategies to improve domain-specific performance
  • Developing an interactive web interface to lower the usage threshold

Section 08

Conclusion

This research project provides valuable empirical data for the application of open-source large language models in professional scientific data retrieval. Through systematic RAGAS evaluation, it not only compares the performance of different models but also demonstrates how to build a reproducible and comparable RAG evaluation framework.

For developers and researchers considering applying RAG technology to vertical fields, this project provides a complete reference implementation covering the entire process from data preparation to quality evaluation. In today's era of rapid improvement in open-source model capabilities, such benchmark research has important practical significance for guiding actual technology selection.