Zing Forum


Research on RAG Applications of Open-Source Large Language Models in Microbiome Data Retrieval

An academic research project comparing the performance of four open-source or open-access large language models (OpenAI GPT, Meta Llama 3.2-1B, AI2 OLMo, EleutherAI Pythia-2) in RAG pipelines for microbiome/biodiversity sample-data retrieval tasks, with comprehensive evaluation using the RAGAS framework.

Tags: RAG · Large Language Models · Microbiome · Biodiversity · NCBI · FAISS · RAGAS Evaluation · Open-Source Models · Scientific Data Retrieval
Published 2026-04-30 17:36 · Recent activity 2026-04-30 17:58 · Estimated read: 14 min

Section 01

Introduction: Research on RAG Applications of Open-Source Large Language Models in Microbiome Data Retrieval

Core Research Content

This research project focuses on the performance of open-source and open-access large language models in RAG pipelines for microbiome/biodiversity sample-data retrieval tasks. It compares four representative models (OpenAI GPT, Meta Llama 3.2-1B, AI2 OLMo, EleutherAI Pythia-2) and uses the RAGAS framework for comprehensive evaluation.

Research Objectives

The study explores how different large language models perform in RAG tasks, and whether open-source models can meet the knowledge-retrieval needs of specialized fields, providing a reference for practical deployment decisions.

Key Technical Components

A shared retrieval layer (FAISS vector search, DuckDB structured storage, and the intfloat/multilingual-e5-large embedding model) ensures evaluation fairness. Data sources include sample metadata and taxonomic information from the VDP tables and the Fujita dataset.


Section 02

Research Background: Intelligent Needs for Biological Data Retrieval

Microbiome research and biodiversity analysis generate massive structured data, including sample metadata, taxonomic information, and feature tables. These data are usually stored in complex tables and databases. For non-professional users or scenarios requiring quick information access, traditional SQL queries or manual retrieval methods are inefficient.

Retrieval-Augmented Generation (RAG) technology offers a new way to solve this problem. By converting natural language queries into precise retrieval over structured data, RAG systems allow researchers to ask questions in everyday language and get accurate answers based on actual data. However, how do different large language models perform in RAG tasks? Can open-source models handle knowledge retrieval in specialized fields?

This 2025 academic research project aims to answer these questions. It builds a complete RAG pipeline, evaluates four different open-source large language models on microbiome/biodiversity datasets, and uses the RAGAS framework for comprehensive quality assessment.


Section 03

Project Overview: Multi-Model RAG Benchmarking

The core of this research project is a comparable RAG pipeline architecture where all models share the same retrieval layer and only replace the generation component. This design ensures the fairness of evaluation results—any performance difference can be attributed to the model's own characteristics rather than changes in retrieval quality.

Four Models Evaluated

The study selected four representative open-source or open-access models:

  1. OpenAI GPT: A commercial model accessed via API, used as a performance benchmark
  2. Meta Llama 3.2-1B: A lightweight open-source model suitable for resource-constrained environments
  3. AI2 OLMo: A fully open-source academic model with transparent training process
  4. EleutherAI Pythia-2: A member of the research-oriented model family

This diverse model selection covers different orientations such as commercial API, lightweight open-source, and academic transparency, providing comprehensive references for practical deployment decisions.

Shared Retrieval Architecture

All RAG pipelines share the following technical components:

  • Vector search: FAISS for efficient similarity retrieval
  • Structured metadata storage: DuckDB for sample metadata queries
  • Embedding model: intfloat/multilingual-e5-large, supporting multilingual text
  • Data sources: Sample metadata, taxonomic information, and feature tables from VDP tables and the Fujita dataset

This unified infrastructure design ensures the comparability of experiments.
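To make the shared retrieval layer concrete, here is a minimal toy sketch of what it does: embed a query, rank documents by inner product over normalized vectors (the comparison FAISS performs with an inner-product index), then look up metadata for the top hits. NumPy and a plain dict stand in for FAISS and DuckDB; the document texts, vectors, and the `retrieve()` helper are all hypothetical.

```python
# Toy sketch of the shared retrieval layer. NumPy and a dict stand in for
# FAISS and DuckDB so the example is self-contained; all data is made up.
import numpy as np

def normalize(v):
    # e5-style embeddings are compared via cosine similarity,
    # i.e. inner product over L2-normalized vectors.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
doc_vectors = normalize(rng.normal(size=(4, 8)))   # stand-in for docs.faiss
metadata = {                                        # stand-in for DuckDB rows
    0: "sample S1: soil, Bacillus subtilis",
    1: "sample S2: gut, Escherichia coli",
    2: "sample S3: marine, Prochlorococcus",
    3: "sample S4: soil, Streptomyces",
}

def retrieve(query_vector, top_k=2):
    scores = doc_vectors @ normalize(query_vector)  # inner-product search
    top = np.argsort(scores)[::-1][:top_k]          # best-scoring documents
    return [(int(i), metadata[int(i)]) for i in top]

hits = retrieve(rng.normal(size=8))
```

Because every model consumes the same `hits`, any downstream quality difference is attributable to generation, which is the point of the shared-layer design.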


Section 04

Technical Implementation: From Data Preparation to Evaluation Process

The project's codebase clearly organizes each stage of the RAG pipeline:

Data Preparation Phase

The download_prep.py script handles initial data preparation:

  • Download NCBI Taxonomy ontology (about 1GB)
  • Build FAISS indexes for 9 different sentence transformer encoders
  • Cache results as .pkl and .index files to speed up subsequent runs

This step is the foundation of the entire process: the NCBI ontology provides an authoritative taxonomic knowledge base, and building indexes for multiple encoders makes it possible to select the best embedding model in later steps.
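The caching idea behind this step can be sketched as follows: compute each encoder's index once, persist it, and reuse it on later runs. This is a hedged illustration only; `build_index()`, the encoder names, and the `.npy` cache files are hypothetical stand-ins (the real script caches `.pkl` and FAISS `.index` files).

```python
# Hedged sketch of the build-once, cache-to-disk pattern from data prep.
# All names here are illustrative, not the project's actual code.
import os
import tempfile
import numpy as np

CACHE_DIR = tempfile.mkdtemp()
ENCODERS = ["encoder-a", "encoder-b"]  # the project builds indexes for 9

def build_index(encoder_name, texts):
    # Placeholder for "encode texts with this sentence transformer";
    # a real implementation would run the encoder model here.
    rng = np.random.default_rng(sum(map(ord, encoder_name)))
    return rng.normal(size=(len(texts), 8))

def load_or_build(encoder_name, texts):
    path = os.path.join(CACHE_DIR, f"{encoder_name}.npy")
    if os.path.exists(path):                 # cache hit: skip recomputation
        return np.load(path)
    index = build_index(encoder_name, texts)
    np.save(path, index)                     # cache miss: build and persist
    return index

texts = ["sample one", "sample two", "sample three"]
indexes = {name: load_or_build(name, texts) for name in ENCODERS}
```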

Data Ingestion Pipeline

The full_pipeline2.ipynb notebook executes the core data ingestion process:

  • Load and normalize sample metadata from VDP and Fujita tables
  • Use FAISS similarity search to match taxonomic names in samples with the NCBI ontology
  • Generate embedding vectors and store them in DuckDB (document_vectors.db) and FAISS (docs.faiss)
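The taxonomy-matching step above can be illustrated with a toy nearest-neighbor search: embed both the messy sample name and the canonical ontology names, then take the closest ontology entry. Character-bigram count vectors stand in for the real e5 embeddings, and `match_taxon()` is a hypothetical helper, not the notebook's actual code.

```python
# Toy illustration of matching a noisy sample taxon name to a canonical
# NCBI name by nearest-neighbor similarity search. All data is made up.
import numpy as np

def bigram_vector(name, dim=64):
    # Hash character bigrams into a fixed-size count vector, then normalize.
    v = np.zeros(dim)
    s = name.lower()
    for a, b in zip(s, s[1:]):
        v[(ord(a) * 31 + ord(b)) % dim] += 1.0
    return v / (np.linalg.norm(v) or 1.0)

ontology = ["Escherichia coli", "Bacillus subtilis", "Streptomyces griseus"]
ontology_matrix = np.stack([bigram_vector(n) for n in ontology])

def match_taxon(raw_name):
    scores = ontology_matrix @ bigram_vector(raw_name)  # cosine similarities
    return ontology[int(np.argmax(scores))]

match = match_taxon("escherichia  coli.")  # noisy spelling from a sample table
```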

RAG Pipeline Implementation

Each model has an independent RAG implementation script (RAGgpt.py, RAGllama.py, RAGolmo.py, RAGpythia2.py). These scripts share the same database connection and retrieval logic but use different models for answer generation.
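The "shared retrieval, swappable generation" structure of those scripts can be sketched like this. The `retrieve()` stub and the two `generate_*` callables are hypothetical stand-ins for the real FAISS/DuckDB lookup and model calls; only the wiring pattern reflects the design described above.

```python
# Sketch of shared retrieval with a pluggable generator per model.
def retrieve(question, top_k=3):
    # In the real pipeline this queries the shared FAISS index and DuckDB.
    return ["context passage 1", "context passage 2", "context passage 3"][:top_k]

def rag_answer(question, generate):
    contexts = retrieve(question)              # identical for every model
    prompt = "Context:\n" + "\n".join(contexts) + f"\n\nQuestion: {question}"
    return generate(prompt)                    # only this part differs

def generate_llama(prompt):                    # stand-in for a Llama call
    return f"[llama] answer grounded in {prompt.count('passage')} passages"

def generate_pythia(prompt):                   # stand-in for a Pythia call
    return f"[pythia] answer grounded in {prompt.count('passage')} passages"

answers = {name: rag_answer("Which samples contain Bacillus?", gen)
           for name, gen in [("llama", generate_llama),
                             ("pythia", generate_pythia)]}
```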

RAGAS Evaluation Framework

RAGASeval.py provides a unified evaluation tool using five core metrics of the RAGAS framework:

  • Faithfulness: Are the claims in the answer grounded in the retrieved context?
  • Answer Relevancy: Is the answer relevant to the question?
  • Context Recall: Does the retrieved context cover the ground-truth answer?
  • Context Precision: Does the retrieved context avoid irrelevant passages?
  • Answer Correctness: How close is the answer to the ground-truth answer?

These five metrics comprehensively evaluate the performance of the RAG system from two dimensions: retrieval quality and generation quality.
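To make two of these metrics concrete, here are hand-rolled, word-overlap approximations of what faithfulness and context precision measure. RAGAS itself computes these with LLM judgments, so the functions below are illustrations of the idea only, and all inputs are made up.

```python
# Toy, word-overlap versions of two RAGAS-style metrics (illustrative only;
# RAGAS uses LLM-based scoring, not simple set intersection).
def faithfulness_like(answer, contexts):
    # Fraction of answer "claims" (here: sentences) whose words all
    # appear somewhere in the retrieved context.
    context_words = set(" ".join(contexts).lower().split())
    claims = [s.strip() for s in answer.split(".") if s.strip()]
    supported = sum(1 for c in claims
                    if set(c.lower().split()) <= context_words)
    return supported / len(claims) if claims else 0.0

def context_precision_like(contexts, ground_truth):
    # Fraction of retrieved passages sharing any word with the true answer.
    truth_words = set(ground_truth.lower().split())
    relevant = sum(1 for c in contexts
                   if set(c.lower().split()) & truth_words)
    return relevant / len(contexts) if contexts else 0.0

contexts = ["sample s1 contains bacillus subtilis", "unrelated marine record"]
score_f = faithfulness_like(
    "sample s1 contains bacillus subtilis. it is marine.", contexts)
score_p = context_precision_like(contexts, "bacillus subtilis in sample s1")
```

Here the second answer sentence ("it is marine.") is unsupported and the second retrieved passage is irrelevant, so both toy scores come out at 0.5.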


Section 05

Configuration and Customizability

The project provides a flexible configuration mechanism; key parameters can be adjusted during initialization:

  • DuckDB path: document_vectors.db
  • FAISS index path: docs.faiss
  • Embedding model: intfloat/multilingual-e5-large
  • Retrieval Top-K: Default 50
  • OpenAI API key: Environment variable OPENAI_API_KEY

This design allows researchers to easily try different embedding models, adjust retrieval parameters, or adapt to different datasets.
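The parameters listed above might be grouped as a small config object; a minimal sketch follows. The field names mirror the document's defaults, but `RAGConfig` itself is a hypothetical wrapper, not the project's actual class.

```python
# Minimal sketch of the configurable parameters as a dataclass
# (hypothetical wrapper; field defaults taken from the document).
import os
from dataclasses import dataclass

@dataclass
class RAGConfig:
    duckdb_path: str = "document_vectors.db"
    faiss_index_path: str = "docs.faiss"
    embedding_model: str = "intfloat/multilingual-e5-large"
    top_k: int = 50

    @classmethod
    def from_env(cls, **overrides):
        # The OpenAI key is read from the environment, as in the project setup.
        cfg = cls(**overrides)
        cfg.openai_api_key = os.environ.get("OPENAI_API_KEY", "")
        return cfg

config = RAGConfig.from_env(top_k=10)  # e.g. a smaller retrieval window
```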


Section 06

Research Value and Application Scenarios

This research is a useful reference for several audiences:

Academic Research

For scholars engaged in bioinformatics, microbiome research, or biodiversity analysis, this project demonstrates how to apply modern NLP technology to professional field data retrieval. The use of the RAGAS evaluation framework also provides methodological references for similar studies.

Enterprise Applications

For enterprises that need to process large amounts of structured scientific data (such as pharmaceutical companies, agricultural technology companies), this research compares the performance gap between commercial APIs and open-source models, providing data support for technology selection decisions.

Model Developers

For open-source model developers, this research provides benchmark test results in a specific vertical field (biological data retrieval), helping to identify the model's strengths and improvement directions.

Educational Use

The Jupyter notebook format and clear code structure of the project make it an excellent teaching resource for learning RAG system construction and evaluation.


Section 07

Limitations and Future Research Directions

As an academic research project, it has some limitations:

  • Dataset size: Uses relatively small VDP and Fujita datasets; performance on larger-scale data needs to be verified
  • Domain specificity: Focuses on microbiome/biodiversity data; migration to other scientific fields requires additional work
  • Model scope: Only four models are evaluated; a wider comparison (including newer model versions) would provide more valuable insights

Future research directions may include:

  • Expanding to more diverse biomedical datasets
  • Introducing more advanced retrieval technologies (such as multi-vector retrieval, query rewriting)
  • Exploring fine-tuning strategies to improve domain-specific performance
  • Developing an interactive web interface to lower the usage threshold

Section 08

Conclusion

This research project provides valuable empirical data for the application of open-source large language models in professional scientific data retrieval. Through systematic RAGAS evaluation, it not only compares the performance of different models but also demonstrates how to build a reproducible and comparable RAG evaluation framework.

For developers and researchers considering applying RAG technology to vertical fields, this project provides a complete reference implementation covering the entire process from data preparation to quality evaluation. In today's era of rapid improvement in open-source model capabilities, such benchmark research has important practical significance for guiding actual technology selection.