# Research on RAG Applications of Open-Source Large Language Models in Microbiome Data Retrieval

> An academic research project comparing the performance of four large language models (OpenAI GPT, Meta Llama, AI2 OLMo, EleutherAI Pythia) in RAG pipelines for microbiome/biodiversity sample data retrieval tasks, with comprehensive evaluation using the RAGAS framework.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Posted: 2026-04-30T09:36:08.000Z
- Last activity: 2026-04-30T09:58:21.847Z
- Popularity: 161.6
- Keywords: RAG, large language models, microbiome, biodiversity, NCBI, FAISS, RAGAS evaluation, open-source models, scientific data retrieval
- Page URL: https://www.zingnex.cn/en/forum/thread/rag-7a5012dd
- Canonical: https://www.zingnex.cn/forum/thread/rag-7a5012dd
- Markdown source: floors_fallback

---

## 【Introduction】Research on RAG Applications of Open-Source Large Language Models in Microbiome Data Retrieval

### Core Research Content
This research project focuses on the performance of open-source large language models in RAG pipelines for microbiome/biodiversity sample data retrieval tasks. It compares four representative models (OpenAI GPT, Meta Llama 3.2-1B, AI2 OLMo, EleutherAI Pythia-2) and uses the RAGAS framework for comprehensive evaluation.

### Research Objectives
To examine how different large language models perform in RAG tasks, assess whether open-source models can meet the knowledge-retrieval needs of specialized domains, and provide a reference for practical deployment decisions.

### Key Technical Components
A shared retrieval layer (FAISS vector search, DuckDB structured storage, and the intfloat/multilingual-e5-large embedding model) ensures evaluation fairness. Data sources include sample metadata, taxonomic information, and feature tables from the VDP tables and the Fujita dataset.

## Research Background: Intelligent Needs for Biological Data Retrieval

Microbiome research and biodiversity analysis generate massive amounts of structured data, including sample metadata, taxonomic information, and feature tables. These data are usually stored in complex tables and databases, so for non-specialist users, or in scenarios requiring quick access to information, traditional SQL queries and manual lookup are inefficient.

Retrieval-Augmented Generation (RAG) offers a promising way to address this problem. By turning natural language queries into precise retrieval over structured data, RAG systems let researchers ask questions in everyday language and receive accurate answers grounded in the actual data. But how do different large language models perform in RAG tasks? Can open-source models handle knowledge retrieval in specialized domains?

This 2025 academic research project aims to answer these questions. It builds a complete RAG pipeline, evaluates four different open-source large language models on microbiome/biodiversity datasets, and uses the RAGAS framework for comprehensive quality assessment.

## Project Overview: Multi-Model RAG Benchmarking

The core of this research project is a comparable RAG pipeline architecture in which all models share the same retrieval layer and only the generation component is swapped. This design ensures fair evaluation: any performance difference can be attributed to the model itself rather than to changes in retrieval quality.

### Four Models Evaluated
The study selected four representative open-source or open-access models:
1. **OpenAI GPT**: A commercial model accessed via API, used as a performance benchmark
2. **Meta Llama 3.2-1B**: A lightweight open-source model suitable for resource-constrained environments
3. **AI2 OLMo**: A fully open-source academic model with transparent training process
4. **EleutherAI Pythia-2**: A member of the Pythia suite, a research-oriented model family released to support studies of how language models develop during training

This diverse model selection covers different orientations such as commercial API, lightweight open-source, and academic transparency, providing comprehensive references for practical deployment decisions.

### Shared Retrieval Architecture
All RAG pipelines share the following technical components:
- **Vector search**: FAISS for efficient similarity retrieval
- **Structured metadata storage**: DuckDB for sample metadata queries
- **Embedding model**: intfloat/multilingual-e5-large, supporting multilingual text
- **Data sources**: Sample metadata, taxonomic information, and feature tables from VDP tables and the Fujita dataset

This unified infrastructure design ensures the comparability of experiments.
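
As a rough illustration, the shared layer might be wired together as in the sketch below. The `SharedRetriever` class, the `documents` table schema, and the `query: ` prefix handling are assumptions for illustration; the file paths, embedding model, and Top-K default come from the project description.

```python
# Minimal sketch of the shared retrieval layer, assuming a hypothetical
# "documents" table with (id, text) columns in DuckDB.
import duckdb
import faiss
from sentence_transformers import SentenceTransformer

class SharedRetriever:
    def __init__(self, db_path="document_vectors.db", index_path="docs.faiss",
                 model_name="intfloat/multilingual-e5-large", top_k=50):
        self.con = duckdb.connect(db_path)
        self.index = faiss.read_index(index_path)
        self.encoder = SentenceTransformer(model_name)
        self.top_k = top_k

    def retrieve(self, question: str) -> list[str]:
        # e5-family models expect a "query: " prefix on the search side.
        vec = self.encoder.encode([f"query: {question}"], normalize_embeddings=True)
        _, ids = self.index.search(vec, self.top_k)
        id_list = [int(i) for i in ids[0] if i != -1]
        if not id_list:
            return []
        placeholders = ",".join("?" * len(id_list))
        rows = self.con.execute(
            f"SELECT text FROM documents WHERE id IN ({placeholders})", id_list
        ).fetchall()
        return [r[0] for r in rows]
```

Because every model consumes the output of the same `retrieve()` call, differences in answer quality can be traced back to generation alone.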

## Technical Implementation: From Data Preparation to Evaluation Process

The project's codebase clearly organizes each stage of the RAG pipeline:

### Data Preparation Phase
The `download_prep.py` script handles initial data preparation:
- Download the NCBI Taxonomy ontology (about 1 GB)
- Build FAISS indexes for 9 different sentence transformer encoders
- Cache results as .pkl and .index files to speed up subsequent runs

This step is the foundation of the entire process: the NCBI ontology provides an authoritative taxonomic knowledge base, and building indexes for multiple encoders makes it possible to select the best embedding model in later steps.
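
In outline, the per-encoder index building might look like the following sketch; the cache layout, helper name, and `passage: ` prefix are illustrative assumptions, not the script's actual code.

```python
# Illustrative sketch of per-encoder index building and caching. Only the
# caching formats (.pkl, .index) and the e5 encoder come from the project.
import os
import pickle

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

def build_encoder_index(names: list[str], model_name: str, out_dir: str = "cache") -> None:
    os.makedirs(out_dir, exist_ok=True)
    tag = model_name.replace("/", "_")
    model = SentenceTransformer(model_name)
    # e5-family models expect a "passage: " prefix on the document side.
    vecs = model.encode([f"passage: {n}" for n in names],
                        normalize_embeddings=True, show_progress_bar=True)
    index = faiss.IndexFlatIP(vecs.shape[1])  # inner product on unit vectors = cosine
    index.add(np.asarray(vecs, dtype="float32"))
    faiss.write_index(index, os.path.join(out_dir, f"{tag}.index"))
    with open(os.path.join(out_dir, f"{tag}.pkl"), "wb") as f:
        pickle.dump(names, f)  # keep the row -> name mapping alongside the index
```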

### Data Ingestion Pipeline
The `full_pipeline2.ipynb` notebook executes the core data ingestion process:
- Load and normalize sample metadata from VDP and Fujita tables
- Use FAISS similarity search to match taxonomic names in samples with the NCBI ontology
- Generate embedding vectors and store them in DuckDB (`document_vectors.db`) and FAISS (`docs.faiss`)
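
The FAISS-based name matching could, in outline, work like the sketch below; the cache path and the `ontology_names` list are assumptions carried over from the previous sketch.

```python
# Sketch of taxonomy matching: embed each taxon name from a sample table and
# look up its nearest NCBI ontology entry in the cached index.
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("intfloat/multilingual-e5-large")
ncbi_index = faiss.read_index("cache/intfloat_multilingual-e5-large.index")

def match_taxa(taxa: list[str], ontology_names: list[str]) -> list[tuple[str, str, float]]:
    vecs = encoder.encode([f"query: {t}" for t in taxa], normalize_embeddings=True)
    scores, ids = ncbi_index.search(vecs, 1)  # nearest ontology entry per taxon
    return [(taxon, ontology_names[int(i[0])], float(s[0]))
            for taxon, i, s in zip(taxa, ids, scores)]
```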

### RAG Pipeline Implementation
Each model has an independent RAG implementation script (`RAGgpt.py`, `RAGllama.py`, `RAGolmo.py`, `RAGpythia2.py`). These scripts share the same database connection and retrieval logic but use different models for answer generation.
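
A sketch of what one such script might contain follows, reusing the retriever sketched earlier; the prompt template and helper names are illustrative assumptions, and only the Llama 3.2-1B model identifier comes from the project.

```python
# Sketch of a model-specific generation step (here for Llama 3.2-1B).
# Swapping the pipeline's model id would yield the other RAG*.py variants.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-3.2-1B")

def answer(question: str, retriever) -> str:
    contexts = retriever.retrieve(question)  # shared retrieval layer from above
    prompt = ("Answer the question using only the context below.\n\n"
              "Context:\n" + "\n".join(contexts) +
              f"\n\nQuestion: {question}\nAnswer:")
    out = generator(prompt, max_new_tokens=256, do_sample=False)
    return out[0]["generated_text"][len(prompt):].strip()
```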

### RAGAS Evaluation Framework
`RAGASeval.py` provides a unified evaluation tool using five core metrics of the RAGAS framework:

| Metric | Meaning |
|--------|---------|
| Faithfulness | Are the claims in the answer based on the retrieved context? |
| Answer Relevancy | Is the answer relevant to the question? |
| Context Recall | Does the retrieved context cover the ground-truth answer? |
| Context Precision | Is the retrieved context free of irrelevant passages? |
| Answer Correctness | How close is the generated answer to the ground-truth answer? |

These five metrics comprehensively evaluate the performance of the RAG system from two dimensions: retrieval quality and generation quality.
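
With the RAGAS library, such an evaluation might look like the snippet below. The sample data is invented for illustration, and the exact column names vary somewhat across RAGAS versions.

```python
# Hypothetical RAGAS evaluation over one question/answer pair. RAGAS metrics
# are LLM-judged, so an OPENAI_API_KEY (or another configured judge model)
# is required by default.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (answer_correctness, answer_relevancy,
                           context_precision, context_recall, faithfulness)

data = Dataset.from_dict({
    "question": ["Which phylum dominates sample S1?"],          # invented example
    "answer": ["Firmicutes dominates sample S1."],
    "contexts": [["Sample S1: Firmicutes 62%, Bacteroidetes 24%, ..."]],
    "ground_truth": ["Firmicutes"],
})
result = evaluate(data, metrics=[faithfulness, answer_relevancy, context_recall,
                                 context_precision, answer_correctness])
print(result)
```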

## Configuration and Customizability

The project provides a flexible configuration mechanism in which key parameters can be adjusted at initialization:

- **DuckDB path**: `document_vectors.db`
- **FAISS index path**: `docs.faiss`
- **Embedding model**: `intfloat/multilingual-e5-large`
- **Retrieval Top-K**: default 50
- **OpenAI API key**: environment variable `OPENAI_API_KEY`

This design allows researchers to easily try different embedding models, adjust retrieval parameters, or adapt to different datasets.
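
One natural way to express these parameters is a small configuration object; the dataclass below is a sketch (the class itself is an assumption), with defaults taken from the list above.

```python
# Illustrative configuration object; default values follow the project description.
import os
from dataclasses import dataclass, field

@dataclass
class RAGConfig:
    db_path: str = "document_vectors.db"
    index_path: str = "docs.faiss"
    embedding_model: str = "intfloat/multilingual-e5-large"
    top_k: int = 50
    openai_api_key: str = field(
        default_factory=lambda: os.environ.get("OPENAI_API_KEY", ""))
```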

## Research Value and Application Scenarios

This research offers a useful reference for several audiences:

### Academic Research
For scholars engaged in bioinformatics, microbiome research, or biodiversity analysis, this project demonstrates how to apply modern NLP techniques to domain-specific data retrieval. The use of the RAGAS evaluation framework also provides a methodological reference for similar studies.

### Enterprise Applications
For enterprises that need to process large amounts of structured scientific data (such as pharmaceutical companies and agricultural technology companies), this research measures the performance gap between commercial APIs and open-source models, providing empirical support for technology-selection decisions.

### Model Developers
For open-source model developers, this research provides benchmark results in a specific vertical domain (biological data retrieval), helping to identify each model's strengths and directions for improvement.

### Educational Use
The Jupyter notebook format and clear code structure of the project make it an excellent teaching resource for learning RAG system construction and evaluation.

## Limitations and Future Research Directions

As an academic research project, it has some limitations:

- **Dataset size**: Uses relatively small VDP and Fujita datasets; performance on larger-scale data needs to be verified
- **Domain specificity**: Focuses on microbiome/biodiversity data; migration to other scientific fields requires additional work
- **Model scope**: Only evaluates four models; a wider comparison (including newer model versions) would provide more valuable insights

Future research directions may include:
- Expanding to more diverse biomedical datasets
- Introducing more advanced retrieval technologies (such as multi-vector retrieval, query rewriting)
- Exploring fine-tuning strategies to improve domain-specific performance
- Developing an interactive web interface to lower the barrier to entry

## Conclusion

This research project provides valuable empirical data for the application of open-source large language models in professional scientific data retrieval. Through systematic RAGAS evaluation, it not only compares the performance of different models but also demonstrates how to build a reproducible and comparable RAG evaluation framework.

For developers and researchers considering applying RAG technology to vertical domains, this project provides a complete reference implementation covering the entire process from data preparation to quality evaluation. As open-source model capabilities improve rapidly, benchmark studies like this one offer practical guidance for real-world technology selection.
