Section 01
Introduction: Research on RAG Applications of Open-Source Large Language Models in Microbiome Data Retrieval
Core Research Content
This research project examines how open-source large language models perform in RAG pipelines for microbiome/biodiversity sample data retrieval tasks. It compares four representative models (OpenAI GPT, Meta Llama 3.2-1B, AI2 OLMo, EleutherAI Pythia-2) and evaluates them comprehensively with the RAGAS framework.
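RAGAS scores RAG outputs along metrics such as faithfulness, answer relevancy, context precision, and context recall, typically using an LLM judge. As a hedged illustration only, the sketch below shows the shape of one evaluation sample and a simplified, non-LLM stand-in for context recall (the fraction of ground-truth statements found verbatim in the retrieved contexts); the sample fields and values are hypothetical, not taken from the study's dataset.

```python
# Hypothetical evaluation sample in the question/contexts/answer shape
# that RAGAS-style evaluation expects; contents are illustrative only.
sample = {
    "question": "Which habitat does sample X come from?",
    "contexts": ["Sample X was collected from a soil habitat in 2021."],
    "answer": "Sample X comes from a soil habitat.",
    "ground_truth_statements": ["sample X", "soil habitat"],
}

def simple_context_recall(sample: dict) -> float:
    """Toy approximation of context recall: the share of ground-truth
    statements that appear verbatim in the retrieved contexts. The real
    RAGAS metric uses an LLM to check statement support, not substring
    matching."""
    ctx = " ".join(sample["contexts"]).lower()
    hits = sum(1 for s in sample["ground_truth_statements"] if s.lower() in ctx)
    return hits / len(sample["ground_truth_statements"])

score = simple_context_recall(sample)
```

Here both ground-truth statements occur in the single retrieved context, so the toy recall is 1.0; a retrieval miss would lower the score proportionally.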
Research Objectives
The project examines how different large language models perform in RAG tasks and whether open-source models can meet the knowledge-retrieval needs of specialized domains, providing a reference for practical deployment decisions.
Key Technical Components
A shared retrieval layer (FAISS vector search, DuckDB structured storage, and the intfloat/multilingual-e5-large embedding model) ensures that all models are evaluated fairly against identical retrieved contexts. Data sources include sample metadata and taxonomic information from VDP tables and the Fujita dataset.
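The retrieval mechanics of such a shared layer can be sketched as follows. Assumptions: the real pipeline embeds text with intfloat/multilingual-e5-large, searches a FAISS inner-product index over L2-normalized vectors, and looks up structured metadata in DuckDB; here tiny hand-made vectors and plain NumPy inner products stand in for both the embedding model and FAISS, and the sample IDs are hypothetical.

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """L2-normalize rows so that inner product equals cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Toy "document" embeddings for three samples (hypothetical IDs); in the
# real pipeline these would be e5-large embeddings of sample records.
doc_ids = ["sample_A", "sample_B", "sample_C"]
doc_vecs = normalize(np.array([
    [1.0, 0.1],   # sample_A
    [0.1, 1.0],   # sample_B
    [0.7, 0.7],   # sample_C
]))

def retrieve(query_vec: np.ndarray, k: int = 2) -> list[str]:
    """Return the IDs of the k nearest documents by cosine similarity,
    mimicking an inner-product search over normalized vectors."""
    scores = doc_vecs @ normalize(query_vec)
    top = np.argsort(-scores)[:k]
    return [doc_ids[i] for i in top]

top_hits = retrieve(np.array([0.9, 0.2]))
```

The returned IDs would then key a DuckDB lookup for the sample metadata that is passed to each model as retrieved context, which is what keeps the comparison across models fair.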