Zing Forum

Systematic Evaluation of RAG Technology in the Field of Space Missions

This article analyzes a comprehensive evaluation study of Retrieval-Augmented Generation (RAG) systems in the aerospace field. The study compares retrieval strategies, embedding models, and re-rankers, and assesses the answer quality of large language models, offering an empirical reference for AI applications in high-risk domains.

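RAG, retrieval-augmented generation, aerospace, embedding models, re-ranking, BM25, BGE-M3, large language models, knowledge retrieval, domain-specific AI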
Published 2026-05-24 02:44 | Recent activity 2026-05-24 02:47 | Estimated read 6 min

Section 01

[Introduction] Systematic Evaluation of RAG Technology in the Aerospace Field

This is a comprehensive evaluation study on Retrieval-Augmented Generation (RAG) systems in the aerospace field, conducted by a joint team from Portugal's NOVA LINCS Laboratory, Neuraspace, and the Technical University of Munich. The source is the GitHub project "rag-space-eval" (released on May 23, 2026). The study covers comparative analyses of retrieval strategies, embedding models, re-rankers, and the answer quality of large language models, providing important empirical references for AI applications in high-risk domains.

Section 02

Research Background: Knowledge Management Challenges in the Aerospace Field

Space mission operations are complex and time-sensitive: engineers must work through large volumes of heterogeneous documents and retrieve accurate information quickly. Traditional document retrieval struggles to meet these needs, and RAG offers a way to address them. The study systematically evaluates the components of the RAG technology stack against the specific needs of the aerospace field, filling an evaluation gap for this domain.

Section 03

Research Objectives and Evaluation Framework

The core objective is to establish an evaluation framework for RAG systems in the aerospace field, built on experiments along four dimensions:

  1. Comparison of retrieval strategies: Strengths and weaknesses of sparse retrieval (BM25) versus dense retrieval (vector embeddings)
  2. Selection of embedding models: 8 advanced models from the MMTEB leaderboard (including BGE-M3 and the Qwen series)
  3. Evaluation of re-rankers: An ensemble of 3 models (BGE-M3, GTE reranker-base, Jina reranker-v2), used to reduce bias
  4. Analysis of answer quality: Evaluation of the accuracy and reliability of large language models in professional Q&A
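The contrast between sparse and dense retrieval in item 1 can be made concrete with a toy sketch. The code below is not taken from the study: it is a minimal pure-Python Okapi BM25 scorer (sparse, term-matching) next to a cosine similarity function of the kind used to compare dense embedding vectors. The whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the sample documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25 (toy version)."""
    tokenized = [d.lower().split() for d in docs]  # assumption: whitespace tokens
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # sparse retrieval only rewards exact term matches
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity between two dense vectors (e.g. embeddings)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical mini-corpus, for illustration only.
docs = [
    "satellite collision avoidance maneuver",
    "thermal control subsystem",
    "collision probability screening",
]
scores = bm25_scores("collision avoidance", docs)
```

In this toy corpus the second document shares no term with the query and scores zero under BM25, whereas a dense model could still assign it a non-zero similarity. That difference is what the sparse-versus-dense comparison probes.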

Section 04

Ensemble Evaluation Strategy for Re-rankers

The study combines several re-rankers into an ensemble to validate re-ranker effectiveness. This avoids the bias that can arise when a single re-ranker is tied to the same model ecosystem as one of the embedding models under test. Experimental results show that on the Golden-Offset and Golden-Aligned test subsets, all re-rankers maintain high F1 scores and accuracy, indicating that the relevance signals for document retrieval in the aerospace field are stable and reliable enough to support downstream quality evaluation.
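The article does not say how the outputs of the three re-rankers are combined. As one plausible illustration only, the sketch below averages each passage's rank across models (a Borda-style aggregation), which sidesteps the fact that different re-rankers produce scores on different scales. The function name `ensemble_rank`, the model keys, and the scores are hypothetical.

```python
def ensemble_rank(score_lists):
    """Combine per-model relevance scores by averaging ranks.

    score_lists maps a model name to a dict of passage id -> score
    (higher is better). Every model must score the same passages.
    Returns passage ids sorted best first by mean rank.
    """
    passage_ids = sorted(next(iter(score_lists.values())))
    mean_rank = {pid: 0.0 for pid in passage_ids}
    for scores in score_lists.values():
        # Rank 1 is the passage this model considers most relevant.
        ordered = sorted(passage_ids, key=lambda pid: scores[pid], reverse=True)
        for rank, pid in enumerate(ordered, start=1):
            mean_rank[pid] += rank / len(score_lists)
    return sorted(passage_ids, key=lambda pid: mean_rank[pid])


# Hypothetical scores from three re-rankers on three passages.
example_scores = {
    "reranker_a": {"p1": 0.9, "p2": 0.2, "p3": 0.5},
    "reranker_b": {"p1": 3.0, "p2": 1.0, "p3": 2.5},
    "reranker_c": {"p1": 0.4, "p2": 0.1, "p3": 0.6},
}
consensus = ensemble_rank(example_scores)
```

Here two of the three models prefer `p1` and one prefers `p3`, so the consensus order puts `p1` first. No single model's preference decides the outcome, which is the bias-reduction idea behind using an ensemble.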

Section 05

In-depth Comparative Analysis of Embedding Models

The comparison covers eight embedding models plus a BM25 baseline. To build an approximate ground truth, BM25 first retrieves the top 100 passages, and the re-ranker ensemble then scores them. The metrics are recall, precision, NDCG, and Kendall's tau, measured at chunk sizes of 2000 and 512 tokens. Key findings: BM25 stands out for recall and efficiency, while dense models such as BGE-M3 and the Qwen series achieve better ranking quality (NDCG).
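For readers less familiar with these metrics, the following sketch shows one common way to compute them from a ranked list of passage ids. It is a generic illustration, not the study's evaluation code: the function names, the use of graded gains for NDCG, and the tie-free Kendall's tau are assumptions.

```python
import math
from itertools import combinations


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant passages that appear in the top k."""
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant) if relevant else 0.0


def precision_at_k(ranked, relevant, k):
    """Fraction of the top k passages that are relevant."""
    return len(set(ranked[:k]) & relevant) / k


def ndcg_at_k(ranked, gains, k):
    """Normalized discounted cumulative gain with graded relevance gains."""
    dcg = sum(gains.get(pid, 0) / math.log2(i + 2)
              for i, pid in enumerate(ranked[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0


def kendall_tau(order_a, order_b):
    """Kendall's tau between two rankings of the same items (no ties)."""
    pos_a = {pid: i for i, pid in enumerate(order_a)}
    pos_b = {pid: i for i, pid in enumerate(order_b)}
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):
        # A pair is concordant when both rankings order it the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / pairs if pairs else 0.0


# Hypothetical reference ranking versus a model's ranking.
reference = ["p1", "p2", "p3", "p4"]
model_order = ["p2", "p1", "p3", "p4"]
gains = {"p1": 3, "p2": 2, "p3": 1, "p4": 0}
```

Recall and precision only ask whether relevant passages are present in the top k, while NDCG and Kendall's tau are sensitive to their order. This is why a retriever can lead on recall while another leads on ranking quality.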

Section 06

Impact Analysis of Chunk Size and Re-ranking

Passages were scored on a 0-3 relevance scale (0 = irrelevant, 3 = highly relevant) for the Top-3, Top-5, Top-7, and Top-10 results at both chunk sizes:

  1. Re-ranking significantly improves relevance: it reduces the proportion of low-relevance passages (scores 0 and 1) and increases the proportion of highly relevant ones (score 3). The improvement is more pronounced with 512-token chunks (e.g., the proportion of score-3 passages in the Top-3 rises from 42.54% to 48.37%)
  2. The proportion of moderately relevant passages (score 2) does not follow the same pattern as the other score levels, so how these passages are handled deserves separate attention.
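The percentages above are shares of each relevance score among the top-k passages, pooled over queries. A minimal sketch of that calculation follows. The function name `score_distribution` and the sample scores are hypothetical, and the numbers it produces are not the study's results.

```python
from collections import Counter


def score_distribution(per_query_scores, k):
    """Percentage of each relevance score (0-3) among top-k passages.

    per_query_scores holds one list per query, with the relevance scores
    of that query's passages in ranked order.
    """
    counts = Counter()
    for scores in per_query_scores:
        counts.update(scores[:k])  # only the top k passages count
    total = sum(counts.values())
    if total == 0:
        return {score: 0.0 for score in range(4)}
    return {score: 100.0 * counts[score] / total for score in range(4)}


# Hypothetical relevance scores for two queries, before and after re-ranking.
before = [[3, 1, 0, 2], [2, 3, 1, 0]]
after = [[3, 3, 2, 1], [3, 2, 2, 0]]
dist_before = score_distribution(before, k=3)
dist_after = score_distribution(after, k=3)
```

Comparing `dist_before` and `dist_after` at several values of k gives the kind of before-and-after table the section describes.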

Section 07

Practical Application Insights and Future Outlook

Practical recommendations:

  1. Architecture selection: For retrieval + re-ranking pipelines, prioritize BM25 (high recall, low latency); for retrieval-only scenarios, use dense models such as BGE-M3
  2. Chunk strategy: 512-token fine-grained chunks yield better results after re-ranking
  3. Ensemble method: A re-ranker ensemble reduces bias and can be extended to other high-risk domains

This study provides a methodological reference for RAG applications in professional fields such as healthcare and law. Future work needs to address the challenges of reliable knowledge retrieval and generation in specific domains.
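As a small illustration of the chunk-size recommendation, the sketch below splits a document into fixed-size chunks. It assumes whitespace-separated words as a stand-in for tokens. The article does not say which tokenizer the study used or whether its chunks overlap, so `chunk_tokens` is a hypothetical helper rather than the study's method.

```python
def chunk_tokens(text, chunk_size=512):
    """Split text into consecutive chunks of at most chunk_size tokens.

    Assumption: tokens are whitespace-separated words. A real pipeline
    would normally count tokens with the embedding model's own tokenizer.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[start:start + chunk_size])
        for start in range(0, len(tokens), chunk_size)
    ]


# Hypothetical 1200-token document.
document = " ".join(f"t{i}" for i in range(1200))
fine_chunks = chunk_tokens(document, chunk_size=512)
coarse_chunks = chunk_tokens(document, chunk_size=2000)
```

With these inputs the 512-token setting yields three chunks and the 2000-token setting yields one, which shows how the finer setting gives a re-ranker more, smaller candidates to order.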