# Systematic Evaluation of RAG Technology in the Field of Space Missions

> This article summarizes an evaluation study of Retrieval-Augmented Generation (RAG) systems for space mission operations. The study compares retrieval strategies, embedding models, and re-rankers, and assesses the answer quality of large language models, offering a reference for AI applications in high-risk domains.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-23T18:44:19.000Z
- Last activity: 2026-05-23T18:47:30.334Z
- Popularity: 154.9
- Keywords: RAG, retrieval-augmented generation, aerospace, embedding models, re-ranking, BM25, BGE-M3, large language models, knowledge retrieval, domain-specific AI
- Page link: https://www.zingnex.cn/en/forum/thread/rag-aab3a866
- Canonical: https://www.zingnex.cn/forum/thread/rag-aab3a866

---

## [Introduction] Systematic Evaluation of RAG Technology in the Aerospace Field

This thread covers an evaluation study of Retrieval-Augmented Generation (RAG) systems in the aerospace field, conducted jointly by Portugal's NOVA LINCS Laboratory, Neuraspace, and the Technical University of Munich. The source is the GitHub project "rag-space-eval" (released on May 23, 2026). The study compares retrieval strategies, embedding models, and re-rankers, and assesses the answer quality of large language models, providing empirical reference points for AI applications in high-risk domains.

## Research Background: Knowledge Management Challenges in the Aerospace Field

Space mission operations are complex and time-sensitive. Engineers must work through large volumes of heterogeneous documents and still find accurate information quickly, which traditional document retrieval struggles to support. RAG offers a way to address this. The study systematically evaluates the individual components of the RAG stack against the needs of the aerospace field, a domain for which such evaluations have been lacking.

## Research Objectives and Evaluation Framework

The core objective is to establish an evaluation framework for RAG systems in the aerospace field. The experiments cover four dimensions:
1. Retrieval strategies: strengths and weaknesses of sparse retrieval (BM25) versus dense retrieval (vector embeddings)
2. Embedding model selection: 8 models from the MMTEB leaderboard, including BGE-M3 and the Qwen series
3. Re-ranker evaluation: an ensemble of 3 models (BGE-M3, GTE reranker-base, Jina reranker-v2), combined to reduce bias
4. Answer quality: accuracy and reliability of large language models on domain-specific Q&A
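
For readers less familiar with the sparse side of the first comparison, the following is a minimal sketch of Okapi BM25 scoring in plain Python. The `k1` and `b` defaults, the smoothed IDF variant, the whitespace tokenization, and the sample documents are all assumptions made for illustration; the study's actual BM25 configuration is not described here.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF (assumed variant); always positive.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical mini-corpus, tokenized on whitespace.
docs = [
    "thruster anomaly during orbit raising".split(),
    "ground station pass schedule".split(),
    "collision avoidance manoeuvre thruster burn".split(),
]
scores = bm25_scores("thruster anomaly".split(), docs)
```

The first document matches both query terms and scores highest, the third matches one term, and the second matches none and scores zero. Dense retrieval replaces this lexical matching with similarity between learned vectors, which is why the two approaches can rank the same passages differently.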

## Ensemble Evaluation Strategy for Re-rankers

To check that the re-rankers are effective, the study combines them into an ensemble rather than relying on one model. This avoids the bias that can arise when a single re-ranker favors embedding models from its own ecosystem. On the Golden-Offset and Golden-Aligned test subsets, all re-rankers maintain high F1 scores and accuracy. This indicates that the relevance signals for aerospace document retrieval are stable and reliable enough to serve as the basis for downstream quality evaluation.
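
One simple way to combine several re-rankers into a single ranking is to rescale each model's scores to a common range and average them. The sketch below illustrates that idea only; the source does not state which aggregation rule the study uses, so min-max normalization and the mean are assumptions, and the scores and passage ids are made up.

```python
def minmax(scores):
    """Rescale one re-ranker's scores to [0, 1] so models are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {pid: 0.0 for pid in scores}
    return {pid: (s - lo) / (hi - lo) for pid, s in scores.items()}

def ensemble_rank(score_sets):
    """Average normalized scores across re-rankers and sort best-first."""
    normalized = [minmax(s) for s in score_sets]
    passages = list(normalized[0])
    mean = {p: sum(n[p] for n in normalized) / len(normalized) for p in passages}
    return sorted(passages, key=mean.get, reverse=True)

# Hypothetical raw scores from three re-rankers for the same four passages.
# Each model uses its own scale, which is why normalization comes first.
reranker_a = {"p1": 0.91, "p2": 0.40, "p3": 0.75, "p4": 0.10}
reranker_b = {"p1": 2.1, "p2": -0.5, "p3": 3.0, "p4": -1.2}
reranker_c = {"p1": 0.60, "p2": 0.55, "p3": 0.50, "p4": 0.05}

ranking = ensemble_rank([reranker_a, reranker_b, reranker_c])
```

Here the second re-ranker prefers `p3` over `p1`, but the averaged ranking still places `p1` first because the other two models agree on it. Averaging in this way dampens the preferences of any one model.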

## In-depth Comparative Analysis of Embedding Models

Eight embedding models were compared against a BM25 baseline. An approximate ground truth is built in two steps: BM25 retrieves the top 100 passages, and the combined re-rankers then re-score them. The evaluation metrics are recall, precision, NDCG, and Kendall's tau, each measured at chunk sizes of 2000 and 512 tokens. Key findings: BM25 stands out for recall and efficiency, while dense models such as BGE-M3 and the Qwen series achieve better ranking quality (NDCG).
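
The four metrics can be sketched in a few lines of plain Python. This is an illustrative implementation under stated assumptions, not the study's evaluation code: NDCG uses linear gains with a log2 discount, Kendall's tau is the tau-a form without tie handling, and the passage ids and relevance grades are invented.

```python
import math
from itertools import combinations

def precision_recall_at_k(retrieved, relevant, k):
    """Precision and recall of the top-k retrieved ids against a relevant set."""
    hits = sum(1 for p in retrieved[:k] if p in relevant)
    return hits / k, hits / len(relevant)

def ndcg_at_k(retrieved, gains, k):
    """`gains` maps passage id to graded relevance; unknown ids count as 0."""
    dcg = sum(gains.get(p, 0) / math.log2(i + 2) for i, p in enumerate(retrieved[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0

def kendall_tau(order_a, order_b):
    """Tau-a over the items that appear in both rankings (no ties assumed)."""
    pos_b = {p: i for i, p in enumerate(order_b)}
    common = [p for p in order_a if p in pos_b]
    concordant = discordant = 0
    for x, y in combinations(common, 2):  # x precedes y in order_a
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / pairs if pairs else 0.0

# Hypothetical reference ranking (approximate ground truth) and model output.
reference = ["p1", "p3", "p2", "p4"]
gains = {"p1": 3, "p3": 2, "p2": 1, "p4": 0}
relevant = {"p1", "p3"}
retrieved = ["p3", "p1", "p5", "p2"]

precision, recall = precision_recall_at_k(retrieved, relevant, k=3)
ndcg = ndcg_at_k(retrieved, gains, k=4)
tau = kendall_tau(reference, retrieved)
```

Recall and precision only ask whether relevant passages appear in the top k, whereas NDCG and Kendall's tau are sensitive to their order. That difference is what allows one retriever to lead on recall while another leads on ranking quality.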

## Impact Analysis of Chunk Size and Re-ranking

Passages were scored on a 0-3 relevance scale (0 = irrelevant, 3 = highly relevant). The top-3, top-5, top-7, and top-10 results were examined at both chunk sizes:
1. Re-ranking clearly improves relevance: the share of low-relevance passages (scores 0 and 1) falls and the share of highly relevant passages (score 3) rises. The improvement is more pronounced with 512-token chunks. For example, the share of score-3 passages in the top 3 rises from 42.54% to 48.37%, a gain of 5.83 percentage points.
2. Moderately relevant passages (score 2) follow a different pattern from the other score levels, so how they are handled deserves separate attention.
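
The kind of comparison described above amounts to counting how often each score appears among the top-k passages, before and after re-ranking. The following sketch shows that calculation; the function name and the relevance labels are made up for illustration and do not reproduce the study's data.

```python
from collections import Counter

def score_shares(scores_per_query, k):
    """Share of each 0-3 relevance score among the top-k passages of all queries."""
    counts = Counter(s for scores in scores_per_query for s in scores[:k])
    total = sum(counts.values())
    return {score: counts.get(score, 0) / total for score in range(4)}

# Hypothetical relevance labels, one list per query, in ranked order.
before_rerank = [[3, 1, 0, 2, 1], [2, 3, 1, 0, 0]]
after_rerank = [[3, 3, 2, 1, 0], [3, 2, 2, 1, 0]]

shares_before = score_shares(before_rerank, k=3)
shares_after = score_shares(after_rerank, k=3)

# Change in the share of highly relevant passages, in percentage points.
delta_score3 = (shares_after[3] - shares_before[3]) * 100
```

Repeating the calculation for k in 3, 5, 7, and 10 and for each chunk size gives the full grid of distributions that the comparison relies on.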

## Practical Application Insights and Future Outlook

Practical recommendations:
1. Architecture selection: for retrieval + re-ranking pipelines, prefer BM25 as the first stage (high recall, low latency); for retrieval-only scenarios, use a dense model such as BGE-M3
2. Chunk strategy: fine-grained 512-token chunks yield better results after re-ranking
3. Ensemble method: combining several re-rankers reduces bias, and the approach can be extended to other high-risk domains

The study also offers a methodological reference for RAG applications in other professional fields such as healthcare and law. Future work needs to address the remaining challenges of reliable knowledge retrieval and generation in specialized domains.
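
The first two recommendations can be put together in a minimal two-stage sketch: split documents into fixed-size chunks, retrieve a broad candidate set with a cheap scorer, then re-rank only those candidates. Everything here is a stand-in chosen for illustration: tokens are whitespace-separated words rather than model tokens, `overlap` plays the role of BM25, and `density` plays the role of a neural re-ranker.

```python
from typing import Callable, List, Sequence

def chunk_tokens(tokens: Sequence[str], size: int = 512) -> List[List[str]]:
    """Split a token sequence into consecutive, non-overlapping chunks."""
    return [list(tokens[i:i + size]) for i in range(0, len(tokens), size)]

def retrieve_then_rerank(
    query: str,
    chunks: List[str],
    retrieve_score: Callable[[str, str], float],
    rerank_score: Callable[[str, str], float],
    n_candidates: int = 100,
    top_k: int = 5,
) -> List[str]:
    """Stage 1 keeps the best `n_candidates`; stage 2 re-orders only those."""
    candidates = sorted(
        chunks, key=lambda c: retrieve_score(query, c), reverse=True
    )[:n_candidates]
    return sorted(
        candidates, key=lambda c: rerank_score(query, c), reverse=True
    )[:top_k]

def overlap(query: str, chunk: str) -> float:
    """Toy first-stage scorer: number of shared words."""
    return float(len(set(query.split()) & set(chunk.split())))

def density(query: str, chunk: str) -> float:
    """Toy second-stage scorer: shared words relative to chunk length."""
    return overlap(query, chunk) / len(chunk.split())

# 600 placeholder tokens split into a full chunk and a remainder.
pieces = chunk_tokens(["telemetry"] * 600, size=512)

# Hypothetical chunks for the two-stage pipeline.
corpus = [
    "thruster anomaly report for orbit raising phase",
    "thruster anomaly",
    "ground station schedule",
]
top = retrieve_then_rerank("thruster anomaly", corpus, overlap, density,
                           n_candidates=2, top_k=1)
```

Passing the scorers in as functions keeps the pipeline independent of any particular retriever or re-ranker, so a BM25 implementation and a real re-ranking model could be swapped in without changing the control flow.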
