# TORU and SOTO RAG System: A Retrieval-Augmented Generation Q&A System for Enterprise Website Content

> A Retrieval-Augmented Generation (RAG) system combining semantic search and large language models, supporting crawling, chunking, indexing, and intelligent Q&A for enterprise website content, providing context-aware and accurate answers for robot interaction scenarios.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-06-10T13:15:55.000Z
- Last activity: 2026-06-10T13:21:38.132Z
- Popularity: 161.9
- Keywords: RAG, Retrieval-Augmented Generation, large language models, semantic search, vector database, enterprise knowledge base, Q&A system, Magazino, robotics
- Page link: https://www.zingnex.cn/en/forum/thread/torusoto-rag
- Canonical: https://www.zingnex.cn/forum/thread/torusoto-rag
- Markdown source: floors_fallback

---

## Introduction: Core Overview of the TORU and SOTO RAG System

The TORU and SOTO RAG system is a starter template for a Retrieval-Augmented Generation (RAG) pipeline built around the product line of Magazino, a German robotics company. It combines semantic search with a large language model and covers crawling, chunking, indexing, and Q&A over enterprise website content, with the goal of giving context-aware, accurate answers in robot interaction scenarios. The project demonstrates an end-to-end approach to building an enterprise knowledge Q&A pipeline.

## Background: RAG Technology and Enterprise Scenario Requirements

Retrieval-Augmented Generation (RAG) pairs a generative model with an external knowledge base in order to mitigate three weaknesses of LLMs: stale knowledge, limited domain expertise, and hallucination. Magazino's TORU and SOTO autonomous mobile robots are used in warehouse logistics, where on-site staff need quick access to product documents, technical specifications, and similar information. This RAG system is designed for that scenario.

## Methodology: Five Key Phases of the System Architecture

The core process of this RAG system is divided into five phases:

1. Web content crawling: recursively crawl content from specified website URLs, handling link discovery, content filtering, deduplication, and rate control.
2. Text cleaning and preprocessing: extract the main text content, remove HTML tags and noise, and standardize formatting.
3. Text chunking and embedding indexing: chunk the text along semantic boundaries, generate embedding vectors, and store them in an SQLite database.
4. Semantic retrieval and context assembly: embed the user's question, run a similarity search, and assemble the relevant context.
5. LLM answer generation: generate answers from the retrieved context, strictly aligned with the enterprise content.
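Phases 3 and 4 can be illustrated with a minimal sketch. The source only states that embeddings are stored in SQLite; everything else here is an assumption made for illustration: the `chunks` table layout, storing vectors as JSON text, the helper names, and above all `toy_embed`, a hashed bag-of-words stand-in for a real embedding model so that the example runs without external dependencies.

```python
import hashlib
import json
import math
import sqlite3

DIM = 64  # size of the toy vectors; a real embedding model dictates its own dimension


def toy_embed(text: str) -> list[float]:
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(conn: sqlite3.Connection, chunks: list[str]) -> None:
    """Phase 3: embed each chunk and store text plus vector in SQLite (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        "id INTEGER PRIMARY KEY, text TEXT NOT NULL, embedding TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO chunks (text, embedding) VALUES (?, ?)",
        [(chunk, json.dumps(toy_embed(chunk))) for chunk in chunks],
    )
    conn.commit()


def retrieve(conn: sqlite3.Connection, question: str, top_k: int = 2) -> list[tuple[float, str]]:
    """Phase 4: embed the question and rank stored chunks by cosine similarity."""
    query = toy_embed(question)
    scored = []
    for text, emb_json in conn.execute("SELECT text, embedding FROM chunks"):
        emb = json.loads(emb_json)
        # Both vectors are unit length, so the dot product equals cosine similarity.
        scored.append((sum(a * b for a, b in zip(query, emb)), text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def assemble_prompt(question: str, hits: list[tuple[float, str]]) -> str:
    """Phase 4, continued: join the retrieved chunks into a context block for the LLM."""
    context = "\n\n".join(text for _, text in hits)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

A brute-force scan over all rows is adequate for a small site; a larger corpus would likely call for a dedicated vector index, which this sketch does not attempt.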

## Project Structure and Usage

The project adopts a layered architecture with the following core modules:

- `main.py`: main entry point
- `scraper.py`: crawling
- `cleaner.py`: cleaning
- `ingest.py`: embedding and indexing
- `qa.py`: Q&A

The data directory is divided into `raw` (original crawled data), `cleaned` (cleaned data), and `embeddings` (vector database).

Usage: to build the index, run `python -m src.main --ingest`; to ask a question, run `python -m src.main --ask "question"`.

## Application Scenarios and Value Analysis

Typical application scenarios of this system include:

1. Enterprise internal knowledge base Q&A: employees quickly look up product documents and other information.
2. Customer service automation: customer service bots use the system to answer customer inquiries.
3. On-site technical support: staff working with warehouse robots look up troubleshooting guides and other materials.
4. Training and learning assistance: new employees get up to speed on product technologies.

## Technical Key Points and Best Practices

Key technical points include:

1. Chunking strategy: recursive character chunking that preserves the title hierarchy is recommended.
2. Embedding model selection: balance quality against cost (e.g., OpenAI's `text-embedding-ada-002` or open-source sentence-transformers models).
3. Retrieval accuracy optimization: query expansion, re-ranking models, and hybrid retrieval (keyword + semantic).
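The chunking recommendation can be sketched as follows. This is one simplified reading of "recursive character chunking that preserves title hierarchy", not the project's implementation: text is split on progressively finer separators until each piece fits a size limit, and every chunk carries the trail of Markdown headings above it. The separator list, the 500-character default, and the function names are all assumptions.

```python
import re

SEPARATORS = ["\n\n", "\n", ". ", " "]  # coarse to fine; an assumed ordering
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def recursive_split(text: str, max_chars: int, separators=None) -> list[str]:
    """Split text into pieces of at most max_chars, trying coarse separators first."""
    if separators is None:
        separators = SEPARATORS
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # Nothing left to split on: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        if not part.strip():
            continue
        candidate = f"{current}{sep}{part}" if current else part
        if len(candidate) <= max_chars:
            current = candidate  # keep packing parts into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= max_chars:
            current = part
        else:
            # The part alone is too long: recurse with the finer separators.
            chunks.extend(recursive_split(part, max_chars, finer))
    if current:
        chunks.append(current)
    return chunks


def chunk_with_headings(markdown: str, max_chars: int = 500) -> list[dict]:
    """Chunk Markdown text, tagging each chunk with its heading trail."""
    path: list[str] = []  # current heading trail, e.g. ["TORU", "Specs"]
    body: list[str] = []
    out: list[dict] = []

    def flush() -> None:
        for piece in recursive_split("\n".join(body), max_chars):
            out.append({"headings": list(path), "text": piece})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this level or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return out
```

Keeping the heading trail as separate metadata, rather than prepending it to the chunk text, leaves the choice open of whether to embed it together with the text or only show it alongside the retrieved context.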

## Limitations and Expansion Directions

Current limitations and expansion directions of the project:

1. Modules to be implemented: core modules such as the scraper and cleaner need to be customized for the target website.
2. Incremental updates: add incremental indexing so that changes to website content can be picked up.
3. Multimodal support: extend to images, video, and other formats.
4. Conversation history management: keep coherent context across multi-turn dialogues.
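One common way to approach incremental updates is to store a content hash per page and re-index only pages whose hash has changed. The sketch below assumes that approach; the project does not specify a mechanism, and the function names and the URL-to-hash mapping are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Fingerprint of a page's cleaned text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_incremental_update(
    indexed: dict[str, str],   # url -> hash recorded at the last ingest
    crawled: dict[str, str],   # url -> cleaned text from the current crawl
) -> dict[str, list[str]]:
    """Decide which pages to add, re-embed, delete, or leave untouched."""
    plan: dict[str, list[str]] = {"add": [], "update": [], "delete": [], "skip": []}
    for url, text in crawled.items():
        if url not in indexed:
            plan["add"].append(url)
        elif indexed[url] != content_hash(text):
            plan["update"].append(url)
        else:
            plan["skip"].append(url)
    # Pages that were indexed before but no longer appear in the crawl.
    plan["delete"] = sorted(set(indexed) - set(crawled))
    return plan
```

Only the pages listed under `add` and `update` would need to be chunked and embedded again, which is where most of the cost of a rebuild lies.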

## Conclusion: RAG as Infrastructure for LLM Applications

The TORU and SOTO RAG system demonstrates a typical implementation of the RAG architecture: it connects general LLM capabilities with domain-specific knowledge to improve the accuracy and timeliness of answers. RAG has become a standard architecture for enterprise LLM applications, and this project provides a clear starting point. Going forward, RAG is expected to be combined with technologies such as multimodality and agents, and to remain a key bridge between models and enterprise knowledge.
