Zing Forum

New Paradigm for Intelligent Q&A on Government Documents: Technical Analysis of South Africa's Budget RAG Chatbot

This article analyzes a question-answering system for South Africa's national budget documents built on Retrieval-Augmented Generation (RAG). It shows how the RAG architecture lets a large language model answer cross-year budget queries accurately by grounding its answers in the official PDF documents.

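Tags: RAG, Retrieval-Augmented Generation, Government Documents, Budget Analysis, Vector Database, ChromaDB, LangChain, LLaMA 3, Question-Answering System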
Published 2026-04-25 03:45 | Recent activity 2026-04-25 03:52 | Estimated read 5 min

Section 01

[Introduction] Technical Analysis of South Africa's Budget RAG Chatbot: A New Paradigm for Intelligent Q&A on Government Documents

This article introduces a question-answering system for South Africa's national budget documents based on Retrieval-Augmented Generation (RAG). The system makes complex government budget PDFs queryable by ordinary users and supports features such as cross-year budget comparison. Its tech stack includes ChromaDB, LangChain, and LLaMA 3, and it offers a new paradigm for intelligent Q&A on government documents.

Section 02

Project Background and Requirements

Government transparency depends on accessible budget information, yet South Africa's budget documents are published as PDFs that often run to hundreds of pages, which makes it hard for non-specialists to extract what they need. This project uses RAG to bridge that gap: users query the 2023-2026 budget documents in natural language instead of searching the files by hand.

Section 03

Core Technical Architecture and Components

The system adopts a classic RAG architecture. At indexing time, PyPDF loads each PDF, the text is split into chunks, Sentence Transformers generates an embedding for each chunk, and the embeddings are stored in ChromaDB. At query time, the system first retrieves the chunks that are semantically closest to the question, then passes them to LLaMA 3 as context to generate the answer. Key components include LangChain (pipeline construction), ChromaDB (lightweight vector database), Sentence Transformers (embedding model), and LLaMA 3 served through the Groq platform (generation).
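The retrieve-then-generate flow described above can be sketched with the standard library alone. This is a minimal illustration, not the project's code: `chunk_text`, `embed`, `retrieve`, and `build_prompt` are hypothetical names, the bag-of-words `embed` is a toy stand-in for a Sentence Transformers model, the in-memory list stands in for ChromaDB, and the chunk sizes are arbitrary assumptions.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows of `size` that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks, start, step = [], 0, size - overlap
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


def embed(text: str) -> Counter:
    """Toy embedding (assumption): word counts instead of a dense vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    query = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(query, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble the text that would be sent to the LLM."""
    context = "\n---\n".join(context_chunks)
    return (
        "Answer using only the context below. "
        "If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


if __name__ == "__main__":
    # Made-up sample sentences, not taken from any budget document.
    store = [
        "Education spending rises in the new budget.",
        "VAT rate remains unchanged.",
        "Infrastructure allocation for roads.",
    ]
    question = "What happened to the VAT rate?"
    print(build_prompt(question, retrieve(question, store, k=1)))
```

In the real system, the prompt would go to LLaMA 3 on Groq, and the similarity search would be delegated to ChromaDB rather than computed in a Python loop.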

Section 04

Functional Features and Use Cases

The system supports cross-year budget comparison (e.g., changes in education expenditure), VAT policy tracking, fund allocation analysis (infrastructure, healthcare, education, and so on), and budget trend summaries. It is aimed at journalists, researchers, policy analysts, and ordinary citizens, and it is faster than paging through the PDFs by hand.
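The article does not say how cross-year comparison is implemented. One plausible approach, sketched below under that caveat, is to tag each chunk with its budget year at ingestion time and then give the model a separate block of excerpts per year. The `year` and `text` keys, the function names, and the assumption that chunks arrive already ranked by relevance are all hypothetical.

```python
from collections import defaultdict


def group_by_year(chunks: list[dict], years: list[int], per_year: int = 2) -> dict[int, list[str]]:
    """Keep up to `per_year` excerpts for each requested year.

    Assumes `chunks` is already sorted from most to least relevant and that
    every chunk is a dict with hypothetical keys "year" and "text".
    """
    grouped: dict[int, list[str]] = defaultdict(list)
    for chunk in chunks:
        year = chunk["year"]
        if year in years and len(grouped[year]) < per_year:
            grouped[year].append(chunk["text"])
    # Drop years for which nothing was retrieved.
    return {year: grouped[year] for year in years if grouped[year]}


def comparison_prompt(question: str, grouped: dict[int, list[str]]) -> str:
    """Build a prompt with one labelled block of excerpts per budget year."""
    sections = [
        f"[{year} budget]\n" + "\n".join(texts)
        for year, texts in sorted(grouped.items())
    ]
    return (
        "Compare the years using only the excerpts below.\n\n"
        + "\n\n".join(sections)
        + f"\n\nQuestion: {question}"
    )
```

Labelling each block with its year keeps the model from mixing figures across documents, which is the main risk when several years share near-identical wording.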

Section 05

Code Structure and Implementation Details

The code is modular: src/chain.py (main RAG pipeline), src/ingest.py (PDF processing), src/vectorstore.py (embedding and vector database), and src/llm.py (LLM calls). The data/ directory stores the original PDFs and db/ stores the vector database. The Groq API key is loaded with python-dotenv so that it is never hardcoded in the source.
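Loading the key from the environment might look like the sketch below. The variable name `GROQ_API_KEY` and the helper `get_groq_api_key` are assumptions, since the article does not show the project's code; `load_dotenv` is the python-dotenv function that reads a `.env` file into the process environment. The import is guarded so the snippet also runs where python-dotenv is not installed.

```python
import os
from typing import Mapping, Optional

try:
    from dotenv import load_dotenv  # provided by the python-dotenv package
except ImportError:  # fall back to plain environment variables
    load_dotenv = None


def get_groq_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the Groq API key or raise a clear error if it is missing.

    `env` lets callers pass a mapping explicitly (useful in tests); by default
    the process environment is used, after loading a .env file if possible.
    """
    if env is None:
        if load_dotenv is not None:
            load_dotenv()  # does not override variables that are already set
        env = os.environ
    key = env.get("GROQ_API_KEY", "").strip()  # variable name is an assumption
    if not key:
        raise RuntimeError("GROQ_API_KEY is not set; add it to .env or the environment")
    return key
```

Failing fast with an explicit message is friendlier than letting the first LLM request fail with an authentication error.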

Section 06

Deployment and Usage Guide

Deployment steps: clone the repository → create a virtual environment → install dependencies → place the PDFs in data/ → configure the Groq API key → run python -m src.chain to start the interactive interface. The vector index is built automatically on the first run, and subsequent queries are fast.
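The build-on-first-run behaviour can be approximated by checking whether the database directory already has content. The sketch below is an assumption about how such a check could work, not the project's actual logic; `ensure_index` and its `build` callback are hypothetical names, and the callback stands in for the real ingestion step.

```python
from pathlib import Path
from typing import Callable


def ensure_index(
    data_dir: Path,
    db_dir: Path,
    build: Callable[[list[Path], Path], None],
) -> bool:
    """Build the vector index only when db_dir is missing or empty.

    Returns True if a build ran, False if an existing index was reused.
    `build` is a hypothetical callback that ingests the PDFs into db_dir.
    """
    if db_dir.is_dir() and any(db_dir.iterdir()):
        return False  # an index already exists, so skip the slow step
    pdfs = sorted(data_dir.glob("*.pdf"))
    if not pdfs:
        raise FileNotFoundError(f"no PDF files found in {data_dir}")
    db_dir.mkdir(parents=True, exist_ok=True)
    build(pdfs, db_dir)
    return True
```

A check like this does not notice PDFs added after the first run; forcing a rebuild would then require deleting db/ or adding a comparison against the indexed file list.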

Section 07

Technical Insights and Promotion Value

The approach generalizes well and can be adapted to documents in other domains, such as corporate financial reports or legal provisions. It gives developers a complete RAG reference implementation and gives governments a concrete way to improve transparency. Possible future features include multilingual support, table processing, citation tracing, and a web interface.