Zing Forum

Turkish Legal RAG System: A Complete Implementation Path from Baseline to Optimization

A Retrieval-Augmented Generation (RAG) question-answering system for the Turkish legal domain, taken step by step from a baseline to an optimized configuration through embedding model selection, re-ranking, prompt engineering, and QLoRA fine-tuning.

RAG, legal question answering, Turkish, embedding model, QLoRA, re-ranking, dense retrieval, large language model
Published 2026-05-26 23:13 | Recent activity 2026-05-26 23:19 | Estimated read 6 min

Section 01

Turkish Legal RAG System: A Complete Implementation Path from Baseline to Optimization (Opening Post: Introduction)

This project is a Retrieval-Augmented Generation (RAG) question-answering system for the Turkish legal domain. It aims to reduce the hallucinations that general-purpose large language models produce when answering legal questions. Through embedding model selection, re-ranking, prompt engineering, and QLoRA fine-tuning, it moves step by step from a baseline to an optimized system, cites the legal provisions its answers rest on, and serves as a practical reference for building vertical-domain RAG systems.

Section 02

Project Background and Motivation

Legal question answering demands rigor, has to handle dense terminology, and must ground every answer in official texts. General-purpose LLMs tend to generate unsupported content. The Turkish Legal RAG project builds an end-to-end pipeline tuned for Turkish legal corpora that combines dense retrieval with local LLM inference, so that answers stay accurate and traceable to their sources.

Section 03

Corpus Composition

The core corpus consists of seven fundamental Turkish laws: the Constitution, the Criminal Code, the Code of Criminal Procedure, the Civil Code, the Code of Obligations, the Code of Civil Procedure, and the Code of Administrative Procedure. Directories are reserved for parliamentary records from the Grand National Assembly of Turkey (TBMM) and for case law from the Court of Cassation (Yargıtay); both are currently empty and leave room for future expansion. The benchmark contains 175 test questions based on the seven laws.

Section 04

Technical Architecture and Progressive Optimization Path

Ablation experiments measure the contribution of each component. The optimization path has five stages:

  1. Baseline system: e5-base embeddings + Qwen2.5-3B-Instruct generation, establishing the reference point for all later comparisons;
  2. Embedding model upgrade: e5-base → e5-large, which increased MRR by 14.9%;
  3. Re-ranker: zero-shot deployment of BAAI/bge-reranker-v2-m3 as a second pass over the retrieval results;
  4. Prompt engineering: legal-scenario templates that enforce citation discipline and a "Dayanak" format specification;
  5. QLoRA fine-tuning: Qwen2.5-3B-Instruct trained on 112 examples for 3 epochs, which increased F1 by 14.6% and faithfulness by 15.9%.
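
Stage 2 is reported in terms of MRR (mean reciprocal rank). As a reminder of what that metric measures, here is a minimal sketch of the standard definition: for each question, take 1/rank of the first relevant passage in the ranked results (0 if none is retrieved) and average over all questions. This is not the project's evaluation code; the passage IDs and the three toy runs are hypothetical.

```python
from typing import Sequence


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: set[str]) -> float:
    """Return 1/rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[tuple[Sequence[str], set[str]]]) -> float:
    """Average the reciprocal rank over all benchmark questions."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


# Hypothetical retrieval output for three questions (all IDs are made up).
runs = [
    (["doc_a", "doc_b", "doc_c"], {"doc_a"}),  # first hit at rank 1 -> 1.0
    (["doc_d", "doc_e", "doc_f"], {"doc_f"}),  # first hit at rank 3 -> 1/3
    (["doc_g", "doc_h"], {"doc_z"}),           # no hit -> 0.0
]

print(round(mean_reciprocal_rank(runs), 4))  # 0.4444
```

Running the same function over the 175 benchmark questions before and after a change (for example, swapping the embedding model) is the kind of comparison an ablation table would be built from.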

Section 05

Analysis of Key Technical Details

  • Dense retrieval and FAISS: a FAISS vector index provides efficient similarity search; the text chunking strategy and the choice of embedding model both affect retrieval quality;
  • Cross-encoder re-ranking: BAAI/bge-reranker-v2-m3 scores each query and passage jointly, which captures fine-grained semantic relationships; it is applied only to the candidates returned by first-stage retrieval, which balances quality against cost;
  • QLoRA fine-tuning: 4-bit quantization plus low-rank adapters reduces memory requirements; the 112 examples cover multiple legal domains and question types; training is limited to 3 epochs to reduce the risk of overfitting.
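
The first two bullets describe a retrieve-then-re-rank flow. The sketch below shows that flow in plain Python under loud assumptions: it is not the project's code, the vectors and passages are toy data, and `overlap_score` is a hypothetical stand-in for a cross-encoder. In a real pipeline the brute-force inner-product search would be handled by a FAISS index over embeddings from the e5 model, and the scoring function by BAAI/bge-reranker-v2-m3.

```python
import math
from typing import Callable, Sequence


def normalize(vec: Sequence[float]) -> list[float]:
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dense_search(query_vec: Sequence[float],
                 doc_vecs: Sequence[Sequence[float]], k: int) -> list[int]:
    """Stage 1: exact inner-product search over unit vectors, returning top-k indices."""
    q = normalize(query_vec)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(d))), i)
        for i, d in enumerate(doc_vecs)
    ]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, index breaks ties
    return [i for _, i in scored[:k]]


def rerank(query: str, candidates: Sequence[int], passages: Sequence[str],
           score_fn: Callable[[str, str], float], top_n: int) -> list[int]:
    """Stage 2: rescore each (query, passage) pair jointly and keep the best top_n."""
    return sorted(candidates, key=lambda i: -score_fn(query, passages[i]))[:top_n]


def overlap_score(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder: share of query tokens present in the passage."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


# Toy data (hypothetical): three passages with made-up 3-dimensional embeddings.
passages = [
    "penalty for intentional homicide",
    "definition of theft and its penalty",
    "equality before the law",
]
doc_vecs = [[0.9, 0.1, 0.0], [0.7, 0.6, 0.1], [0.0, 0.2, 0.9]]
query, query_vec = "theft penalty", [0.8, 0.3, 0.0]

candidates = dense_search(query_vec, doc_vecs, k=2)
best = rerank(query, candidates, passages, overlap_score, top_n=1)
print(candidates, best)
```

In this toy run the first stage ranks passage 0 ahead of passage 1, and the second stage promotes passage 1 because it matches both query terms. That is the division of labor the bullets describe: a cheap search narrows the corpus, and a more expensive scorer reorders only the short candidate list.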

Section 06

Practical Significance and Insights

  1. Embedding model selection is crucial: the upgrade from e5-base to e5-large alone raised MRR by 14.9%;
  2. The re-ranker is cost-effective: deployed zero-shot, with no additional training, it can improve the quality of the retrieved results;
  3. Domain fine-tuning pays off: QLoRA fine-tuning raised answer faithfulness by 15.9%, which matters most in high-risk domains such as law;
  4. Prompt engineering should not be neglected: citation discipline and format specifications improve answer credibility and user experience.
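
To make the fourth point concrete, here is a minimal sketch of a prompt builder that enforces citation discipline. The post does not show the project's actual template, so the wording, the numbering scheme, and the layout of the "Dayanak" section (Turkish for "basis") are all assumptions, and the passages are placeholders rather than real legal text.

```python
from typing import Sequence


def build_prompt(question: str, passages: Sequence[tuple[str, str]]) -> str:
    """Assemble a grounded prompt from (source_label, text) pairs.

    Each passage is numbered so the model can cite it; the instructions
    restrict the answer to the supplied passages (hypothetical wording).
    """
    context = "\n".join(
        f"[{n}] {label}: {text}"
        for n, (label, text) in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the numbered passages below.\n"
        "If the passages do not contain the answer, say so.\n"
        "End with a section titled 'Dayanak:' listing the passage numbers "
        "and source labels you relied on.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder passages (not real statute text).
prompt = build_prompt(
    "What does the first provision say?",
    [
        ("Law A, Art. 1", "placeholder text of the first provision"),
        ("Law A, Art. 2", "placeholder text of the second provision"),
    ],
)
print(prompt)
```

Numbering the passages and requiring a fixed closing section makes the citations easy to check automatically, for example by verifying that every number listed under "Dayanak:" refers to a passage that was actually supplied.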

Section 07

Limitations and Future Directions

Limitations: the corpus covers only the seven fundamental laws and does not yet include case law or parliamentary legislative records. Future work: add case law and legislative records to the corpus. Generality: the methodology (progressive optimization, ablation experiments, configuration-driven design) can be reused across languages and domains.
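
One way to read "configuration-driven design" together with progressive optimization is that each stage is a small change to an immutable configuration object, so every ablation run differs from the previous one in exactly one setting. The sketch below illustrates that idea only; the `PipelineConfig` class, its field names, and the stage labels are hypothetical, while the model names are the ones given in the post.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical pipeline settings; one field per ablated component."""
    embedder: str = "e5-base"
    generator: str = "Qwen2.5-3B-Instruct"
    reranker: Optional[str] = None
    legal_prompt: bool = False
    qlora_adapter: bool = False


# Each stage copies the previous configuration and changes one field.
stages = {"1-baseline": PipelineConfig()}
stages["2-embedder"] = replace(stages["1-baseline"], embedder="e5-large")
stages["3-reranker"] = replace(stages["2-embedder"], reranker="BAAI/bge-reranker-v2-m3")
stages["4-prompt"] = replace(stages["3-reranker"], legal_prompt=True)
stages["5-qlora"] = replace(stages["4-prompt"], qlora_adapter=True)


def changed_fields(a: PipelineConfig, b: PipelineConfig) -> list[str]:
    """Names of the fields whose values differ between two configurations."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


names = list(stages)
for prev, cur in zip(names, names[1:]):
    print(cur, changed_fields(stages[prev], stages[cur]))
```

Because only one field changes per stage, any difference in a metric between two consecutive runs can be attributed to that component, and porting the setup to another language or domain mostly means swapping the model names and the corpus.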

Section 08

Project Summary

The Turkish Legal RAG project demonstrates the complete development process of a vertical-domain question-answering system. Each technical decision is backed by experimental data, which makes it a useful reference implementation for developers of domain-specific RAG systems.