THEMIS: A Parameterized Legal Reasoning Engine Tailored for Indian Law

THEMIS is a large language model fine-tuned specifically for Indian statutory law. It is not a retrieval system or chatbot wrapper, but a parameterized knowledge model that directly encodes legal knowledge into its model weights.

Tags: LLM, Legal AI, Indian Law, LoRA, Fine-Tuning, Domain-Specific Models, Parameterized Knowledge, Mistral, Legal Tech

Published 2026-06-11 02:12 | Recent activity 2026-06-11 02:19 | Estimated read 8 min

Section 01

THEMIS: A Parameterized Legal Reasoning Engine Tailored for Indian Law (Introduction)

THEMIS is a parameterized legal reasoning engine for the Indian legal domain, built by fine-tuning Mistral 7B Instruct v0.3 with LoRA. Unlike Retrieval-Augmented Generation (RAG) systems, it encodes legal knowledge directly in its model weights, aiming for lawyer-like reasoning rather than simple retrieval. This article covers the project's background, its technical architecture, what v1 achieved and where it falls short, the roadmap, and lessons for domain-specific models.

Section 02

Project Background and Motivation

Background

In the field of legal AI, most solutions rely on Retrieval-Augmented Generation (RAG), which searches for relevant legal provisions and injects them into prompts to generate answers. THEMIS takes a different approach: directly encoding Indian legal knowledge into neural network weights to build a parameterized knowledge model.

Naming and Positioning

The project is named after Themis, the Greek goddess of justice, who symbolizes law and order. The developers draw a clear line between THEMIS and retrieval systems such as HECTOR: HECTOR handles retrieval, while THEMIS handles reasoning, deriving answers from internalized knowledge rather than looking information up.

Section 03

Technical Architecture and Implementation Path

Base Model Selection

THEMIS v1 uses Mistral 7B Instruct v0.3 for its strong instruction-following ability and moderate parameter count, which make fine-tuning feasible on limited hardware such as a Kaggle T4 GPU.

LoRA Fine-Tuning Strategy

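THEMIS is fine-tuned with LoRA (Low-Rank Adaptation), which freezes the base weights and trains small low-rank matrices added to selected layers. v1 uses rank=8 and adapts only the q_proj and v_proj attention modules, which keeps training cheap and the resulting adapter small enough to distribute separately from the base model.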
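The low-rank update behind LoRA can be sketched in a few lines of plain Python. This is a toy illustration of the general technique, not THEMIS code: the matrix sizes, the `alpha` scaling value, and the helper names `matvec` and `lora_forward` are all assumptions chosen for readability.

```python
import random

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """Return W x + (alpha / r) * B (A x). W is frozen; only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))  # A is r x d_in, B is d_out x r
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d_in, d_out, r, alpha = 6, 4, 2, 4  # toy sizes; alpha is an assumed hyperparameter
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
B = [[0.0] * r for _ in range(d_out)]  # zero init, so the adapter starts as a no-op
x = [1.0] * d_in

y_base = matvec(W, x)
y_lora = lora_forward(W, A, B, x, alpha, r)

# Trainable values in the adapter versus the frozen matrix it wraps.
adapter_params = r * (d_in + d_out)
frozen_params = d_in * d_out
```

Because `B` starts at zero, the adapted layer initially reproduces the base layer exactly; training then moves only `A` and `B`. At toy sizes the adapter is barely smaller than `W`, but the saving grows quickly with real layer widths.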

Data Construction

v1 is trained on 1,939 Alpaca-style legal Q&A pairs covering core texts such as the Indian Penal Code (IPC), the Bharatiya Nyaya Sanhita 2023 (BNS), and the Bharatiya Nagarik Suraksha Sanhita (BNSS).
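For readers unfamiliar with the Alpaca format, the sketch below shows how one instruction/input/output record is typically rendered into a training string. The prompt wording is the commonly used Alpaca template; whether THEMIS uses this exact template is an assumption, and the sample record is illustrative rather than taken from the THEMIS dataset.

```python
# Commonly used Alpaca prompt wording (assumed here, not confirmed for THEMIS).
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)

def format_example(record):
    """Render one Alpaca-style record as a single prompt-plus-answer string."""
    template = PROMPT_WITH_INPUT if record.get("input") else PROMPT_NO_INPUT
    prompt = template.format(instruction=record["instruction"],
                             input=record.get("input", ""))
    return prompt + record["output"]

# Illustrative record, not from the THEMIS dataset.
record = {
    "instruction": "Which statute replaced the Indian Penal Code?",
    "input": "",
    "output": "The Bharatiya Nyaya Sanhita, 2023 (BNS) replaced the "
              "Indian Penal Code, 1860 (IPC).",
}

text = format_example(record)
```

Records with an empty `input` field fall back to the shorter template, so the model never sees an empty "### Input:" section.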

Section 04

v1 Version Achievements and Limitations

Achieved Capabilities

Instruction following: responds in the style of a legal assistant and organizes answers as requested
Disclaimers: automatically appends a legal disclaimer
Structured output: answers include citations and suggestions
Reproducible pipeline: the complete fine-tuning pipeline runs on a Kaggle T4, and the LoRA adapter is published on the HuggingFace Hub

Limitations

Misreads the abbreviation BNS
Hallucinates provision numbers, producing inaccurate citations
Shallow legal knowledge, a consequence of the small training set
No mapping from IPC provisions to their BNS counterparts

Root Causes

Mistral 7B's pre-training data ends before the 2023 BNS came into effect, so the base model has no prior knowledge of the BNS. LoRA fine-tuning taught the model to "speak like a lawyer" but did not fill that knowledge gap.

Section 05

Evolution Roadmap (v2 to v3 and Long-Term Vision)

v2 Improvement Goals

Parameter	v1 Value	v2 Target	Expected Benefit
LoRA rank	8	16	Greater expressive capacity
Attention modules	q_proj, v_proj	q_proj, k_proj, v_proj, o_proj	Capture richer features
Sequence length	512 tokens	1,024 tokens	Support longer texts
Training data	1,939 pairs	15,000 pairs	Support knowledge learning
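To put the rank and module changes in perspective, the following sketch estimates the number of trainable LoRA parameters under each configuration. The layer shapes are assumptions taken from the publicly documented Mistral 7B architecture (hidden size 4096, 32 layers, 1024-dimensional key/value projections from grouped-query attention), not from the THEMIS source, and `lora_trainable_params` is a hypothetical helper.

```python
# Assumed Mistral 7B shapes (public model config, not stated in the THEMIS write-up).
HIDDEN = 4096
KV_DIM = 1024      # 8 key/value heads x 128 dims (grouped-query attention)
NUM_LAYERS = 32

PROJ_SHAPES = {    # module name -> (in_features, out_features)
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}

def lora_trainable_params(rank, modules):
    """Each adapted module adds rank * (in + out) trainable values per layer."""
    per_layer = sum(rank * sum(PROJ_SHAPES[m]) for m in modules)
    return per_layer * NUM_LAYERS

v1_params = lora_trainable_params(8, ["q_proj", "v_proj"])
v2_params = lora_trainable_params(16, ["q_proj", "k_proj", "v_proj", "o_proj"])

print(f"v1: {v1_params:,} trainable parameters")
print(f"v2: {v2_params:,} trainable parameters ({v2_params / v1_params:.1f}x v1)")
```

Under these assumptions the v2 adapter has about four times as many trainable parameters as v1, while still being a tiny fraction of the roughly 7 billion frozen base weights.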

Success criteria: the model correctly recognizes BNS, and at least 70% of the provision citations in its answers to criminal-law queries are accurate.
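One way such a citation-accuracy criterion could be measured is sketched below. The source does not describe THEMIS's evaluation method, so the regular expression, the helper names, and the two sample answers are all hypothetical; the metric here is simply the share of cited provisions that appear in a reference set for the same query.

```python
import re

# Matches forms such as "Section 103 of the BNS" or "Section 302 IPC" (assumed formats).
CITE = re.compile(r"Section\s+(\d+[A-Z]?)\s+(?:of\s+the\s+)?(BNSS|BNS|IPC)\b")

def extract_citations(text):
    """Return the set of (act, section) pairs cited in a model answer."""
    return {(act, sec) for sec, act in CITE.findall(text)}

def citation_accuracy(answers, references):
    """Fraction of cited provisions that appear in the per-query reference set."""
    cited = correct = 0
    for answer, ref in zip(answers, references):
        found = extract_citations(answer)
        cited += len(found)
        correct += len(found & ref)
    return correct / cited if cited else 0.0

# Toy evaluation data for illustration only; "Section 999" stands in for a hallucination.
answers = [
    "Murder is punishable under Section 103 of the BNS.",
    "See Section 302 IPC and Section 999 of the BNS.",
]
references = [
    {("BNS", "103")},
    {("IPC", "302"), ("BNS", "103")},
]

accuracy = citation_accuracy(answers, references)
meets_v2_target = accuracy >= 0.70
```

On this toy data two of the three extracted citations match a reference, so the run would fall short of the 70% target. A real evaluation would also need to handle sub-sections and other citation styles.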

v3 Vision

v3 plans to train on 74,000 pairs covering the BNS, the IPC, Supreme Court judgments, and other sources. Its targets are LoRA rank=32, a sequence length of 2,048 tokens, citation accuracy above 85%, and a hallucination rate below 10%.

Long-Term Architecture

The long-term plan is to integrate THEMIS (reasoning) with HECTOR (retrieval): User query → Query classifier ("Parameterized reasoning or retrieval-augmented?") → Call THEMIS or HECTOR → Unified answer (including citations and reasoning).
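The routing flow described above could look roughly like the following sketch. Everything here is hypothetical: the source does not say how the query classifier works, so a simple keyword heuristic stands in for it, and `themis_answer` and `hector_answer` are placeholder stubs rather than real interfaces to either system.

```python
from dataclasses import dataclass

@dataclass
class Answer:
    engine: str  # which system produced the answer
    text: str    # answer body, including citations and reasoning

# Assumed cue phrases that suggest the user wants a specific document or verbatim text.
RETRIEVAL_CUES = ("full text", "verbatim", "exact wording", "latest amendment")

def classify(query: str) -> str:
    """Stand-in classifier: 'retrieval' if any cue phrase appears, else 'parameterized'."""
    q = query.lower()
    return "retrieval" if any(cue in q for cue in RETRIEVAL_CUES) else "parameterized"

def themis_answer(query: str) -> Answer:
    """Placeholder for a call to the fine-tuned reasoning model."""
    return Answer("THEMIS", f"[reasoned answer to: {query}]")

def hector_answer(query: str) -> Answer:
    """Placeholder for a call to the retrieval pipeline."""
    return Answer("HECTOR", f"[retrieved passages for: {query}]")

def route(query: str) -> Answer:
    """Send the query to one engine and return its answer in a unified shape."""
    handler = hector_answer if classify(query) == "retrieval" else themis_answer
    return handler(query)

result = route("Quote the exact wording of the bail provision.")
```

Returning the same `Answer` shape from both engines is what lets the caller present a unified response regardless of which path was taken. In practice the classifier would more likely be a trained model than a keyword list.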

Section 06

Insights for Domain-Specific LLM Development

Pre-training knowledge gap: if the target domain's knowledge emerged after the base model's pre-training cutoff, lightweight fine-tuning alone is unlikely to fill the gap; larger-scale domain pre-training or retrieval is needed to supplement it.
Data scale threshold: "learning to speak" and "learning knowledge" require different data volumes; the training set grows roughly 38-fold from v1 (1,939 pairs) to the planned v3 (74,000 pairs).
Parameterized vs. retrieval trade-off: parameterized reasoning is fast and fluent but prone to hallucination, while retrieval augmentation is verifiable but more complex to build. Future systems will need to combine the two intelligently.

Section 07

Conclusion

THEMIS is an ambitious and transparent project: v1 makes both its capability boundaries and its path to improvement clear. It offers a valuable case study for legal AI, domain-specific LLMs, and responsible AI development. Beyond a collection of code and models, it serves as a practical guide to adapting general-purpose models to professional domains. We look forward to v2 and v3 unlocking more of the potential of parameterized legal reasoning.