KBQA-R1: Using Reinforcement Learning to Make Large Language Models Better at Knowledge Base Question Answering

KBQA-R1 is a reinforcement learning framework for knowledge base question answering (KBQA). It models KBQA as a multi-turn Markov Decision Process (MDP) and trains the policy with Group Relative Policy Optimization (GRPO), reporting significant improvements on the WebQSP and GrailQA datasets.

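Tags: KBQA, reinforcement learning, large language models, knowledge base question answering, GRPO, Markov decision process, natural language processing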

Published 2026-06-02 20:45 | Recent activity 2026-06-02 20:48 | Estimated read 7 min

Section 01

[Introduction] KBQA-R1: A Reinforcement Learning Framework That Equips Large Language Models for Knowledge Base Question Answering

KBQA-R1 is a reinforcement learning framework for knowledge base question answering (KBQA). It models KBQA as a multi-turn Markov Decision Process (MDP) and trains the policy with Group Relative Policy Optimization (GRPO), reporting significant improvements on the WebQSP and GrailQA datasets. Its key contributions are an action-centric design, Reference Rejection Sampling (RRS) for data synthesis, and a four-stage training pipeline, which together offer a new way for large language models (LLMs) to interact with external knowledge bases.

Section 02

Background: Existing Challenges in Knowledge Base Question Answering and Dilemmas of LLM Applications

Knowledge Base Question Answering (KBQA) aims to answer natural language questions over structured knowledge bases. Traditional methods work in two steps: semantic parsing turns the question into a query, and the query is then executed against the knowledge base. Applying LLMs directly to KBQA faces two major challenges: first, knowledge bases are too large to fit into the context window; second, complex questions require multi-step reasoning, which single-turn generation handles poorly.

Section 03

Core Methods: MDP Modeling, Action Design, and GRPO Optimization

MDP Modeling

KBQA-R1 formulates KBQA as a multi-turn MDP and optimizes the reasoning policy via reinforcement learning, without requiring manually annotated intermediate steps.

Action Space

Seven types of actions are designed: Find_Relation (find entity relationships), Merge (merge results), Order (sort), Compare (attribute comparison), Time_Constraint (time constraint), Count (count), and Finish (return answer), supporting multi-step reasoning.
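As a rough illustration of how the seven actions and the multi-turn episode could fit together, here is a minimal Python sketch. Only the action names come from the text; the `Name(argument)` string format, the `Step` and `Episode` classes, the `max_turns` limit, and the `policy` callable are hypothetical, and the knowledge base observations returned after each action are omitted for brevity.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    """The seven action types named in the text."""
    FIND_RELATION = "Find_Relation"
    MERGE = "Merge"
    ORDER = "Order"
    COMPARE = "Compare"
    TIME_CONSTRAINT = "Time_Constraint"
    COUNT = "Count"
    FINISH = "Finish"


@dataclass
class Step:
    action: Action
    argument: str  # hypothetical free-form argument string


@dataclass
class Episode:
    question: str
    steps: list = field(default_factory=list)

    @property
    def done(self) -> bool:
        # An episode terminates once the policy emits Finish.
        return bool(self.steps) and self.steps[-1].action is Action.FINISH


def parse_step(text: str) -> Step:
    """Parse a hypothetical 'Name(argument)' string into a Step."""
    name, _, rest = text.strip().partition("(")
    if not rest.endswith(")"):
        raise ValueError(f"malformed action: {text!r}")
    # Action(...) raises ValueError for names outside the seven actions.
    return Step(Action(name.strip()), rest[:-1].strip())


def rollout(question, policy, max_turns=8):
    """Run one multi-turn episode; `policy` maps an Episode to an action string."""
    episode = Episode(question)
    for _ in range(max_turns):
        episode.steps.append(parse_step(policy(episode)))
        if episode.done:
            break
    return episode
```

In this sketch the state is simply the question plus the steps taken so far. A fuller environment would also execute each action against the knowledge base and feed the result back to the model before the next turn.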

RRS Data Synthesis

A stronger model (e.g., Qwen2.5-72B) generates candidate trajectories, execution verification filters out the incorrect ones, and the surviving trajectories provide high-quality data for supervised fine-tuning.
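The filtering step can be sketched as follows. This is a simplified illustration rather than the project's implementation: the `execute` callable, the exact-set-match acceptance criterion, and the `max_keep` parameter are all assumptions.

```python
def rejection_sample(candidates, execute, gold_answers, max_keep=1):
    """Keep candidate trajectories whose executed result matches the gold answers.

    `execute` is a hypothetical callable that runs a trajectory against the
    knowledge base and returns an iterable of answers.
    """
    gold = set(gold_answers)
    kept = []
    for trajectory in candidates:
        try:
            predicted = set(execute(trajectory))
        except Exception:
            # A trajectory that fails to execute is rejected.
            continue
        if predicted == gold:  # assumption: exact set match counts as correct
            kept.append(trajectory)
            if len(kept) >= max_keep:
                break
    return kept
```

Trajectories that survive this check would then form the supervised fine-tuning set.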

GRPO Optimization

KBQA-R1 adopts GRPO, which does not require a separate value-function network. It estimates advantages from the relative rewards of samples within the same group, which reduces training instability. Rewards are based on the correctness of the final answer.
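The group-relative advantage estimate can be sketched in a few lines of Python. The normalization shown (subtract the group mean, divide by the group standard deviation) is the standard GRPO form. The binary exact-match reward is an assumption, since the text only says that rewards depend on final-answer correctness, and the function names are hypothetical.

```python
import math


def answer_reward(predicted, gold):
    """Outcome reward. Assumption: 1.0 on exact set match, else 0.0."""
    return 1.0 if set(predicted) == set(gold) else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples for the same question."""
    n = len(rewards)
    if n == 0:
        return []
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]
```

When every trajectory in a group receives the same reward, all advantages are zero, so that group contributes no gradient signal. This is one reason several trajectories are sampled per question.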

Four-Stage Training

1. Rejection sampling data preparation: add action prompts, generate candidate trajectories, and filter them.
2. Supervised fine-tuning (SFT): fine-tune Llama-3.1-8B-Instruct on the filtered data.
3. GRPO reinforcement learning: optimize the policy.
4. Evaluation and deployment: evaluate on standard datasets and provide a Hugging Face repository.

Section 04

Experimental Environment and Deployment Details

Computational Resources

Requires 8× NVIDIA A100 or H100 GPUs (80 GB VRAM).

Dependencies

Python 3.10+, PyTorch 2.0+.

Knowledge Base

Uses Freebase, served through a SPARQL endpoint on the Virtuoso engine. The project offers a database download of more than 53 GB and a configuration guide.
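To give a sense of what talking to such an endpoint looks like, here is a hedged Python sketch using only the standard library. The endpoint URL assumes a local Virtuoso instance on its default port, the entity ID passed in is up to the caller, and both function names are hypothetical; none of this is taken from the project's code. The query lists the relations leaving an entity, which is the kind of lookup a Find_Relation-style action might issue.

```python
import json
import urllib.parse
import urllib.request

FREEBASE_NS = "http://rdf.freebase.com/ns/"
# Assumption: a local Virtuoso instance on its default port.
ENDPOINT = "http://localhost:8890/sparql"


def outgoing_relations_query(mid: str, limit: int = 100) -> str:
    """Build a SPARQL query listing distinct relations leaving a Freebase entity."""
    return (
        f"PREFIX ns: <{FREEBASE_NS}>\n"
        f"SELECT DISTINCT ?rel WHERE {{ ns:{mid} ?rel ?obj . }} LIMIT {int(limit)}"
    )


def run_query(query: str, endpoint: str = ENDPOINT, timeout: float = 30.0):
    """Send the query via HTTP GET and return the JSON result bindings."""
    url = f"{endpoint}?{urllib.parse.urlencode({'query': query})}"
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["results"]["bindings"]
```

`run_query` needs a running endpoint with Freebase loaded, while `outgoing_relations_query` can be inspected offline.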

Reproduction and Deployment

The project provides a complete code implementation, the training pipeline, and a Hugging Face model repository, making it easy for researchers to reproduce and use directly.

Section 05

Practical Significance and Application Prospects

KBQA-R1 demonstrates a new paradigm of "reinforcement learning enabling LLMs to interact with external knowledge bases". Its significance lies in:

Improving benchmark test scores;
Expanding application scenarios: enterprise knowledge management (querying internal knowledge graphs), medical question answering (combining medical knowledge bases), financial analysis (extracting insights from structured financial data);
Compared to Retrieval-Augmented Generation (RAG), it is better at handling complex multi-hop reasoning problems and is suitable for knowledge-intensive scenarios.

Section 06

Summary and Reflections: Progress, Limitations, and Future Directions

KBQA-R1 is an important advance in the KBQA field. It achieves significant improvements by combining reinforcement learning, action design, and GRPO optimization, and it provides complete code and a training pipeline, serving as a high-quality starting point for work on KBQA with reinforcement learning.

Limitations: High computational resource requirements, which limit participation by some researchers.

Future directions: Explore lightweight training schemes, or apply the method to other knowledge bases such as Wikidata.