Zing Forum

Evaluation of Japanese Bar Exam Writing Tasks: Expert Review of Large Language Models' Open Legal Reasoning Capabilities

The research team constructed the first dataset for evaluating LLMs' open-ended legal reasoning in the Japanese legal domain. Manual evaluation by legal experts reveals the limitations and hallucination problems of current large models in legal reasoning.

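Legal reasoning evaluation · Japanese Bar Exam · Open-ended Q&A · Hallucination analysis · Expert evaluation · Cross-legal traditions
Published 2026-04-26 22:15 · Recent activity 2026-04-28 09:59 · Estimated read 5 min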

Section 01

[Introduction] Expert Evaluation Study of LLM Open Legal Reasoning Capabilities from the Perspective of the Japanese Bar Exam

This study constructs the first dataset for evaluating LLMs' open-ended reasoning in the Japanese legal domain, built on the writing tasks of the Japanese Bar Exam. Manual evaluation by legal experts reveals two kinds of problems in current large models: reasoning limitations (such as incomplete issue identification and loose argument structure) and hallucinations (such as fictional precedents and incorrect citation of legal provisions). The study fills a gap in evaluating AI capabilities across legal traditions and offers a reference for developing safe, reliable legal AI.

Section 02

Research Background: Deficiencies in Legal AI Evaluation and the Unique Value of the Japanese Context

Current legal AI evaluations mostly rely on multiple-choice questions and rarely assess the open-ended reasoning that real legal practice requires. The Japanese legal system belongs to the civil law tradition, which differs significantly from the common law system, and its bar exam is highly difficult and tests comprehensive legal ability. No dataset for evaluating LLMs' open-ended reasoning in the Japanese legal context existed before this study, which fills that gap and provides data for comparisons across legal traditions.

Section 03

Research Methods: Dataset Construction and Expert Evaluation Process

The dataset is built from actual writing questions on the Japanese Bar Exam, which feature long case narratives, require identifying multiple legal issues, and demand structured arguments. The study invited experts with professional backgrounds in Japanese law to manually review the answers generated by LLMs. Manual review is costly, but it gives an accurate picture of what the models can actually do.

Section 04

Research Findings: Limitations and Hallucination Issues of LLMs in Legal Reasoning

The expert evaluations reveal three limitations in LLM legal reasoning: incomplete identification of legal issues (secondary issues are easily missed), loose argument structure (insufficient logical rigor), and incorrect application of legal knowledge (citing the wrong provisions or misunderstanding them). Hallucinations take three forms: fictional precedents, citation of repealed legal provisions, and over-inference from limited facts. Errors of this kind carry extremely high risk in legal settings.
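The error categories above could be recorded and tallied along the following lines. This is only a minimal sketch and not the study's actual annotation scheme: the names `ErrorType`, `Annotation`, `tally`, and `hallucination_share` are hypothetical, and the sample records are invented purely for illustration, not results from the study.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class ErrorType(Enum):
    # Reasoning limitations named in the findings
    MISSED_ISSUE = "incomplete issue identification"
    LOOSE_ARGUMENT = "loose argument structure"
    MISAPPLIED_LAW = "incorrect application of legal knowledge"
    # Hallucination types named in the findings
    FICTIONAL_PRECEDENT = "fictional precedent"
    REPEALED_PROVISION = "citation of repealed provision"
    OVER_INFERENCE = "over-inference from limited facts"


# Assumption: these three categories are treated as hallucinations.
HALLUCINATIONS = {
    ErrorType.FICTIONAL_PRECEDENT,
    ErrorType.REPEALED_PROVISION,
    ErrorType.OVER_INFERENCE,
}


@dataclass(frozen=True)
class Annotation:
    answer_id: str   # hypothetical identifier of one model answer
    error: ErrorType  # category assigned by the expert reviewer
    note: str = ""   # optional free-text comment


def tally(annotations):
    """Count annotations per error type."""
    return Counter(a.error for a in annotations)


def hallucination_share(annotations):
    """Fraction of annotations that are hallucinations (0.0 if empty)."""
    if not annotations:
        return 0.0
    flagged = sum(1 for a in annotations if a.error in HALLUCINATIONS)
    return flagged / len(annotations)


# Invented example records, for illustration only.
sample = [
    Annotation("answer-001", ErrorType.MISSED_ISSUE),
    Annotation("answer-001", ErrorType.FICTIONAL_PRECEDENT),
    Annotation("answer-002", ErrorType.REPEALED_PROVISION),
    Annotation("answer-002", ErrorType.LOOSE_ARGUMENT),
]
counts = tally(sample)
share = hallucination_share(sample)
```

Keeping reasoning errors and hallucinations in one enumeration, with a separate set marking the hallucination types, lets a reviewer report both an overall breakdown and a hallucination rate from the same records.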

Section 05

Conclusions and Implications: Directions for Legal AI Development and Reflections on Legal Education

The study has four implications for legal AI. Evaluation systems should add open-ended reasoning tasks. Legal AI applications should be limited to auxiliary roles, with major decisions left to the judgment of human lawyers. Transferring models across legal traditions requires targeted evaluation. Hallucinations should be treated as a priority problem. For legal education, LLM performance shows that the models have not yet mastered legal thinking, which suggests that legal education should emphasize the cultivation of comprehensive capabilities.

Section 06

Research Limitations and Future Research Directions

The study has three limitations: the sample size is limited, model coverage is incomplete (models fine-tuned on legal data are not sufficiently evaluated), and the dataset is static, so it cannot reflect ongoing changes in the law. Future work could expand the dataset, develop automated evaluation metrics, track how LLM capabilities evolve, and explore model architectures designed specifically for legal reasoning.