
RespondeoQA: The First Latin-English Bilingual Question Answering Benchmark Dataset Released

RespondeoQA is the first question answering benchmark dataset focused on Latin, containing approximately 7,800 Latin-English bilingual question-answer pairs that span knowledge-based, skill-based, multi-hop reasoning, and translation-constrained question types. The research team evaluated LLaMa 3, Qwen QwQ, and OpenAI o3-mini and found that current large language models perform poorly on Latin skill-based questions, making the dataset a crucial resource for assessing model capability in this domain.

Tags: Latin · Question Answering Benchmark · Bilingual Dataset · Large Model Evaluation · Classical Languages · Natural Language Processing · LLaMa · Qwen · Low-Resource Languages
Published 2026-04-23 00:24 · Recent activity 2026-04-23 10:48 · Estimated read 5 min

Section 02

Background: The Neglected Status of Classical Languages in the AI Field

As a cornerstone of Western civilization, Latin continues to shape fields such as law, medicine, theology, and academic nomenclature. However, most existing natural language processing benchmarks focus on modern mainstream languages, and systematic evaluation of classical languages is almost non-existent.


Section 03

Dataset Construction Methods and Characteristics

RespondeoQA's construction has three main characteristics:

- Data sources: exam questions, knowledge-competition questions, and textbook content ranging from the 19th century to the present.
- Construction pipeline: three stages of quality control, namely automated extraction, data cleaning, and manual review.
- Question types: knowledge-based (vocabulary, grammar, history and culture), skill-based (poetic meter analysis, rhetoric recognition), multi-hop reasoning, translation-constrained, and mixed language pairs.
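A cleaning-and-review stage like the one described above can be sketched as a simple record filter. This is a minimal illustration only: the field names (`latin_q`, `english_q`, `answer`, `qtype`) and type labels are assumptions for the sketch, not RespondeoQA's actual schema.

```python
# Hypothetical sketch of the data-cleaning stage; field names and
# type labels are illustrative assumptions, not the dataset's schema.
from typing import Optional

VALID_TYPES = {"knowledge", "skill", "multi_hop", "translation", "mixed"}

def clean_pair(pair: dict) -> Optional[dict]:
    """Return a normalized QA pair, or None if it fails basic checks."""
    required = ("latin_q", "english_q", "answer", "qtype")
    if any(not str(pair.get(k, "")).strip() for k in required):
        return None                      # drop incomplete records
    if pair["qtype"] not in VALID_TYPES:
        return None                      # drop unknown question types
    return {k: str(pair[k]).strip() for k in required}

raw = [
    {"latin_q": "Quis scripsit Aeneidem?", "english_q": "Who wrote the Aeneid?",
     "answer": "Vergilius", "qtype": "knowledge"},
    {"latin_q": "", "english_q": "?", "answer": "x", "qtype": "knowledge"},  # incomplete
]
cleaned = [p for p in (clean_pair(r) for r in raw) if p is not None]
print(len(cleaned))  # 1
```

In a real pipeline, records that fail automated checks would then be routed to the manual-review stage rather than silently dropped.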


Section 04

Model Evaluation Results: Significant Underperformance on Skill-Based Questions

The research team evaluated three models: LLaMa 3, Qwen QwQ, and OpenAI o3-mini. All models perform markedly worse on skill-based questions than on knowledge-based ones. The reasoning models (QwQ and o3-mini) show some advantage in poetic meter analysis and rhetoric recognition, though the gains are limited. QwQ performs slightly better when questions are posed in Latin, while the performance of LLaMa 3 and o3-mini depends more on the task.
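Per-category scoring of the kind reported above can be sketched with a small accuracy tally. The records below are made-up examples for illustration; they are not the paper's actual results or scoring protocol.

```python
# Minimal sketch of per-question-type accuracy scoring; the example
# records and the exact-match criterion are illustrative assumptions.
from collections import defaultdict

def accuracy_by_type(records):
    """records: iterable of (qtype, gold_answer, model_prediction) triples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for qtype, gold, pred in records:
        totals[qtype] += 1
        hits[qtype] += int(pred.strip().lower() == gold.strip().lower())
    return {t: hits[t] / totals[t] for t in totals}

results = [
    ("knowledge", "Vergilius", "vergilius"),
    ("knowledge", "Roma", "Roma"),
    ("skill", "hexameter", "pentameter"),   # meter-analysis miss
    ("skill", "chiasmus", "chiasmus"),
]
print(accuracy_by_type(results))  # {'knowledge': 1.0, 'skill': 0.5}
```

Breaking accuracy out by question type is what makes the knowledge-based versus skill-based gap visible, rather than a single aggregate score.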


Section 05

Technical Significance and Academic Value of RespondeoQA

RespondeoQA fills the gap in classical-language question answering benchmarks, providing a standardized tool for evaluating models on a low-resource classical language. Its construction method can be transferred to other classical or endangered languages, supporting the preservation of linguistic diversity. The dataset can also serve as an auxiliary tool for Latin teaching, testing learners' mastery of the material, and it helps carry humanistic knowledge into the digital age.


Section 06

Limitations and Future Outlook

The current evaluation covers only three models, and the sample size is limited; the dataset's questions come mainly from teaching scenarios, with insufficient coverage of complex academic and literary settings. Future work could extend the evaluation to more open-source and closed-source models to build a comprehensive capability map, strengthen coverage of complex scenarios, and transfer the construction pipeline to other classical languages such as Ancient Greek and Sanskrit to build an integrated evaluation system.