Rosetta-PL: A New Benchmark for Building Reasoning Capabilities of Large Language Models Using Propositional Logic

Large Language Models · Logical Reasoning · Benchmark · Rosetta-PL · Lean · Propositional Logic · Low-Resource Languages · Model Evaluation
Published 2025-03-25 08:00 · Recent activity 2026-05-05 00:20 · Estimated read 5 min

Section 01

[Introduction] Rosetta-PL: A New Benchmark for Evaluating Logical Reasoning Capabilities of Large Language Models

Researchers built the Rosetta-PL benchmark by translating logical propositions from the Lean theorem prover into a custom logical language, then used it to systematically evaluate large language models on formal reasoning tasks. The benchmark reveals how models learn logical rules, offering guidance for low-resource language applications and for optimizing model training.

Section 02

Research Background: Bottlenecks in LLM Reasoning Capabilities

Large language models (LLMs) are trained mainly on high-resource natural-language data, which limits their performance in low-resource languages and weakens their capacity for deep logical reasoning. Existing evaluation benchmarks also struggle to distinguish whether a model has mastered logical rules or is merely pattern-matching its way through tasks.

Section 03

Core Methodology of Rosetta-PL

Rosetta-PL is constructed in four steps: 1. Select propositions from the Lean theorem prover as the data source (advantages: guaranteed logical correctness, rich structure, and unambiguous syntax and semantics); 2. Design a custom logical language whose symbols do not appear in pre-training data, so performance cannot rely on memorization; 3. Translate the Lean propositions into the custom language to build the dataset (a translation sketch follows below); 4. Fine-tune LLMs (e.g., GPT-4o) on the dataset and evaluate their performance.
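
The paper's actual symbol alphabet and translation code are not reproduced here; the sketch below is a minimal, hypothetical illustration of step 3, assuming the translation is a token-level substitution that maps Lean's propositional connectives (∧, ∨, ¬, →, ↔) and variable names to invented glyphs while leaving parenthesization, and therefore the logical structure, untouched. SYMBOL_MAP, translate, and the κ variable names are all placeholder assumptions.

```python
# Hypothetical sketch of step 3: token-level translation of a Lean-style
# propositional formula into a custom symbol language. The real Rosetta-PL
# alphabet is not published here; these glyphs are placeholder assumptions
# chosen to be unlikely in pre-training corpora.
SYMBOL_MAP = {
    "∧": "⧓",  # conjunction
    "∨": "⧔",  # disjunction
    "¬": "⧕",  # negation
    "→": "⧖",  # implication
    "↔": "⧗",  # biconditional
}

def translate(lean_prop: str, var_map: dict) -> str:
    """Rewrite a formula symbol-by-symbol, preserving parenthesization so
    the underlying logical structure survives the translation."""
    out = []
    for ch in lean_prop:
        if ch in SYMBOL_MAP:
            out.append(SYMBOL_MAP[ch])
        elif ch.isalpha():
            out.append(var_map.get(ch, ch))  # rename propositional variables
        else:
            out.append(ch)  # parentheses and whitespace pass through
    return "".join(out)

if __name__ == "__main__":
    # Example source proposition: p → (q ∧ p)
    print(translate("p → (q ∧ p)", {"p": "κ1", "q": "κ2"}))
    # -> κ1 ⧖ (κ2 ⧓ κ1)
```

Because only the surface symbols change, a model that solves the translated tasks must have learned the structural rules rather than recalled familiar notation.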

Section 04

Key Experimental Findings

The experiments yielded three key conclusions: 1. Preserving logical relationships during translation significantly improves reasoning accuracy, showing that models can learn abstract rules rather than surface patterns; 2. Gains from additional training data saturate after roughly 20,000 samples, so data quality matters more than quantity; 3. Models perform well even in an unfamiliar symbol system, indicating genuine logical generalization.

Section 05

Implications for Low-Resource Language Applications

Rosetta-PL's findings carry several implications for low-resource languages: 1. Structured representations that preserve underlying logical relationships help models learn; 2. Data quality and the choice of representation should be prioritized over scale; 3. Models may be able to learn any sufficiently structured symbol system, providing a foundation for adapting them to low-resource languages.

Section 06

Technical Details and Implementation Key Points

Key technical points include: translation strategy (maintaining logical equivalence, mapping operators, preserving quantifiers, etc.); evaluation metrics (accuracy of truth-value judgments, rate of correctly identified logical equivalences, etc.); and fine-tuning strategy (learning-rate scheduling, regularization, few-shot example design, etc.). A sketch of the two evaluation metrics follows below.
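
The paper's exact metric definitions are not reproduced here; the following is a minimal sketch assuming model outputs have already been parsed into booleans. truth_value_accuracy is plain accuracy over truth-value judgments, and equivalent is a brute-force truth-table grader one could use to score equivalence-identification answers; both function names are hypothetical.

```python
from itertools import product
from typing import Callable, Sequence

def truth_value_accuracy(preds: Sequence[bool], labels: Sequence[bool]) -> float:
    """Fraction of propositions whose truth value the model judged correctly."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def equivalent(f: Callable[..., bool], g: Callable[..., bool], n_vars: int) -> bool:
    """Brute-force truth-table check that two propositional formulas agree
    on every variable assignment -- a grader for equivalence answers."""
    return all(
        f(*vals) == g(*vals)
        for vals in product([False, True], repeat=n_vars)
    )

if __name__ == "__main__":
    # p → q is logically equivalent to ¬p ∨ q (material implication).
    implication = lambda p, q: q if p else True   # encodes p → q
    disjunction = lambda p, q: (not p) or q       # encodes ¬p ∨ q
    print(equivalent(implication, disjunction, 2))          # True
    print(truth_value_accuracy([True, False, True],
                               [True, True, True]))         # 0.666...
```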

Section 07

Limitations and Future Research Directions

Limitations: 1. The benchmark currently covers only propositional logic; 2. A gap remains between the benchmark and real-world natural-language reasoning; 3. Experiments were run mainly on GPT-4o. Future directions include multimodal logical reasoning, incremental learning, interpretability analysis, and cross-domain transfer.