# Rosetta-PL: A New Benchmark for Building Reasoning Capabilities of Large Language Models Using Propositional Logic

> Researchers created the Rosetta-PL benchmark by translating logical propositions from the Lean theorem prover into a custom logical language, which is used to systematically evaluate the performance of large language models on formal reasoning tasks.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2025-03-25T00:00:00.000Z
- Last activity: 2026-05-04T16:20:40.019Z
- Popularity: 79.0
- Keywords: large language models, logical reasoning, benchmarking, Rosetta-PL, Lean, propositional logic, low-resource languages, model evaluation
- Page URL: https://www.zingnex.cn/en/forum/thread/rosetta-pl
- Canonical: https://www.zingnex.cn/forum/thread/rosetta-pl
- Markdown source: floors_fallback

---

## [Introduction] Rosetta-PL: A New Benchmark for Evaluating Logical Reasoning Capabilities of Large Language Models

The Rosetta-PL benchmark was built by translating logical propositions from the Lean theorem prover into a custom logical language, enabling a systematic evaluation of large language models on formal reasoning tasks. It reveals how models acquire logical rules, offering guidance for low-resource language applications and for optimizing model training.

## Research Background: Bottlenecks in LLM Reasoning Capabilities

Large language models (LLMs) are trained mainly on high-resource natural-language data, which limits both their performance in low-resource languages and their capacity for deep logical reasoning. Existing benchmarks also struggle to distinguish whether a model has mastered logical rules or is merely completing tasks through pattern matching.

## Core Methodology of Rosetta-PL

Rosetta-PL is constructed in four steps:

1. Select propositions from the Lean theorem prover as the data source (logically correct, structurally rich, with clear syntax and semantics).
2. Design a custom logical language whose symbols do not appear in pre-training corpora, eliminating reliance on memorization.
3. Translate the Lean propositions into the custom language to build the dataset.
4. Fine-tune LLMs (e.g., GPT-4o) on the dataset and evaluate their performance.
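The symbol-design and translation steps can be sketched as a mapping from Lean's propositional connectives onto custom glyphs. Note that `OPERATOR_MAP`, the glyphs themselves, and the `translate` helper are illustrative assumptions; the paper's actual custom language is not reproduced in this post.

```python
# Hypothetical sketch of steps 2-3: map Lean's propositional
# connectives onto custom glyphs unlikely to appear in pre-training
# corpora. The glyphs here are assumptions for illustration only.
OPERATOR_MAP = {
    "∧": "⨂",   # conjunction
    "∨": "⨁",   # disjunction
    "¬": "⊖",   # negation
    "→": "⊳",   # implication
    "↔": "⊲⊳",  # biconditional
}

def translate(proposition: str) -> str:
    """Character-level translation that leaves the logical structure intact."""
    return "".join(OPERATOR_MAP.get(ch, ch) for ch in proposition)

print(translate("(p ∧ q) → ¬r"))  # (p ⨂ q) ⊳ ⊖r
```

Because only the operator symbols change, the parenthesization and variable names (and hence the logical structure) of each proposition survive the translation unchanged.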

## Key Experimental Findings

The experiments yielded three key conclusions:

1. Preserving logical relationships during translation significantly improves reasoning accuracy, showing that models can learn abstract rules.
2. Gains from additional training data saturate after roughly 20,000 samples; data quality matters more than quantity.
3. Models perform well in an unfamiliar symbol system, demonstrating genuine logical generalization.
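"Preserving logical relationships" has a precise operational meaning: the original and translated formulas must be logically equivalent, i.e., agree under every truth assignment. A minimal sketch, modelling formulas as Python predicates (an illustrative assumption; the benchmark itself works on symbolic strings):

```python
from itertools import product

def equivalent(f, g, n_vars: int) -> bool:
    """Exhaustive truth-table comparison over n_vars Boolean variables."""
    return all(f(*vals) == g(*vals)
               for vals in product([False, True], repeat=n_vars))

# p → q rewritten as ¬(p ∧ ¬q): equivalence must survive rewriting.
implication = lambda p, q: (not p) or q
rewritten   = lambda p, q: not (p and not q)
print(equivalent(implication, rewritten, 2))   # True

# A translation that swapped conjunction for disjunction would fail.
print(equivalent(lambda p, q: p and q,
                 lambda p, q: p or q, 2))      # False
```

Exhaustive truth tables are feasible here because propositional formulas in such datasets typically involve only a handful of variables.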

## Implications for Low-Resource Language Applications

Rosetta-PL's findings carry several implications for low-resource languages:

1. Structured representations that preserve the underlying logical relationships help models learn.
2. Data quality and representation should be prioritized over sheer scale.
3. Models may be able to learn any structured symbol system, providing a foundation for low-resource language adaptation.

## Technical Details and Implementation Key Points

Key technical points include:

- Translation strategy: maintaining logical equivalence, mapping operators, preserving quantifiers, etc.
- Evaluation metrics: accuracy of truth-value judgment, correctness rate of logical-equivalence identification, etc.
- Fine-tuning strategy: learning-rate scheduling, regularization, few-shot example design, etc.
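The truth-value-judgment metric reduces to the share of propositions whose predicted truth value matches the gold label. A sketch, where the record schema (`prediction`/`gold` fields) is an assumption since the paper's exact format is not given here:

```python
# Hedged sketch of the truth-value-judgment accuracy metric.
# The example-record schema below is an illustrative assumption.

def truth_value_accuracy(examples: list[dict]) -> float:
    """Fraction of examples whose predicted truth value matches the gold label."""
    if not examples:
        return 0.0
    correct = sum(ex["prediction"] == ex["gold"] for ex in examples)
    return correct / len(examples)

evalset = [
    {"prediction": True,  "gold": True},
    {"prediction": False, "gold": True},   # one miss
    {"prediction": False, "gold": False},
    {"prediction": True,  "gold": True},
]
print(truth_value_accuracy(evalset))  # 0.75
```

The logical-equivalence-identification metric would follow the same pattern, with each gold label stating whether a pair of formulas is equivalent.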

## Limitations and Future Research Directions

Limitations:

1. Coverage is currently restricted to propositional logic.
2. A gap remains between the benchmark and real-world natural-language reasoning.
3. Experiments are based mainly on GPT-4o.

Future directions include multimodal logical reasoning, incremental learning, interpretability analysis, and cross-domain transfer.
