# LLM Paraphrase Evaluation: A Study on Answer Consistency of Large Language Models in Multiple-Choice Questions

> This project systematically evaluates how consistently large language models (LLMs) answer paraphrased questions, using natural language inference (NLI) to filter the paraphrases and multiple-choice commonsense question answering as the test task.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-09T18:41:03.000Z
- Last activity: 2026-04-09T18:55:42.879Z
- Popularity: 159.8
- Keywords: LLM evaluation, paraphrase consistency, natural language inference, multiple-choice question answering, model robustness, commonsense reasoning, AI safety, large language models
- Page link: https://www.zingnex.cn/en/forum/thread/llm-3a93e99e
- Canonical: https://www.zingnex.cn/forum/thread/llm-3a93e99e

---

## [Introduction] Core Overview of LLM Paraphrase Consistency Research

This study evaluates the paraphrase consistency of large language models (LLMs) on multiple-choice commonsense question answering. It uses natural language inference (NLI) to keep only semantically equivalent paraphrases of each question, then analyzes whether a model's answers stay the same when the wording changes. The goal is to characterize the current state of model robustness and to provide empirical evidence and methodological support for improving the reliability of AI systems, guiding practical applications such as education and healthcare, and advancing AI safety and alignment.

## Research Background and Motivation: Why Focus on Paraphrase Consistency?

Large language models perform well on natural language processing tasks, but robustness and consistency remain key challenges. The core question is whether a model keeps the same answer when a question is reworded but its meaning is unchanged. This matters in real-world applications: a model that gives different answers to equivalent questions is hard to rely on, and the consequences can be severe in high-stakes fields such as education, healthcare, and law.

## Research Objectives and Evaluation Framework

### Core Research Questions
1. What proportion of paraphrased questions receive the same answer as the original?
2. How does consistency differ across models?
3. Which types of paraphrase most often lead to inconsistent answers?
4. How effective is NLI filtering at removing paraphrases that are not semantically equivalent?

### Evaluation Framework
1. **Data Preparation**: Select multiple-choice commonsense question answering datasets;
2. **Paraphrase Generation**: Use LLMs to generate diverse paraphrases of the original questions;
3. **NLI Filtering**: Keep only the paraphrases that are semantically equivalent to the original;
4. **Model Inference**: Have the target models answer both the original and the paraphrased questions;
5. **Consistency Evaluation**: Calculate answer consistency metrics.
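
The final step can be sketched as follows. This is a minimal illustration rather than the project's actual code: the `ItemResult` structure and both metric functions are hypothetical names, and the source does not specify the exact metric definitions. The sketch assumes each question has one answer for its original wording and one answer per retained paraphrase.

```python
from dataclasses import dataclass


@dataclass
class ItemResult:
    """Answers one model gave for one question and its retained paraphrases (hypothetical structure)."""
    question_id: str
    original_answer: str           # option label chosen for the original wording, e.g. "B"
    paraphrase_answers: list[str]  # option label chosen for each retained paraphrase


def consistency_rate(results: list[ItemResult]) -> float:
    """Fraction of (original, paraphrase) pairs where the chosen option is unchanged."""
    total = sum(len(r.paraphrase_answers) for r in results)
    if total == 0:
        raise ValueError("no paraphrase answers to score")
    matches = sum(
        answer == r.original_answer
        for r in results
        for answer in r.paraphrase_answers
    )
    return matches / total


def fully_consistent_share(results: list[ItemResult]) -> float:
    """Fraction of questions whose answer never changes across all paraphrases."""
    scored = [r for r in results if r.paraphrase_answers]
    if not scored:
        raise ValueError("no paraphrase answers to score")
    stable = sum(
        all(a == r.original_answer for a in r.paraphrase_answers) for r in scored
    )
    return stable / len(scored)


results = [
    ItemResult("q1", "A", ["A", "A", "B"]),
    ItemResult("q2", "C", ["C", "C"]),
]
print(consistency_rate(results))        # 0.8 (4 of 5 pairs agree)
print(fully_consistent_share(results))  # 0.5 (1 of 2 questions is fully stable)
```

Reporting both numbers is a design choice: the pair-level rate rewards partial stability, while the question-level share shows how often a single reworded variant is enough to flip the answer.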

## Technical Implementation and Toolchain Details

### Project Structure (Jupyter Notebooks)
- `01_setup_and_data.ipynb`: Environment configuration and data loading;
- `02_paraphrase_generation.ipynb`: Paraphrase generation and saving;
- `03_NLI_filtering.ipynb`: Filtering for semantically equivalent paraphrases;
- `04_llm_inference.ipynb`: Model inference and answer recording;
- `05_evaluation_and_plots.ipynb`: Metric calculation and visualization.

### Key Technical Components
- **NLI**: Judges the entailment relationship between each paraphrase and its original question, and retains only equivalent paraphrases;
- **Multiple-choice Question Answering**: The standardized format makes quantitative evaluation and cross-model comparison straightforward;
- **Consistency Metrics**: For example, the answer-selection consistency rate and the change in confidence.
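
One common way to turn an entailment judgment into an equivalence check is to require entailment in both directions. The sketch below assumes that approach; the source does not say whether the project checks one direction or both. The `NliLabeler` interface, the function names, and the `toy_nli` lookup table are all hypothetical stand-ins, and a real run would replace `toy_nli` with a call to an actual NLI model.

```python
from typing import Callable

# Hypothetical interface: takes (premise, hypothesis) and returns one of
# "entailment", "neutral", or "contradiction".
NliLabeler = Callable[[str, str], str]


def is_equivalent(original: str, paraphrase: str, nli: NliLabeler) -> bool:
    """Treat two questions as equivalent only if each one entails the other."""
    return (
        nli(original, paraphrase) == "entailment"
        and nli(paraphrase, original) == "entailment"
    )


def filter_paraphrases(original: str, candidates: list[str], nli: NliLabeler) -> list[str]:
    """Keep the candidates that pass the bidirectional entailment check."""
    return [c for c in candidates if is_equivalent(original, c, nli)]


def toy_nli(premise: str, hypothesis: str) -> str:
    """Stand-in labeler backed by a fixed lookup table, for demonstration only."""
    entailed_pairs = {
        ("Where do fish live?", "In what place do fish live?"),
        ("In what place do fish live?", "Where do fish live?"),
        # Entailment in one direction only: the paraphrase asks something weaker.
        ("Where do fish live?", "Do fish live somewhere?"),
    }
    return "entailment" if (premise, hypothesis) in entailed_pairs else "neutral"


kept = filter_paraphrases(
    "Where do fish live?",
    ["In what place do fish live?", "Do fish live somewhere?"],
    toy_nli,
)
print(kept)  # ['In what place do fish live?']
```

The second candidate is dropped because it is entailed by the original but does not entail it back, which is the kind of meaning drift a one-directional check would let through.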

## Research Findings and Methodological Contributions

### Research Findings and Significance
Paraphrase consistency is an important indicator of LLM robustness. It reflects whether a model understands what a question is asking rather than memorizing its surface wording. The practical implications include:
- **Model Selection**: Treat consistency as a key evaluation dimension;
- **Prompt Engineering**: Design more robust prompt strategies;
- **Answer Verification**: Verify stability through multiple paraphrased versions;
- **Model Improvement**: Guide training and fine-tuning directions.
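
The answer verification idea can be sketched as a majority vote over the answers a model gives to several paraphrases of the same question. This is an illustrative sketch, not the project's method: the function name `verify_by_paraphrase` and the `min_agreement` threshold of 0.8 are assumptions chosen for the example.

```python
from collections import Counter


def verify_by_paraphrase(
    answers: list[str], min_agreement: float = 0.8
) -> tuple[str, float, bool]:
    """Return the majority answer, its share of the votes, and whether it clears the threshold."""
    if not answers:
        raise ValueError("need at least one answer")
    top_answer, votes = Counter(answers).most_common(1)[0]
    agreement = votes / len(answers)
    return top_answer, agreement, agreement >= min_agreement


# Answers from one model to the original question and four paraphrases.
print(verify_by_paraphrase(["B", "B", "B", "A", "B"]))  # ('B', 0.8, True)
```

An answer that fails the threshold can be flagged for review instead of being returned directly. The right threshold depends on the application and would need to be tuned.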

### Methodological Contributions
- Systematic evaluation process with reusable Notebook implementations;
- NLI filtering improves the reliability of paraphrase quality control;
- Clear code structure for easy reproduction and extension.

## Current Limitations and Future Research Directions

### Current Limitations
- **Dataset**: The study covers only commonsense question answering, not tasks such as mathematical reasoning or code generation;
- **Paraphrase Types**: Automatically generated paraphrases are limited in diversity and naturalness;
- **Model Coverage**: Because of API and resource constraints, not all mainstream models are covered.

### Future Directions
- Expand the evaluation to other tasks;
- Test adversarial paraphrases to probe the limits of model robustness;
- Explore training and fine-tuning techniques that improve consistency;
- Compare automatic evaluation results with human judgment.

## Implications for AI Safety and Alignment

Paraphrase consistency is closely related to AI safety and alignment:
- Models that are sensitive to wording may be exploited maliciously: an attacker can rephrase a request to induce inappropriate outputs;
- Inconsistency suggests that model decisions lack transparency, which hurts interpretability.

Improving consistency is therefore both a performance issue and a safety issue.

## Conclusion: Key Indicator for Reliable AI Systems

This project evaluates LLM paraphrase consistency systematically and provides both empirical results and a reusable toolchain. It is a reminder that basic properties such as robustness and consistency deserve attention alongside raw model performance. As LLMs are increasingly deployed in high-stakes fields, paraphrase consistency evaluation will become an important reference for building reliable and trustworthy AI systems.
