Zing Forum


LLM Paraphrase Evaluation: A Study on Answer Consistency of Large Language Models in Multiple-Choice Questions

This project systematically evaluates the answer consistency of large language models (LLMs) on paraphrased multiple-choice commonsense questions, using natural language inference (NLI) to filter out non-equivalent paraphrases.

LLM evaluation, paraphrase consistency, natural language inference, multiple-choice question answering, model robustness, commonsense reasoning, AI safety, large language models
Published 2026-04-10 02:41 · Recent activity 2026-04-10 02:55 · Estimated read 8 min

Section 01

[Introduction] Core Overview of LLM Paraphrase Consistency Research

This study evaluates the paraphrase consistency of large language models (LLMs) on multiple-choice commonsense question answering tasks. Using natural language inference (NLI) to filter for semantically equivalent paraphrases of each question, it systematically analyzes whether models keep the same answer when the wording of a question changes. The research aims to characterize the current state of model robustness, providing empirical evidence and methodological support for improving the reliability of AI systems, informing practical deployments in domains such as education and healthcare, and advancing AI safety alignment.


Section 02

Research Background and Motivation: Why Focus on Paraphrase Consistency?

Large language models perform strongly on natural language processing tasks, but robustness and consistency remain key challenges. The core question is: can a model maintain the same answer when a question is rephrased without changing its meaning? This matters for real-world use — if a model gives different answers to equivalent questions, its reliability is seriously undermined, with potentially severe consequences in high-stakes domains such as education, healthcare, and law.


Section 03

Research Objectives and Evaluation Framework

Core Research Questions

  1. What proportion of paraphrased questions receive the same answer as the original?
  2. How does consistency differ across models?
  3. Which types of paraphrase most often lead to inconsistent answers?
  4. How effective is NLI filtering at removing semantically non-equivalent paraphrases?

Evaluation Framework

  1. Data Preparation: Select multiple-choice commonsense question answering datasets;
  2. Paraphrase Generation: Use LLMs to generate diverse paraphrases of original questions;
  3. NLI Filtering: Screen semantically equivalent paraphrases;
  4. Model Inference: Target models answer original and paraphrased questions;
  5. Consistency Evaluation: Calculate answer consistency metrics.
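The five steps above can be sketched as a single per-item pipeline. This is a minimal illustration, not the project's code: every function body here is a hypothetical placeholder (a real implementation would prompt an LLM for paraphrases, run an NLI model for filtering, and query the target model for answers).

```python
# Sketch of the evaluation pipeline. All function bodies are placeholders
# standing in for the notebook implementations described above.

def generate_paraphrases(question: str, n: int = 3) -> list[str]:
    # Placeholder: a real implementation would prompt an LLM for n rewrites.
    return [f"{question} (variant {i})" for i in range(n)]

def is_semantically_equivalent(original: str, paraphrase: str) -> bool:
    # Placeholder: a real implementation would run an NLI model and keep
    # only paraphrases that mutually entail the original question.
    return original in paraphrase

def answer(question: str, choices: list[str]) -> str:
    # Placeholder: a real implementation would query the target model.
    return choices[0]

def evaluate_item(question: str, choices: list[str]) -> float:
    """Fraction of NLI-filtered paraphrases whose answer matches the
    answer given to the original question."""
    original_answer = answer(question, choices)
    kept = [p for p in generate_paraphrases(question)
            if is_semantically_equivalent(question, p)]
    if not kept:
        return 1.0  # no valid paraphrases survived filtering
    matches = sum(answer(p, choices) == original_answer for p in kept)
    return matches / len(kept)

rate = evaluate_item("Where would you store a jam jar?", ["pantry", "garage"])
```

Averaging `evaluate_item` over a dataset yields the overall consistency rate discussed in the evaluation step.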

Section 04

Technical Implementation and Toolchain Details

Project Structure (Jupyter Notebook)

  • 01_setup_and_data.ipynb: Environment configuration and data loading;
  • 02_paraphrase_generation.ipynb: Paraphrase generation and saving;
  • 03_NLI_filtering.ipynb: Semantically equivalent paraphrase screening;
  • 04_llm_inference.ipynb: Model inference and answer recording;
  • 05_evaluation_and_plots.ipynb: Metric calculation and visualization.

Key Technical Components

  • NLI: Judge the entailment relationship between paraphrases and original questions, retaining only equivalent paraphrases;
  • Multiple-choice Question Answering: Standardized format facilitates quantitative evaluation and cross-model comparison;
  • Consistency Metrics: Such as answer selection consistency rate, confidence change, etc.
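The consistency metrics above can be computed directly from recorded answers. The sketch below assumes a simple data layout (one original answer plus a list of paraphrase answers per item) that is an illustration, not the project's actual schema.

```python
from collections import Counter

def consistency_rate(original: str, paraphrase_answers: list[str]) -> float:
    """Fraction of paraphrase answers identical to the original answer."""
    if not paraphrase_answers:
        return 1.0
    return sum(a == original for a in paraphrase_answers) / len(paraphrase_answers)

def majority_agreement(paraphrase_answers: list[str]) -> float:
    """How concentrated the answers are on the single most common choice."""
    counts = Counter(paraphrase_answers)
    return counts.most_common(1)[0][1] / len(paraphrase_answers)

# Example: the model picks "B" for the original question,
# then answers B, B, C across three paraphrases.
rate = consistency_rate("B", ["B", "B", "C"])
agree = majority_agreement(["B", "B", "C"])
```

Both values lie in [0, 1]; confidence-change metrics would additionally require the model's per-choice probabilities, which this sketch omits.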

Section 05

Research Findings and Methodological Contributions

Research Findings and Significance

Paraphrase consistency is an important indicator of LLM robustness, reflecting whether a model understands the question itself rather than memorizing its surface wording. Its application implications include:

  • Model Selection: Treat consistency as a key evaluation dimension;
  • Prompt Engineering: Design more robust prompt strategies;
  • Answer Verification: Verify stability through multiple paraphrased versions;
  • Model Improvement: Guide training and fine-tuning directions.
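The "answer verification" idea above can be sketched as a majority vote over paraphrases: accept an answer only when a clear majority of rephrasings agree. `ask_model` is a hypothetical stand-in for a real model API call, and the 0.6 threshold is an arbitrary illustrative choice.

```python
from collections import Counter

def ask_model(question: str) -> str:
    # Placeholder answer function; deterministic here for illustration.
    return "A" if "capital" in question else "B"

def verified_answer(paraphrases: list[str], threshold: float = 0.6):
    """Return the majority answer if it clears the agreement threshold,
    otherwise None to flag the item for human review."""
    answers = [ask_model(q) for q in paraphrases]
    choice, votes = Counter(answers).most_common(1)[0]
    return choice if votes / len(answers) >= threshold else None

result = verified_answer([
    "What is the capital of France?",
    "Which city is France's capital?",
    "France's capital city is which one?",
])
```

Items that fail verification are exactly those where paraphrase inconsistency makes the model's answer untrustworthy.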

Methodological Contributions

  • Systematic evaluation process with reusable Notebook implementations;
  • NLI filtering improves the reliability of paraphrase quality control;
  • Clear code structure for easy reproduction and extension.

Section 06

Current Limitations and Future Research Directions

Current Limitations

  • Dataset: Covers only commonsense question answering; tasks such as mathematical reasoning and code generation are not included;
  • Paraphrase Types: Automatically generated paraphrases have limitations in diversity and naturalness;
  • Model Coverage: Due to API and resource constraints, not all mainstream models are covered.

Future Directions

  • Expand cross-task evaluation;
  • Adversarial paraphrase testing to probe worst-case robustness;
  • Explore training/fine-tuning techniques to improve consistency;
  • Compare automatic evaluation results with human judgment.

Section 07

Implications for AI Safety and Alignment

Paraphrase consistency is closely related to AI safety and alignment:

  • Models sensitive to surface wording can be maliciously exploited: paraphrasing may be used to elicit inappropriate outputs;
  • Inconsistency reflects a lack of transparency in model decisions, hurting interpretability.

Improving consistency is therefore both a performance issue and a safety issue.

Section 08

Conclusion: Key Indicator for Reliable AI Systems

This project evaluates LLM paraphrase consistency with a systematic approach, providing empirical results and a reusable toolchain. It is a reminder that, alongside raw performance, foundational properties such as robustness and consistency deserve attention. As LLMs are increasingly deployed in critical domains, paraphrase consistency evaluation will become an important reference for building reliable AI systems, helping to create more trustworthy AI technologies.