Zing Forum


The Consistency Dilemma of LLM Recommendation Explanations: Reliable Explainer or Unreliable Narrator?

Recent research systematically evaluates the explanation consistency and sensitivity of large language models (LLMs) in group recommendation tasks. It finds that different models show significant differences in generating recommendation reasons, with some models exhibiting the characteristics of an "unreliable narrator," sounding an alarm for the application of LLMs in high-stakes recommendation scenarios.

Tags: LLM recommendation systems · interpretability · consistency · sensitivity · group recommendation · model evaluation · recommendation explanations · AI trustworthiness
Published 2026-04-29 20:44 · Recent activity 2026-04-29 20:52 · Estimated read 7 min

Section 01

Main Floor: The Consistency Dilemma of LLM Recommendation Explanations: Reliable Explainer or Unreliable Narrator?

Recent research systematically evaluates the explanation consistency and sensitivity of large language models (LLMs) in group recommendation tasks. It finds that different models show significant differences in generating recommendation reasons, with some models exhibiting the characteristics of an "unreliable narrator," sounding an alarm for the application of LLMs in high-risk recommendation scenarios (e.g., healthcare, finance). The study focuses on explanation consistency (whether explanations remain consistent under the same recommendation decision) and sensitivity (whether explanations adjust reasonably when inputs change slightly). Multi-model comparison experiments in group recommendation scenarios lead to a central conclusion: the stability and credibility of explanations must be prioritized alongside recommendation accuracy.


Section 02

Research Background and Motivation

The interpretability of recommendation systems is a focus of attention in academia and industry. Traditional methods generate explanations based on item features or user history. The introduction of LLMs brings the possibility of natural, fluent, and personalized recommendation reasons, but it also comes with risks: if explanations are contradictory in different contexts or overly sensitive to minor inputs, they become "unreliable narrators," which may cause serious consequences in high-risk scenarios.


Section 03

Core Research Questions

The study focuses on two key dimensions:

  1. Consistency: Under the same recommendation decision, do the explanations generated by LLMs remain consistent? For example, if the same recommendation is justified by cost-effectiveness today and by brand tomorrow, there is a consistency problem.
  2. Sensitivity: When inputs change slightly, do explanations adjust reasonably and appropriately? Ideally, a model should be sensitive to key information and robust to irrelevant noise. If the explanation flips because of changes to irrelevant wording in the prompt, the model is overly sensitive.
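The consistency dimension above can be made concrete with a simple score: generate the explanation several times for the identical input and average the pairwise similarity of the outputs. This is a minimal sketch, assuming token-level Jaccard overlap as a stand-in similarity metric; the study's actual measurement method is not specified here, and embedding-based or human-judged similarity would be plausible alternatives.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two explanation strings (0.0 to 1.0)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not (ta | tb):
        return 1.0
    return len(ta & tb) / len(ta | tb)


def consistency_score(explanations: list[str]) -> float:
    """Mean pairwise similarity across repeated generations for the same
    recommendation; 1.0 means the wording is identical every time."""
    pairs = list(combinations(explanations, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A model that explains the same recommendation with "low price" on one run and "trendy brand" on the next would score near zero here, surfacing exactly the cost-effectiveness-versus-brand contradiction described above.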

Section 04

Experimental Design and Methods

The study designs an evaluation framework for group recommendation scenarios, which require balancing multiple users' preferences and therefore make explanation generation more challenging. It tests mainstream LLMs in a multi-model comparison and, by controlling the degree and nature of input changes, precisely measures each model's performance along the consistency and sensitivity dimensions.
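The idea of controlled input changes can be sketched as a small perturbation harness: compare explanation drift under edits that should not matter (robustness) against edits that should (appropriate sensitivity). This is an illustrative sketch, not the paper's actual protocol; the `explain` callable, the token-overlap drift measure, and the two edit lists are all assumptions.

```python
def _token_jaccard(a: str, b: str) -> float:
    """Token-set overlap between two explanation strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def sensitivity_gap(explain, base_prompt, irrelevant_edits, relevant_edits):
    """Measure explanation drift under two kinds of input perturbation.

    A well-behaved explainer shows low drift on edits that leave the
    decision-relevant content unchanged, and high drift on edits that
    alter it."""
    base = explain(base_prompt)

    def mean_drift(prompts):
        # Drift = 1 - similarity to the baseline explanation.
        return sum(1.0 - _token_jaccard(base, explain(p)) for p in prompts) / len(prompts)

    return {
        "irrelevant_drift": mean_drift(irrelevant_edits),
        "relevant_drift": mean_drift(relevant_edits),
    }
```

With a toy `explain` function, a model whose explanation is unchanged by a harmless paraphrase but shifts when a genuine preference ("budget") appears would report a low `irrelevant_drift` and a high `relevant_drift`; an overly sensitive model inverts that pattern.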


Section 05

Key Findings (Evidence)

The study reveals the following patterns:

  1. Systematic differences between models: There are significant differences in explanation quality among different LLMs. Some are stable and consistent, while others have highly variable outputs for the same input, which is related to model architecture, training data, and alignment strategies.
  2. Consistency issues: Some LLMs have time inconsistency (different explanations for the same input at different times), context inconsistency (different reasons for the same recommendation in different contexts), and logical inconsistency (internal contradictions in explanations).
  3. Sensitivity spectrum: Models are distributed from "overly dull" to "overly sensitive". Overly sensitive models completely change their explanations due to minor wording adjustments in the prompt, making the system difficult to predict and debug.

Section 06

Implications for Practice (Recommendations)

Implications for teams deploying LLM recommendation systems:

  1. Model selection: In addition to accuracy, evaluate explanation consistency and sensitivity. Models that perform well on general tasks may still lack explanation stability.
  2. Post-hoc verification: In high-risk scenarios, add post-hoc verification of explanations, such as caching historical explanations to detect anomalies and using consistency checkers to flag suspicious drift.
  3. Prompt engineering: Carefully designing system prompts can alleviate consistency issues, such as requiring adherence to a specific explanation framework.
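The post-hoc verification idea above can be sketched as a small monitor: cache the first explanation seen for each recommendation decision and flag later explanations that drift beyond a similarity threshold. This is a minimal sketch under stated assumptions; the cache key, the token-overlap similarity, and the threshold value are all illustrative choices, not part of the study.

```python
class ExplanationMonitor:
    """Post-hoc check: cache the first explanation for each
    (group, item) decision and flag later explanations that drift
    beyond a similarity threshold."""

    def __init__(self, threshold: float = 0.5):
        self.cache: dict[tuple, str] = {}
        self.threshold = threshold

    @staticmethod
    def _similarity(a: str, b: str) -> float:
        """Token-set overlap between two explanation strings."""
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

    def check(self, key: tuple, explanation: str) -> bool:
        """Return True if the explanation is consistent with the cached
        one (or is the first seen); False flags a suspicious drift."""
        if key not in self.cache:
            self.cache[key] = explanation
            return True
        return self._similarity(self.cache[key], explanation) >= self.threshold
```

In production, a flagged drift might trigger logging or a fallback to a templated explanation rather than blocking the recommendation itself.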

Section 07

Limitations and Future Directions

Limitations: The study focuses on group recommendation scenarios, so applicability to other paradigms (sequential, conversational recommendation) remains to be verified; it also mainly covers English, so multilingual performance needs exploration. Future directions: Develop automated tools for evaluating explanation consistency; explore fine-tuning methods to improve explanation stability; study the relationship between user perception and objective consistency metrics.