Zing Forum

The Impact of Politeness on LLMs: A Cross-Lingual, Multi-Model Study Using the PLUM Corpus

This study uses the PLUM corpus to investigate how polite language affects the response quality of large language models (LLMs). The experiments cover 3 languages, 5 models, and 22,500 prompt-response pairs. The study finds that polite prompts can improve response quality by approximately 11%, but the effect varies across languages and models and is not universal.

Tags: politeness, LLM behavior, cross-linguistic, multilingual, PLUM corpus, prompt engineering, cultural differences, human-AI interaction
Published 2026-04-18 01:33 · Recent activity 2026-04-20 10:57 · Estimated read: 8 min

Section 01

[Introduction] The Impact of Politeness on LLM Response Quality: Core Overview of Cross-Lingual Multi-Model Research

This paper, The Impact of Politeness on LLMs: A Cross-Lingual, Multi-Model Study Using the PLUM Corpus, examines whether polite language affects the response quality of large language models (LLMs). The study covers 3 languages (English, Hindi, Spanish), 5 models (Gemini-Pro, GPT-4o Mini, Claude 3.7 Sonnet, DeepSeek-Chat, Llama 3), and 22,500 prompt-response pairs. Key findings: polite prompts improve response quality by an average of about 11%, but the effect varies across languages (e.g., Hindi favors respectful, indirect expressions, while Spanish favors firm, confident ones) and models (e.g., Llama 3 is the most tone-sensitive, while GPT-4o Mini is more robust). The tone of dialogue history also affects the quality of current responses. The study aims to characterize how politeness interacts with LLMs, offering guidance for users' communication strategies and developers' model optimization.

Section 02

Research Background and Theoretical Foundations

The study is based on two classic sociolinguistic theories:

  1. Brown and Levinson's Politeness Theory: treats politeness as "facework," comprising positive face (the need for approval and recognition) and negative face (the need for freedom of action); polite language serves to maintain both parties' face.
  2. Culpeper's Impoliteness Framework: studies behaviors that deliberately attack or disregard others' face (e.g., imperative tone, sarcasm).

These theories provide an analytical framework for classifying the politeness level of prompts and observing differences in LLM responses. Moreover, as LLMs become woven into daily life, understanding how politeness affects their performance has practical value, such as guiding users toward effective communication and adapting usage strategies across cultures.

Section 03

Research Methods: PLUM Corpus and Evaluation Framework

PLUM Corpus:

  • Sample size: 22,500 prompt-response pairs
  • Coverage: 3 languages (English, Hindi, Spanish), 5 models, 3 types of dialogue history (original, polite, impolite)
  • Politeness-level annotation: 5 levels (very polite → very impolite), manually verified for consistency.

Evaluation Framework: response quality is scored along 8 dimensions: coherence, clarity, depth, responsiveness, context retention, toxicity, conciseness, and readability.
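As an illustration, a PLUM-style record and an aggregate quality score might be sketched as follows. The field names, score ranges, and the unweighted-mean aggregation are assumptions made for this example, not the corpus's actual schema or the paper's scoring rule.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical record shape for one prompt-response pair;
# all field names here are illustrative assumptions.
@dataclass
class PromptResponsePair:
    language: str          # "en", "hi", or "es"
    model: str             # e.g. "llama-3"
    history_type: str      # "original", "polite", or "impolite"
    politeness_level: int  # 1 (very impolite) .. 5 (very polite)
    scores: dict           # the eight evaluation dimensions -> 0..1

DIMENSIONS = ["coherence", "clarity", "depth", "responsiveness",
              "context_retention", "toxicity", "conciseness", "readability"]

def quality_score(pair: PromptResponsePair) -> float:
    """Unweighted mean over the eight dimensions; toxicity is inverted,
    since lower toxicity means higher quality."""
    vals = []
    for dim in DIMENSIONS:
        v = pair.scores[dim]
        vals.append(1.0 - v if dim == "toxicity" else v)
    return mean(vals)

example = PromptResponsePair("en", "llama-3", "polite", 5,
                             {d: 0.8 for d in DIMENSIONS})
print(round(quality_score(example), 3))  # -> 0.725
```

Inverting toxicity keeps the aggregate monotone (higher is always better); a weighted mean would work the same way if some dimensions matter more.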

Section 04

Core Findings: The Complexity of Politeness Effects

  1. Politeness does affect quality: polite prompts improve response quality by an average of about 11% over neutral prompts, while impolite prompts degrade it.
  2. Cross-language differences: English adapts to a range of tones; Hindi favors respectful, indirect expressions; Spanish favors firm, confident ones.
  3. Cross-model differences: Llama 3 is the most tone-sensitive (an effect range of 11.5%); GPT-4o Mini is robust to impolite inputs; the other models show moderate sensitivity.
  4. Impact of dialogue history: when the tone of earlier turns is negative, response quality can suffer even if the current prompt is polite.
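The ~11% figure above is a relative improvement over a neutral baseline. A minimal sketch of that computation, using made-up placeholder averages rather than the study's data:

```python
# Relative improvement of polite-prompt quality over a neutral baseline.
# The two averages below are illustrative placeholders, not study results.
def relative_improvement(polite_avg: float, neutral_avg: float) -> float:
    """Percent change of the polite condition over the neutral condition."""
    return (polite_avg - neutral_avg) / neutral_avg * 100

print(round(relative_improvement(0.777, 0.700), 1))  # -> 11.0
```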

Section 05

Analysis of the Causes of Differences

The differences stem from three aspects:

  1. Training data characteristics: the contextual distribution of each language's training data (e.g., Hindi corpora contain more formal, respectful contexts) and the scale and diversity of a model's training data (e.g., GPT's diverse data makes it robust).
  2. Alignment training strategies: vendors design different responses to polite inputs (e.g., some models are explicitly trained to respond positively to polite prompts).
  3. Cultural bias: models may mirror the cultural values of their training data; English-centric training, for instance, can leave a model insufficiently sensitive to other cultures' norms.

Section 06

Practical Implications: Recommendations for Users and Developers

User Recommendations:

  1. Maintain basic politeness (average 11% quality improvement);
  2. Consider language and culture (e.g., use respectful and indirect expressions for Hindi, direct and firm expressions for Spanish);
  3. Adapt to model characteristics (Llama 3 requires more politeness, GPT can be more direct);
  4. Maintain dialogue tone (avoid letting negative history accumulate).

Developer Recommendations:

  1. Cultural adaptation design (adjust sensitivity by language/region);
  2. Robustness training (improve tolerance of impolite inputs);
  3. Transparent communication (inform users that tone affects responses).
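The user-side recommendations above could be sketched as a small helper that wraps a bare request in a language-appropriate polite framing. The phrasings, language codes, and function name are illustrative assumptions, not anything proposed by the paper.

```python
# Illustrative politeness framings per language; these specific phrasings
# are assumptions for the sketch, chosen to echo the paper's findings
# (respectful/indirect for Hindi, courteous-but-direct for Spanish).
POLITE_PREFIX = {
    "en": "Could you please",
    "hi": "Kripya",       # respectful, indirect register
    "es": "Por favor,",   # courteous but direct
}

def politen(prompt: str, language: str = "en") -> str:
    """Wrap a bare request in a polite framing for the given language."""
    prefix = POLITE_PREFIX.get(language, POLITE_PREFIX["en"])
    return f"{prefix} {prompt[0].lower()}{prompt[1:]}"

print(politen("Summarize this article.", "en"))
# -> Could you please summarize this article.
```

In a real application the framing would be generated in the target language end to end; the dictionary lookup here only illustrates the idea of adapting tone per language and model.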

Section 07

Limitations and Future Directions

Limitations:

  1. Limited language coverage (only 3 languages, missing Chinese, Arabic, etc.);
  2. Task type restrictions (only general dialogue, not involving professional tasks like programming);
  3. Static evaluation (single interactions only; long-term dynamics not studied).

Future Directions:

  1. Expand language coverage;
  2. Task-specific studies (e.g., creative writing, code generation);
  3. Dynamic interaction research (how politeness effects evolve over multi-turn dialogues);
  4. Intervention strategies (enabling models to maintain high-quality responses under any tone).