Zing Forum

Impact of Prompt Politeness Level on Outputs of Domestic Large Language Models: A Systematic Experimental Study

This article introduces an experimental study on domestic large language models, exploring the impact of prompt politeness level on model output results. Through nine rounds of iterative experiments, the research team compared the performance of models such as DeepSeek, Doubao, and Qwen under prompts of different politeness levels, and found that politeness level may significantly affect the model's accuracy rate, refusal rate, and output stability.

Tags: Large Language Models · Prompt Engineering · Polite Prompts · Domestic Models · DeepSeek · Doubao · Qwen · Model Evaluation · Prompt Optimization
Published 2026-04-10 02:06 · Recent activity 2026-04-10 02:18 · Estimated read 7 min

Section 01

Introduction: Study on the Impact of Prompt Politeness Level on Outputs of Domestic Large Language Models

This study runs a systematic experiment on domestic large language models to explore how prompt politeness level shapes model outputs. Across nine rounds of iterative experiments, the research team compared models such as DeepSeek, Doubao, and Qwen under prompts of different politeness levels and found that politeness level may significantly affect accuracy rate, refusal rate, and output stability. The study aims to fill the research gap on domestic models in the Chinese context and to provide empirical evidence for prompt engineering practice.

Section 02

Research Background and Motivation

In human-machine conversations, users often use polite language, but whether these expressions affect the quality of model outputs remains unclear. Previous cross-language studies have shown that politeness level may affect model performance, but systematic research on domestic large language models is still lacking. This study focuses on the Chinese context and explores the systematic impact of polite prompts on the outputs of domestic models to fill this gap.

Section 03

Experimental Design and Methods

Model Selection

  • DeepSeek: An open-source model known for its reasoning ability
  • Doubao: ByteDance's dialogue model
  • Qwen: Alibaba's large language model series

Experimental Process

  • Question bank construction: Mainly Chinese objective questions, using authoritative datasets such as GAOKAO-Bench
  • Prompt design: Versions of different politeness levels (from direct command to highly polite)
  • Repeated experiments: Multiple tests for each question-model-politeness level combination
  • Result extraction and statistics: Automated scripts to extract answers, using paired t-tests to evaluate significance
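The process above can be sketched end to end. The politeness wrappers and accuracy numbers below are illustrative assumptions (the study's actual prompt texts and data are not reproduced here); the paired t-statistic uses the standard formula over per-question score differences.

```python
import math
import statistics

# Hypothetical politeness wrappers around the same underlying question;
# the study's actual prompt texts are not published in this summary.
POLITENESS = {
    "direct":        "回答下面的选择题。{q}",
    "polite":        "请回答下面的选择题。{q}",
    "highly_polite": "您好，麻烦您回答下面的选择题，非常感谢。{q}",
}

def build_prompts(question: str) -> dict:
    """Produce one prompt per politeness level for a given question."""
    return {level: tpl.format(q=question) for level, tpl in POLITENESS.items()}

def paired_t(scores_a, scores_b):
    """Paired t-statistic over per-question scores for two politeness levels."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_d = statistics.fmean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n-1 denominator)
    return mean_d / (sd_d / math.sqrt(len(diffs)))

# Mock per-question accuracy (1 = correct) for one model under two levels
direct_acc = [1, 0, 1, 1, 0, 1, 0, 1]
polite_acc = [1, 1, 1, 1, 0, 1, 1, 1]
t_stat = paired_t(polite_acc, direct_acc)
```

Note that the t-statistic alone does not settle significance: a decision also needs the p-value from the t-distribution with n−1 degrees of freedom (in practice, e.g., `scipy.stats.ttest_rel`).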

Technical Implementation

Implemented in Python 3.10+, the experiment code relies on the openai, requests, and pandas libraries, with model access credentials configured via api_keys.json.
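As a rough illustration, the configuration step might look like the sketch below; the field names and endpoints in the sample api_keys.json are assumptions, since the article does not show the actual schema.

```python
import json
import os
import tempfile

# Hypothetical api_keys.json layout -- field names and endpoints are
# assumptions, not the study's published schema.
SAMPLE_CONFIG = {
    "deepseek": {"api_key": "sk-...", "base_url": "https://api.deepseek.com"},
    "doubao":   {"api_key": "sk-...", "base_url": "https://..."},
    "qwen":     {"api_key": "sk-...", "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1"},
}

def load_model_config(path: str) -> dict:
    """Read per-model access information from a JSON config file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Demo: write the sample config to a temp file and read it back.
with tempfile.TemporaryDirectory() as d:
    cfg_path = os.path.join(d, "api_keys.json")
    with open(cfg_path, "w", encoding="utf-8") as f:
        json.dump(SAMPLE_CONFIG, f)
    config = load_model_config(cfg_path)

# With the openai client (>= 1.0), one would then build per-model clients:
# client = openai.OpenAI(api_key=config["deepseek"]["api_key"],
#                        base_url=config["deepseek"]["base_url"])
```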

Section 04

Evolution of Nine Rounds of Iterative Experiments

  • Exploration phase (rounds 1-5): Building the framework, adjusting prompt design, optimizing the question bank
  • Expansion phase (rounds 6-8): Expanding to multiple models, discovering differences in model response speed and output characteristics
  • Deepening phase (round 9): The largest-scale round, completing the full experiment for DeepSeek and partial tests for Doubao and Qwen

Section 05

Preliminary Findings and Challenges

Main Findings

  • Polite prompts affect model outputs, but the direction and degree vary by model: some models achieve higher accuracy under highly polite prompts, while others merely produce more verbose output
  • Robustness of answer extraction is a key challenge: Polite prompts lead to longer reasoning processes, increasing the difficulty of automated extraction
  • Significant differences in model response characteristics: Generation speed, for example, varies enough to affect the feasibility of large-scale experiments
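The extraction challenge can be made concrete with a minimal sketch; the regular expressions below are illustrative assumptions, not the study's actual scripts.

```python
import re

# Primary pattern: "答案是 B" / "answer is C" style statements.
# These patterns are assumptions; the study's real extraction logic is not shown.
CHOICE_RE = re.compile(r"(?:答案|answer)\s*(?:是|为|is|:|：)?\s*([A-Da-d])",
                       re.IGNORECASE)

def extract_choice(output: str):
    """Return the multiple-choice letter from model output, or None."""
    m = CHOICE_RE.search(output)
    if m:
        return m.group(1).upper()
    # Fallback: a lone option letter at the very end of the output
    m = re.search(r"\b([A-D])\b\s*[。.]?\s*$", output.strip())
    return m.group(1) if m else None

choice = extract_choice("让我们一步步分析……综合以上，答案是 B。")  # -> "B"
```

Polite prompts that elicit long chains of reasoning are exactly the case where a fixed pattern can miss or grab the wrong letter, which is why extraction robustness is flagged as a key challenge.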

Technical Challenges

  • Question bank quality control: Early rounds were affected by manually rewritten questions and inconsistent question types
  • Result extraction accuracy: Automated extraction had cases of incorrect and missing extractions
  • Timeout and truncation: Polite prompts increase output length, leading to API timeouts or response truncation
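One common mitigation for the timeout problem is to wrap each API call in a retry loop with exponential backoff. The sketch below is a generic pattern, not the study's actual code; `flaky_call` merely simulates an API that times out twice. (Truncation would additionally require checking the response's finish reason and raising the output length limit.)

```python
import time

def with_retry(fn, retries=3, backoff=0.01, exceptions=(TimeoutError,)):
    """Call fn, retrying on timeout-like errors with exponential backoff."""
    for attempt in range(retries):
        try:
            return fn()
        except exceptions:
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(backoff * (2 ** attempt))

# Simulated API call that times out twice before succeeding
attempts = {"n": 0}
def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("simulated API timeout")
    return "D"

answer = with_retry(flaky_call)  # succeeds on the third attempt
```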

Section 06

Implications for Prompt Engineering Practice

  • Prompt design requires systematic thinking: Polite language may be a substantive factor affecting model behavior
  • Model selection should be combined with specific scenarios: Different models have different sensitivities to prompt changes
  • Evaluation process needs to be robust: When prompt changes lead to output format changes, the answer extraction logic needs to be adjusted

Section 07

Future Work Directions

  • Improve experiment coverage: Complete full experiments for Doubao and Qwen
  • Improve question bank quality: Clean and validate the question bank to ensure consistency of questions, materials, and standard answers
  • Deepen statistical analysis: Explore the underlying mechanisms of how politeness level affects model outputs
  • Expand research scope: Explore the impact of other prompt features (such as concreteness, emotional tone)