# Impact of Prompt Politeness Level on Outputs of Domestic Large Language Models: A Systematic Experimental Study

> This article presents an experimental study of how prompt politeness affects the output of domestic (Chinese-developed) large language models. Over nine rounds of iterative experiments, the research team compared models such as DeepSeek, Doubao, and Qwen under prompts at different politeness levels and found that politeness level may significantly affect accuracy, refusal rate, and output stability.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-09T18:06:36.000Z
- Last activity: 2026-04-09T18:18:29.619Z
- Popularity: 152.8
- Keywords: large language models, prompt engineering, polite prompts, domestic models, DeepSeek, Doubao, Tongyi Qianwen (Qwen), model evaluation, prompt optimization
- Page link: https://www.zingnex.cn/en/forum/thread/llm-github-acha-xwx-analysis-of-the-impact-of-prompt-politeness-on-the-output-of-chinese-la
- Canonical: https://www.zingnex.cn/forum/thread/llm-github-acha-xwx-analysis-of-the-impact-of-prompt-politeness-on-the-output-of-chinese-la

---

## Introduction

This article reports a systematic experiment on how the politeness level of a prompt affects the output of domestic (Chinese-developed) large language models. Over nine rounds of iterative experiments, the research team compared models such as DeepSeek, Doubao, and Qwen under prompts at different politeness levels and found that politeness level may significantly affect accuracy, refusal rate, and output stability. The study aims to fill the research gap on domestic models in the Chinese-language context and to provide empirical evidence for prompt engineering practice.

## Research Background and Motivation


In human-machine conversations, users often phrase requests politely, but it remains unclear whether this wording affects the quality of model outputs. Previous cross-lingual studies have shown that politeness level may affect model performance, yet systematic research on domestic large language models is still lacking. This study focuses on the Chinese-language context and examines how polite prompts affect the outputs of domestic models.

## Experimental Design and Methods


### Model Selection
- DeepSeek: An open-source model known for its reasoning ability
- Doubao: ByteDance's dialogue model
- Qwen: Alibaba's large language model series

### Experimental Process
- Question bank construction: Primarily Chinese-language objective questions (questions with a single checkable answer), drawn from established datasets such as GAOKAO-Bench
- Prompt design: Each question is written at several politeness levels, from a direct command to a highly polite request
- Repeated experiments: Each question-model-politeness level combination is tested multiple times
- Result extraction and statistics: Automated scripts extract the answers, and paired t-tests are used to evaluate significance
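
The paired comparison in the last step can be sketched as follows. This is a minimal illustration using only the standard library, not the study's own script: it assumes each question has already been reduced to one accuracy value per politeness level (the fraction of repeated runs answered correctly), and the function name and sample numbers are invented for the example.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(baseline, treatment):
    """Return (t, degrees_of_freedom) for two paired samples.

    Each position in `baseline` and `treatment` must refer to the same
    question, so the test is run on the per-question differences.
    """
    if len(baseline) != len(treatment):
        raise ValueError("paired samples must have the same length")
    diffs = [t - b for b, t in zip(baseline, treatment)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Invented per-question accuracies, NOT results from the study.
direct = [0.6, 0.8, 0.4, 1.0, 0.6]   # direct-command prompt
polite = [0.8, 0.8, 0.6, 1.0, 0.8]   # highly polite prompt

t_stat, dof = paired_t_statistic(direct, polite)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```

With these made-up numbers the statistic is about 2.449 with 4 degrees of freedom. Converting it to a p-value needs the t distribution; if SciPy is available, `scipy.stats.ttest_rel(polite, direct)` computes the same statistic together with the p-value.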

### Technical Implementation
The experiment code is written for Python 3.10+ and depends on `openai`, `requests`, and `pandas`. Model access information is configured in `api_keys.json`.
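
As a rough sketch of how such a setup might be wired together, the snippet below loads a config file and enumerates every question-model-politeness-repeat combination. The layout of `api_keys.json` is not described in the post, so the schema here (`api_key`, `base_url`, `model` per entry) is an assumption, as are the function names, the three prompt templates, and the dummy keys and URLs.

```python
import itertools
import json
import tempfile
from pathlib import Path

# Placeholder wording for three politeness levels (assumed, not the study's prompts).
POLITENESS_TEMPLATES = {
    "direct": "Answer this question. {question}",
    "polite": "Please answer this question. {question}",
    "very_polite": "Could you kindly answer this question? Thank you very much. {question}",
}

REQUIRED_FIELDS = {"api_key", "base_url", "model"}  # assumed schema


def load_api_keys(path):
    """Read the config file and check that each entry has the assumed fields."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    for name, entry in config.items():
        missing = REQUIRED_FIELDS - set(entry)
        if missing:
            raise KeyError(f"{name}: missing fields {sorted(missing)}")
    return config


def build_jobs(config, questions, repeats):
    """List one job per model x question x politeness level x repeat."""
    grid = itertools.product(
        config, questions.items(), POLITENESS_TEMPLATES.items(), range(repeats)
    )
    return [
        {"model": name, "question_id": qid, "level": level, "run": run,
         "prompt": template.format(question=text)}
        for name, (qid, text), (level, template), run in grid
    ]


# Demo with a throwaway config file; keys, URLs, and model names are dummies.
sample = {
    "deepseek": {"api_key": "sk-dummy", "base_url": "https://example.invalid/v1", "model": "model-a"},
    "qwen": {"api_key": "sk-dummy", "base_url": "https://example.invalid/v1", "model": "model-b"},
}
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "api_keys.json"
    config_path.write_text(json.dumps(sample), encoding="utf-8")
    config = load_api_keys(config_path)

questions = {
    "q1": "2 + 2 = ? A. 3 B. 4 C. 5 D. 6",
    "q2": "Which is a prime number? A. 4 B. 6 C. 7 D. 9",
}
jobs = build_jobs(config, questions, repeats=2)
print(len(jobs))  # 2 models x 2 questions x 3 levels x 2 repeats = 24
```

Building the full job list up front makes it easy to resume an interrupted round and to confirm that every combination was actually run before computing statistics.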

## Evolution of Nine Rounds of Iterative Experiments


- **Exploration phase (rounds 1-5):** Building the framework, adjusting prompt design, optimizing the question bank
- **Expansion phase (rounds 6-8):** Expanding to multiple models, discovering differences in model response speed and output characteristics
- **Deepening phase (round 9):** The largest-scale round, completing the full experiment for DeepSeek and partial tests for Doubao and Qwen

## Preliminary Findings and Challenges


### Main Findings
- Polite prompts affect model outputs, but the direction and degree vary by model: Some models have higher accuracy under highly polite prompts, while others produce verbose outputs
- Robustness of answer extraction is a key challenge: Polite prompts lead to longer reasoning processes, increasing the difficulty of automated extraction
- Significant differences in model response characteristics: For example, differences in generation speed affect the feasibility of large-scale experiments
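
To illustrate why longer reasoning makes extraction harder, here is one possible extraction strategy for multiple-choice answers: look only for explicitly marked answers and keep the last one, so that options mentioned during the reasoning are not mistaken for the final choice. The patterns and the function name are assumptions made for this sketch; the study's actual extraction rules are not given.

```python
import re

# Assumed answer markers for options A-D; real outputs will need more patterns.
_ANSWER_PATTERNS = [
    re.compile(r"答案\s*(?:是|为|选|应为|:|:)?\s*[\((]?\s*([A-D])(?![A-Za-z])"),
    re.compile(r"(?i:answer)\s*(?:is|:)?\s*\(?\s*([A-D])(?![A-Za-z])"),
]


def extract_choice(output):
    """Return the last explicitly marked option letter, or None if none is found.

    Returning None instead of guessing lets missing extractions be counted
    separately from wrong answers.
    """
    last_pos, last_letter = -1, None
    for pattern in _ANSWER_PATTERNS:
        for match in pattern.finditer(output):
            if match.start() > last_pos:
                last_pos, last_letter = match.start(), match.group(1)
    return last_letter


samples = [
    "经过分析,A 选项错误,所以答案是 B。",
    "初步判断答案是 A,但重新计算后,答案应为 D。",
    "Thank you for asking so kindly! After careful thought, the answer is (C).",
    "I'm sorry, but I can't help with that.",
]
extracted = [extract_choice(s) for s in samples]
print(extracted)  # ['B', 'D', 'C', None]
```

Keeping unmatched outputs as `None` also gives a direct way to track the missing-extraction rate per politeness level.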

### Technical Challenges
- Question bank quality control: Early issues such as manual rewriting and inconsistent question types
- Result extraction accuracy: Automated extraction had cases of incorrect and missing extractions
- Timeout and truncation: Polite prompts increase output length, leading to API timeouts or response truncation
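
One common way to handle the timeout and truncation problem is to retry timed-out calls with exponential backoff and to flag truncated replies so they can be excluded or rerun. The sketch below assumes a hypothetical `send` callable that returns a dict with `text` and `finish_reason`, where `"length"` means the output hit the token limit (the convention used by OpenAI-compatible chat APIs). It is not the study's implementation.

```python
import time


def call_with_retry(send, prompt, max_attempts=3, base_delay=1.0,
                    retry_on=(TimeoutError,), sleep=time.sleep):
    """Call send(prompt), retrying timeouts with exponential backoff.

    `send` is a hypothetical callable returning
    {"text": str, "finish_reason": str}. The reply is returned with an extra
    "truncated" flag instead of being silently scored.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            reply = send(prompt)
        except retry_on:
            if attempt == max_attempts:
                raise  # give up and let the caller record the failure
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
            continue
        return {**reply, "truncated": reply.get("finish_reason") == "length"}


# Demo with a fake sender that times out twice before answering.
delays = []
attempts = {"count": 0}


def flaky_send(prompt):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated timeout")
    return {"text": "B", "finish_reason": "stop"}


reply = call_with_retry(flaky_send, "demo prompt", sleep=delays.append)
print(reply, delays)
```

With a real client, `retry_on` would need to list that client's own timeout exception (for example `requests.exceptions.Timeout` when calling an endpoint through `requests`).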

## Implications for Prompt Engineering Practice


- Prompt design requires systematic thinking: Polite language may be a substantive factor affecting model behavior
- Model selection should be combined with specific scenarios: Different models have different sensitivities to prompt changes
- Evaluation process needs to be robust: When prompt changes lead to output format changes, the answer extraction logic needs to be adjusted

## Future Work Directions


- Improve experiment coverage: Complete full experiments for Doubao and Qwen
- Improve question bank quality: Clean and validate the question bank to ensure consistency of questions, materials, and standard answers
- Deepen statistical analysis: Explore the underlying mechanisms of how politeness level affects model outputs
- Expand research scope: Explore the impact of other prompt features (such as concreteness, emotional tone)