Zing Forum

Impact of Prompt Politeness Level on Outputs of Domestic Large Language Models: A Systematic Experimental Study

This article introduces an experimental study on domestic large language models, exploring the impact of prompt politeness level on model output results. Through nine rounds of iterative experiments, the research team compared the performance of models such as DeepSeek, Doubao, and Qwen under prompts of different politeness levels, and found that politeness level may significantly affect the model's accuracy rate, refusal rate, and output stability.

Tags: Large Language Models · Prompt Engineering · Polite Prompts · Domestic Models · DeepSeek · Doubao · Qwen · Model Evaluation · Prompt Optimization
Published 2026-04-10 02:06 · Recent activity 2026-04-10 02:18 · Estimated read 7 min

Section 01

Introduction: Study on the Impact of Prompt Politeness Level on Outputs of Domestic Large Language Models

This study runs a systematic experiment on domestic large language models to explore how prompt politeness level shapes model outputs. Across nine rounds of iterative experiments, the research team compared models such as DeepSeek, Doubao, and Qwen under prompts of different politeness levels and found that politeness level may significantly affect accuracy rate, refusal rate, and output stability. The study aims to fill the research gap on domestic models in the Chinese context and to provide empirical evidence for prompt engineering practice.

Section 02

Research Background and Motivation

In human-machine conversations, users often use polite language, but whether these expressions affect the quality of model outputs remains unclear. Previous cross-language studies have shown that politeness level may affect model performance, but systematic research on domestic large language models is still lacking. This study focuses on the Chinese context and explores the systematic impact of polite prompts on the outputs of domestic models to fill this gap.

Section 03

Experimental Design and Methods

Model Selection

  • DeepSeek: An open-source model known for its reasoning ability
  • Doubao: ByteDance's dialogue model
  • Qwen: Alibaba's large language model series

Experimental Process

  • Question bank construction: Mainly Chinese objective questions, using authoritative datasets such as GAOKAO-Bench
  • Prompt design: Versions of different politeness levels (from direct command to highly polite)
  • Repeated experiments: Multiple tests for each question-model-politeness level combination
  • Result extraction and statistics: Automated scripts to extract answers, using paired t-tests to evaluate significance
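The process above can be sketched end to end. The politeness wrappers and accuracy numbers below are illustrative assumptions (the study's actual prompt texts and data are not reproduced here); the paired t-statistic uses the standard formula over per-question score differences.

```python
import math
import statistics

# Hypothetical politeness wrappers around the same underlying question;
# the study's actual prompt texts are not published in this summary.
POLITENESS = {
    "direct":        "回答下面的选择题。{q}",
    "polite":        "请回答下面的选择题。{q}",
    "highly_polite": "您好，麻烦您回答下面的选择题，非常感谢。{q}",
}

def build_prompts(question: str) -> dict:
    """Produce one prompt per politeness level for a given question."""
    return {level: tpl.format(q=question) for level, tpl in POLITENESS.items()}

def paired_t(scores_a, scores_b):
    """Paired t-statistic over per-question scores for two politeness levels."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_d = statistics.fmean(diffs)
    sd_d = statistics.stdev(diffs)  # sample standard deviation (n-1 denominator)
    return mean_d / (sd_d / math.sqrt(len(diffs)))

# Mock per-question accuracy (1 = correct) for one model under two levels
direct_acc = [1, 0, 1, 1, 0, 1, 0, 1]
polite_acc = [1, 1, 1, 1, 0, 1, 1, 1]
t_stat = paired_t(polite_acc, direct_acc)
```

Note that the t-statistic alone does not settle significance: a decision also needs the p-value from the t-distribution with n−1 degrees of freedom (in practice, e.g., `scipy.stats.ttest_rel`).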

Technical Implementation

Implemented in Python 3.10+, the experiment code relies on the openai, requests, and pandas libraries, with model access credentials configured via api_keys.json.
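As a rough illustration, the configuration step might look like the sketch below; the field names and endpoints in the sample api_keys.json are assumptions, since the article does not show the actual schema.

```python
import json
import os
import tempfile

# Hypothetical api_keys.json layout -- field names and endpoints are
# assumptions, not the study's published schema.
SAMPLE_CONFIG = {
    "deepseek": {"api_key": "sk-...", "base_url": "https://api.deepseek.com"},
    "doubao":   {"api_key": "sk-...", "base_url": "https://..."},
    "qwen":     {"api_key": "sk-...", "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1"},
}

def load_model_config(path: str) -> dict:
    """Read per-model access information from a JSON config file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Demo: write the sample config to a temp file and read it back.
with tempfile.TemporaryDirectory() as d:
    cfg_path = os.path.join(d, "api_keys.json")
    with open(cfg_path, "w", encoding="utf-8") as f:
        json.dump(SAMPLE_CONFIG, f)
    config = load_model_config(cfg_path)

# With the openai client (>= 1.0), one would then build per-model clients:
# client = openai.OpenAI(api_key=config["deepseek"]["api_key"],
#                        base_url=config["deepseek"]["base_url"])
```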

Section 04

Evolution of Nine Rounds of Iterative Experiments

  • Exploration phase (rounds 1-5): Building the framework, adjusting prompt design, optimizing the question bank
  • Expansion phase (rounds 6-8): Expanding to multiple models, discovering differences in model response speed and output characteristics
  • Deepening phase (round 9): The largest-scale round, completing the full experiment for DeepSeek and partial tests for Doubao and Qwen

Section 05

Preliminary Findings and Challenges

Main Findings

  • Polite prompts affect model outputs, but the direction and degree vary by model: some models achieve higher accuracy under highly polite prompts, while others merely produce more verbose output
  • Robustness of answer extraction is a key challenge: Polite prompts lead to longer reasoning processes, increasing the difficulty of automated extraction
  • Significant differences in model response characteristics: Generation speed, for example, varies enough to affect the feasibility of large-scale experiments
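The extraction challenge can be made concrete with a minimal sketch; the regular expressions below are illustrative assumptions, not the study's actual scripts.

```python
import re

# Primary pattern: "答案是 B" / "answer is C" style statements.
# These patterns are assumptions; the study's real extraction logic is not shown.
CHOICE_RE = re.compile(r"(?:答案|answer)\s*(?:是|为|is|:|：)?\s*([A-Da-d])",
                       re.IGNORECASE)

def extract_choice(output: str):
    """Return the multiple-choice letter from model output, or None."""
    m = CHOICE_RE.search(output)
    if m:
        return m.group(1).upper()
    # Fallback: a lone option letter at the very end of the output
    m = re.search(r"\b([A-D])\b\s*[。.]?\s*$", output.strip())
    return m.group(1) if m else None

choice = extract_choice("让我们一步步分析……综合以上，答案是 B。")  # -> "B"
```

Polite prompts that elicit long chains of reasoning are exactly the case where a fixed pattern can miss or grab the wrong letter, which is why extraction robustness is flagged as a key challenge.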

Technical Challenges

  • Question bank quality control: Early rounds were affected by manually rewritten questions and inconsistent question types
  • Result extraction accuracy: Automated extraction had cases of incorrect and missing extractions
  • Timeout and truncation: Polite prompts increase output length, leading to API timeouts or response truncation
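One common mitigation for the timeout problem is to wrap each API call in a retry loop with exponential backoff. The sketch below is a generic pattern, not the study's actual code; `flaky_call` merely simulates an API that times out twice. (Truncation would additionally require checking the response's finish reason and raising the output length limit.)

```python
import time

def with_retry(fn, retries=3, backoff=0.01, exceptions=(TimeoutError,)):
    """Call fn, retrying on timeout-like errors with exponential backoff."""
    for attempt in range(retries):
        try:
            return fn()
        except exceptions:
            if attempt == retries - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(backoff * (2 ** attempt))

# Simulated API call that times out twice before succeeding
attempts = {"n": 0}
def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("simulated API timeout")
    return "D"

answer = with_retry(flaky_call)  # succeeds on the third attempt
```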

Section 06

Implications for Prompt Engineering Practice

  • Prompt design requires systematic thinking: Polite language may be a substantive factor affecting model behavior
  • Model selection should be combined with specific scenarios: Different models have different sensitivities to prompt changes
  • Evaluation process needs to be robust: When prompt changes lead to output format changes, the answer extraction logic needs to be adjusted

Section 07

Future Work Directions

  • Improve experiment coverage: Complete full experiments for Doubao and Qwen
  • Improve question bank quality: Clean and validate the question bank to ensure consistency of questions, materials, and standard answers
  • Deepen statistical analysis: Explore the underlying mechanisms of how politeness level affects model outputs
  • Expand research scope: Explore the impact of other prompt features (such as concreteness, emotional tone)