# East vs. West Large Models Code Capability Showdown: How Prompt Changes Affect Generation Quality

> A study from Chitkara University in India systematically evaluated the performance of six mainstream large language models (LLMs) in code generation tasks, with a special focus on how changes in prompt formats affect model outputs. The study used a composite evaluation framework, scoring comprehensively across four dimensions: functional accuracy, grammatical correctness, optimization quality, and response efficiency.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-12T19:12:20.000Z
- Last activity: 2026-05-12T19:18:22.969Z
- Popularity: 154.9
- Keywords: large language models, code generation, prompt engineering, model evaluation, Claude, Kimi, GPT-4o, Gemini, AI programming, software engineering
- Page link: https://www.zingnex.cn/en/forum/thread/llm-github-mayankbansal2004-benchmarking-large-language-models-for-code-generation-under-pr
- Canonical: https://www.zingnex.cn/forum/thread/llm-github-mayankbansal2004-benchmarking-large-language-models-for-code-generation-under-pr
- Markdown source: floors_fallback

---

## East vs. West Large Models Code Capability Showdown: How Prompt Changes Affect Generation Quality

A study by Chitkara University in India evaluated the code generation performance of six mainstream large language models (LLMs), focusing on how changes in prompt formats affect outputs. The participating models cover both Eastern and Western vendors: Western models include Claude 3.7 Sonnet, Gemini 2.0 Flash, and GPT-4o; Eastern models include GLM-4-Plus, MiniMax-M2, and Kimi K2 Instruct. The study used a four-dimensional evaluation framework (functional accuracy, grammatical correctness, optimization quality, response efficiency). Results show that Claude 3.7 Sonnet leads with an average score of 91.3%, followed closely by Kimi K2 Instruct (88.6%), and there are significant differences in the robustness of different models to prompt changes.

## Research Background: The Importance of Prompt Engineering

In LLM applications, prompt quality directly determines generation results, yet user prompting habits vary widely, from detailed descriptions to minimal expressions. To address this, the Chitkara University team designed a test set of 150 programming tasks; for each task, three prompt formats (structured, semi-structured, and minimal) were prepared to simulate the diversity of real-world usage.
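To make the three formats concrete, here is a hypothetical illustration of what structured, semi-structured, and minimal prompts for one task might look like. The actual benchmark prompts are not reproduced in this summary; the task and wording below are invented for illustration.

```python
# Hypothetical examples of the three prompt formats for a single task.
# These are assumptions for illustration, not the study's actual prompts.

structured_prompt = (
    "Task: Write a Python function `is_palindrome(s: str) -> bool`.\n"
    "Input: a string s.\n"
    "Output: True if s reads the same forwards and backwards, else False.\n"
    "Constraints: ignore case; O(n) time.\n"
    "Include a docstring and one usage example."
)

semi_structured_prompt = (
    "Write a Python function that checks if a string is a palindrome. "
    "Ignore case, and keep it efficient."
)

minimal_prompt = "python palindrome check"

# The formats differ mainly in how much the user specifies explicitly.
for style, text in [("structured", structured_prompt),
                    ("semi-structured", semi_structured_prompt),
                    ("minimal", minimal_prompt)]:
    print(f"{style}: {len(text)} chars")
```

The point of varying format while holding the task fixed is that any difference in output quality can then be attributed to the prompt, not the problem.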

## Evaluation Framework: Four-Dimensional Comprehensive Scoring System

The study built a composite evaluation framework, scoring across four dimensions:
1. Functional accuracy: Whether the code correctly solves the problem (core indicator);
2. Grammatical correctness: Whether there are no syntax errors and the code can be executed directly;
3. Optimization quality: Time/space complexity and algorithm rationality;
4. Response efficiency: Generation speed and resource consumption.
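The four dimensions can be combined into a single score with a weighted average. The weights below are illustrative assumptions (the summary does not state the paper's exact weighting), with functional accuracy weighted highest since it is named the core indicator.

```python
# Sketch of a composite scorer over the four dimensions.
# The weights are assumptions for illustration, not the study's values.

WEIGHTS = {
    "functional_accuracy": 0.40,   # core indicator
    "syntactic_correctness": 0.25,
    "optimization_quality": 0.20,
    "response_efficiency": 0.15,
}

def composite_score(scores: dict) -> float:
    """Weighted average of per-dimension scores, each on a 0-100 scale."""
    assert set(scores) == set(WEIGHTS), "all four dimensions are required"
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)

# Example: a submission that is correct and clean but only moderately optimized.
example = {
    "functional_accuracy": 95.0,
    "syntactic_correctness": 98.0,
    "optimization_quality": 85.0,
    "response_efficiency": 80.0,
}
print(round(composite_score(example), 1))  # → 91.5
```

A weighted sum keeps the composite interpretable: each dimension's contribution to the final percentage is visible from its weight.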

## Participating Models: Eastern and Western Representatives Compete Side by Side

Six representative LLMs were selected:
- Western camp: Claude 3.7 Sonnet (Anthropic), Gemini 2.0 Flash (Google), GPT-4o (OpenAI)
- Eastern camp: GLM-4-Plus (Zhipu AI), MiniMax-M2 (MiniMax), Kimi K2 Instruct (Moonshot AI)

The cross-regional selection makes the results a more broadly useful reference.

## Research Results and Key Findings

**Result Rankings**:
| Rank | Model | Origin | Average Score |
|---|---|---|---|
|1|Claude 3.7 Sonnet|Western|91.3%|
|2|Kimi K2 Instruct|Eastern|88.6%|
|3|Gemini 2.0 Flash|Western|87.0%|
|4|GLM-4-Plus|Eastern|84.2%|
|5|GPT-4o|Western|82.7%|
|6|MiniMax-M2|Eastern|81.5%|

**Key Findings**: Models differ significantly in their sensitivity to prompt format changes, with some showing strong robustness even under minimal prompts. Eastern and Western models also show distinct strengths: in this study, the Eastern models tended to excel in response efficiency, while the Western models scored better on optimization quality.

## Implications for Developers

1. Prompt engineering remains key to output quality; writing clear, structured prompts is a best practice;
2. Model selection should match the scenario: choose Claude 3.7 Sonnet when quality matters most, and Kimi K2 Instruct or Gemini 2.0 Flash to balance quality and cost;
3. Before deployment, conduct prompt robustness testing to evaluate how the model performs under the prompt styles real users actually write.
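The robustness testing suggested above can be sketched as follows: run each prompt variant through the model, record per-task pass/fail against unit tests, and measure how much the pass rate spreads across formats. The data and helper below are hypothetical; in practice the booleans would come from your own model calls and test harness.

```python
# Minimal sketch of a prompt-robustness check. A lower spread of pass rates
# across prompt formats indicates a model more robust to prompt changes.
# The demo data is invented for illustration.

from statistics import mean, pstdev

def robustness(results: dict) -> dict:
    """results maps prompt format -> list of per-task pass/fail booleans."""
    pass_rates = {fmt: mean(outcomes) for fmt, outcomes in results.items()}
    return {
        "pass_rates": pass_rates,
        # Population std dev of pass rates across formats.
        "spread": pstdev(pass_rates.values()),
    }

# Illustrative data: a model whose accuracy degrades on minimal prompts.
demo = {
    "structured":      [True, True, True, True, False],
    "semi-structured": [True, True, True, False, False],
    "minimal":         [True, True, False, False, False],
}
report = robustness(demo)
print(report["pass_rates"])          # structured 0.8, semi 0.6, minimal 0.4
print(round(report["spread"], 3))
```

Comparing this spread across candidate models gives a scenario-specific robustness ranking before committing to one in production.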

## Research Limitations and Future Directions

**Limitations**: 1. The 150 tasks offer limited coverage of programming domains; 2. The evaluation relies mainly on static analysis, with little attention to maintainability and readability.
**Future Directions**: Expand the test set and cover more programming languages and frameworks; explore richer code-quality evaluation dimensions; analyze prompt robustness in depth to inform model training and optimization.
