East vs. West Large Models Code Capability Showdown: How Prompt Changes Affect Generation Quality

A study from Chitkara University in India systematically evaluated the performance of six mainstream large language models (LLMs) in code generation tasks, with a special focus on how changes in prompt formats affect model outputs. The study used a composite evaluation framework, scoring comprehensively across four dimensions: functional accuracy, grammatical correctness, optimization quality, and response efficiency.

Tags: Large Language Models, Code Generation, Prompt Engineering, Model Evaluation, Claude, Kimi, GPT-4o, Gemini, AI Programming, Software Engineering
Published 2026-05-13 03:12 · Last activity 2026-05-13 03:18 · Estimated read: 6 min

Section 01

Overview

A study by Chitkara University in India evaluated the code generation performance of six mainstream large language models (LLMs), focusing on how changes in prompt formats affect outputs. The participating models cover both Eastern and Western vendors: Western models include Claude 3.7 Sonnet, Gemini 2.0 Flash, and GPT-4o; Eastern models include GLM-4-Plus, MiniMax-M2, and Kimi K2 Instruct. The study used a four-dimensional evaluation framework (functional accuracy, grammatical correctness, optimization quality, response efficiency). Results show that Claude 3.7 Sonnet leads with an average score of 91.3%, followed closely by Kimi K2 Instruct (88.6%), and there are significant differences in the robustness of different models to prompt changes.

Section 02

Research Background: The Importance of Prompt Engineering

In LLM applications, the quality of prompts directly determines generation results, but user prompt habits vary greatly (detailed descriptions vs. minimal expressions). The Chitkara University team conducted this study to address this issue, designing a test set containing 150 programming tasks. For each task, three prompt formats (structured, semi-structured, and minimal) were prepared to simulate the diversity of real-world scenarios.
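
The paper's exact templates are not reproduced in this summary, but a minimal sketch (hypothetical wording, in Python) shows how the three formats might differ for a single task:

```python
# Hypothetical illustration of the three prompt formats; the study's
# actual templates are not reproduced in this summary.
TASK = "Return the k most frequent elements of an integer list."

PROMPTS = {
    # Structured: explicit role, constraints, and output requirements.
    "structured": (
        "You are a senior Python developer.\n"
        f"Task: {TASK}\n"
        "Requirements:\n"
        "- Function signature: top_k(nums: list[int], k: int) -> list[int]\n"
        "- Target O(n log k) time complexity.\n"
        "- Include a docstring and handle k <= 0."
    ),
    # Semi-structured: the task statement plus one loose hint.
    "semi_structured": f"{TASK} Use Python and keep it efficient.",
    # Minimal: the bare request, as a hurried user might type it.
    "minimal": "k most frequent elements python",
}
```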

Section 03

Evaluation Framework: Four-Dimensional Comprehensive Scoring System

The study built a composite evaluation framework, scoring across four dimensions (a scoring sketch follows the list):

  1. Functional accuracy: whether the code correctly solves the problem (the core indicator);
  2. Grammatical correctness: whether the code is free of syntax errors and can be executed directly;
  3. Optimization quality: time/space complexity and the soundness of the chosen algorithm;
  4. Response efficiency: generation speed and resource consumption.
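
This summary does not state how the four dimensions are weighted, so the sketch below assumes an unweighted mean; equal weights are an assumption, and the study may well weight functional accuracy (its core indicator) more heavily.

```python
from dataclasses import dataclass

@dataclass
class Scores:
    """Per-task scores on the four dimensions, each normalized to 0-100."""
    functional_accuracy: float
    grammatical_correctness: float
    optimization_quality: float
    response_efficiency: float

def composite_score(s: Scores) -> float:
    # Equal weighting is an assumption; the paper's aggregation
    # may emphasize functional accuracy more heavily.
    dims = (
        s.functional_accuracy,
        s.grammatical_correctness,
        s.optimization_quality,
        s.response_efficiency,
    )
    return sum(dims) / len(dims)

# Example: a hypothetical single-task result.
print(composite_score(Scores(95, 100, 85, 80)))  # -> 90.0
```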

Section 04

Participating Models: Eastern and Western Representatives Compete Side by Side

Six representative LLMs were selected. Western camp: Claude 3.7 Sonnet (Anthropic), Gemini 2.0 Flash (Google), and GPT-4o (OpenAI). Eastern camp: GLM-4-Plus (Zhipu AI), MiniMax-M2 (MiniMax), and Kimi K2 Instruct (Moonshot AI). The cross-regional selection makes the results a more useful reference.

Section 05

Research Results and Key Findings

Result Rankings:

| Rank | Model | Origin | Average Score |
|------|-------------------|---------|---------------|
| 1 | Claude 3.7 Sonnet | Western | 91.3% |
| 2 | Kimi K2 Instruct | Eastern | 88.6% |
| 3 | Gemini 2.0 Flash | Western | 87.0% |
| 4 | GLM-4-Plus | Eastern | 84.2% |
| 5 | GPT-4o | Western | 82.7% |
| 6 | MiniMax-M2 | Eastern | 81.5% |

Key Findings: the models differ significantly in their sensitivity to prompt format changes, with some showing strong robustness; Eastern and Western models also have distinct strengths (e.g., Eastern models excel in response efficiency, while Western models lead in optimization quality). One way to quantify this sensitivity is sketched below.
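
The summary does not define a robustness metric, so the sketch below assumes one natural choice: the spread of a model's composite scores across the three prompt formats. The numbers are illustrative, not from the paper.

```python
from statistics import mean, pstdev

# Hypothetical per-format composite scores for two models; illustrative
# numbers only, since the paper's per-format breakdown is not reproduced here.
scores_by_format = {
    "robust_model": {"structured": 93.0, "semi_structured": 91.5, "minimal": 89.5},
    "sensitive_model": {"structured": 92.0, "semi_structured": 84.0, "minimal": 71.0},
}

for model, by_fmt in scores_by_format.items():
    vals = list(by_fmt.values())
    # Low spread across formats = robust to prompt changes; a large drop
    # from structured to minimal prompts signals format sensitivity.
    print(f"{model}: mean={mean(vals):.1f}, spread={pstdev(vals):.1f}")
```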

Section 06

Implications for Developers

  1. Prompt engineering remains key to output quality; writing clear, structured prompts is a best practice;
  2. Model selection should be scenario-specific: choose Claude 3.7 Sonnet when quality matters most, and Kimi K2 Instruct or Gemini 2.0 Flash to balance quality and cost;
  3. Before deployment, run prompt robustness tests to evaluate performance under real user phrasing (see the sketch after this list).
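
As a starting point for item 3, here is a minimal robustness check. It is a sketch only: `generate_code` is a hypothetical stand-in for whatever model client you use, and the pass criterion is simply that the returned code executes and defines the expected function; a real harness should sandbox execution and run unit tests instead.

```python
def generate_code(prompt: str) -> str:
    """Hypothetical stand-in for your model client (e.g., an HTTP API call)."""
    raise NotImplementedError

def passes(code: str, func_name: str) -> bool:
    """Crude pass criterion: the code executes and defines the expected function."""
    namespace: dict = {}
    try:
        # Never exec untrusted model output outside a sandbox in production.
        exec(code, namespace)
    except Exception:
        return False
    return callable(namespace.get(func_name))

def robustness_check(prompts: dict[str, str], func_name: str) -> dict[str, bool]:
    """Run every prompt variant and report which formats still yield working code."""
    return {fmt: passes(generate_code(p), func_name) for fmt, p in prompts.items()}
```

A model that passes only under the structured format is a deployment risk wherever real users tend to write minimal prompts.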

Section 07

Research Limitations and Future Directions

Limitations:

  1. The 150 tasks offer limited coverage;
  2. The evaluation relies mainly on static analysis, with little consideration of maintainability or readability.

Future Directions: expand the test set and cover more programming languages and frameworks; explore more comprehensive code quality evaluation dimensions; analyze prompt robustness in depth to guide model training and optimization.