Zing Forum


Practical Evaluation of Grace Hopper 200: Analysis of React Native Application Generation Capabilities of Five Open-Source Code Models

This study evaluated five open-source code models—Kimi-K2.5, GLM-5.1, Qwen3-Coder-480B, and DeepSeek-V3.2—on the NVIDIA GH200 to assess their practical development capabilities through multi-file React Native application generation tasks. It found that SWE-Bench rankings do not predict performance on this task, that Kimi-K2.5 produced the best output despite aggressive 3-bit quantization, and it surfaced three deployment issues: reasoning-model sampling suspension, thought-trace leakage, and Web adaptation gaps.

Tags: Code generation models · Open-source LLMs · React Native · Model evaluation · SWE-Bench · Model quantization · Cross-platform development
Published 2026-04-19 09:21 · Recent activity 2026-04-21 10:27 · Estimated read 7 min

Section 01

[Introduction] Core Summary of the Grace Hopper 200 Practical Evaluation

This study evaluated the multi-file React Native application generation capabilities of five open-source code models—Kimi-K2.5, GLM-5.1, Qwen3-Coder-480B, and DeepSeek-V3.2—on the NVIDIA GH200. Key findings: SWE-Bench rankings do not predict actual task performance; Kimi-K2.5 produced the best output despite aggressive 3-bit quantization; and three deployment issues emerged: reasoning-model sampling suspension, thought-trace leakage, and Web adaptation gaps.


Section 02

Evaluation Background and Motivation

As open-source code models become developer tools, existing benchmarks (e.g., SWE-Bench) only evaluate isolated code problems and fail to cover complex challenges in real development such as multi-file coordination and cross-platform compatibility. This study designed React Native application generation tasks to assess model capabilities in scenarios close to real-world practice.


Section 03

Evaluation Setup (Hardware, Models, Tasks, Criteria)

  • Hardware platform: NVIDIA GH200 (576GB HBM3e memory)
  • Participating models: Kimi-K2.5 (Q3/Q4 quantization), GLM-5.1, Qwen3-Coder-480B, DeepSeek-V3.2
  • Evaluation task: generate multi-file React Native applications with user authentication, daily counting, and Web compatibility
  • Evaluation criteria: out-of-the-box usability (runs directly without fixes) and functional correctness
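To make the "daily counting" requirement concrete, here is a minimal sketch of the reset-on-new-day logic such an app needs. The names and state shape are hypothetical, not taken from any model's output:

```typescript
// Hypothetical core of the "daily counting" feature: the count resets
// whenever the stored date differs from today. Kept as a pure function
// so it runs (and can be checked) outside a React Native component.

interface CounterState {
  date: string;   // ISO date, e.g. "2026-04-19"
  count: number;
}

function increment(state: CounterState, today: string): CounterState {
  if (state.date !== today) {
    return { date: today, count: 1 }; // new day: start over
  }
  return { date: state.date, count: state.count + 1 };
}
```

Keeping the date comparison out of the UI layer is also what makes the same logic reusable on native and Web targets.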


Section 04

Key Findings: Limitations of SWE-Bench and Unexpected Performance of Kimi-K2.5

  1. SWE-Bench Disconnects from Actual Performance: Models ranked highly on standard benchmarks may not perform well on real tasks; existing benchmarks may focus too narrowly on isolated problems, so model selection should not rely on a single benchmark alone.
  2. Unexpected Win of Kimi-K2.5: Its output under the 3-bit quantization (UD-Q3_K_XL) configuration was the most complete and standards-compliant, surpassing models with higher SWE-Bench scores—suggesting that aggressive quantization does not necessarily reduce quality, and pointing to both the role of architectural efficiency and the limitations of current evaluation metrics.
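A back-of-envelope calculation shows why 3-bit quantization matters at this scale. The bit widths below are illustrative averages for Q3/Q4-class mixes, not measured figures from the evaluation:

```typescript
// Approximate weight footprint at a given average bits per parameter.
// (Illustrative arithmetic only; real quantized files also carry
// scales and metadata that add a few percent.)

function weightGiB(params: number, bitsPerParam: number): number {
  return (params * bitsPerParam) / 8 / 1024 ** 3;
}

const P = 480e9; // a Qwen3-Coder-480B-scale parameter count
const fp16 = weightGiB(P, 16);  // ~894 GiB: exceeds the GH200's 576GB
const q4 = weightGiB(P, 4.5);   // ~251 GiB
const q3 = weightGiB(P, 3.5);   // ~196 GiB: fits with room for KV cache
```

In other words, on this hardware a 480B-class model is only deployable at all in aggressively quantized form, which is what makes the quality of Kimi-K2.5's Q3 output notable.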

Section 05

Three Deployment Issues: Sampling Suspension, Thought Trace Leakage, and Web Adaptation Gaps

Finding 1: Temperature=0 Causes Sampling Suspension

  • Reasoning models tend to get stuck in loops at temperature=0 (fully deterministic sampling); values between 0.1 and 0.2 are recommended.

Finding 2: Risk of Thought Trace Leakage

  • Model thought traces may leak sensitive information through tool parsers; filtering mechanisms need to be added to the toolchain.

Finding 3: Web Adaptation Gap

  • All models insufficiently consider React Native Web compatibility and tend to generate code only for native platforms, reflecting a lack of cross-platform practice in the training data.
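The temperature finding translates directly into a guard rail when building requests. A minimal sketch, assuming the common OpenAI-compatible chat-completions request shape (not the article's exact serving stack):

```typescript
// Never send temperature=0 to a reasoning model: the evaluation found
// fully deterministic sampling can stall in loops. Clamp into the
// recommended 0.1-0.2 band instead.

interface SamplingOptions {
  temperature?: number;
}

function safeTemperature(opts: SamplingOptions): number {
  const t = opts.temperature ?? 0.15; // default inside the safe band
  return Math.min(0.2, Math.max(0.1, t));
}

// Example request body (model name and message are placeholders):
const body = {
  model: "kimi-k2.5",
  messages: [{ role: "user", content: "Generate App.tsx" }],
  temperature: safeTemperature({ temperature: 0 }), // 0 becomes 0.1
};
```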
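The thought-trace finding suggests a filter in front of the tool parser. A minimal sketch; the `<think>...</think>` delimiter is an assumption, since the actual marker varies by model and serving stack:

```typescript
// Strip a model's reasoning spans before its output reaches a tool
// parser, so private thought traces cannot leak into tool arguments.

function stripThoughtTrace(output: string): string {
  // Non-greedy match so each trace is removed individually.
  return output.replace(/<think>[\s\S]*?<\/think>/g, "").trim();
}
```

In a real toolchain this would run on every model turn before argument parsing, and ideally the raw trace would never be logged either.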

Section 06

Hardware Hierarchy: Efficiency School vs. Scale School

In April 2026, the hardware hierarchy of open-source coding models divides into two schools:

  • Efficiency School: 10-15B active parameters, low hardware cost, SWE-Bench results comparable to the Scale School.
  • Scale School: 32-40B active parameters, hardware cost about 7 times that of the Efficiency School, similar SWE-Bench scores.
  • Cost-effectiveness: the Efficiency School provides comparable benchmark results at 1/7 the cost, which is sufficient for most scenarios.


Section 07

Implications for Development Practice: Model Selection, Deployment, and Training Data Improvement

  • Model selection strategy: go beyond a single benchmark, run actual tests on the target tasks, and explore aggressive quantization configurations.
  • Deployment notes: avoid temperature=0, add thought-trace filtering, and verify cross-platform code.
  • Training data improvement: add cross-platform practice and Web-compatibility examples to balance multi-platform coverage.
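For the "verify cross-platform code" note, the usual React Native pattern is a platform switch. A minimal sketch in the style of react-native's `Platform.select`, with the platform injected as a parameter so the logic runs outside a React Native runtime (in an app it would come from `Platform.OS`):

```typescript
// Minimal Platform.select-style switch. Injecting the OS keeps the
// branching logic checkable without a device or browser.

type OS = "ios" | "android" | "web";

function select<T>(os: OS, spec: Partial<Record<OS, T>> & { default: T }): T {
  return spec[os] ?? spec.default;
}

// E.g. choose a storage backend per platform (module names illustrative):
const storageFor = (os: OS) =>
  select(os, { web: "localStorage", default: "AsyncStorage" });
```

Code generated only for native platforms typically lacks exactly this kind of `web` branch, which is the adaptation gap the evaluation observed.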


Section 08

Conclusions and Future Research Directions

  • Conclusions: by evaluating a practical application task, this study identified the limitations of SWE-Bench, the strong performance of Kimi-K2.5 under quantization, and three deployment issues, offering guidance for development practice.
  • Limitations: a single task in a specific domain; results age quickly as models are updated.
  • Future directions: expand the task set, continuously track model capabilities, and collect developer feedback.