# Practical Evaluation of Grace Hopper 200: Analysis of React Native Application Generation Capabilities of Five Open-Source Code Models

> This study evaluated five open-source code model configurations—Kimi-K2.5 (in Q3 and Q4 quantization), GLM-5.1, Qwen3-Coder-480B, and DeepSeek-V3.2—on the NVIDIA GH200, assessing practical development capability through multi-file React Native application generation tasks. Key findings: SWE-Bench rankings do not predict task performance; Kimi-K2.5 produced the best output even under aggressive 3-bit quantization; and three deployment issues surfaced: reasoning models stalling at temperature=0, thought-trace leakage, and gaps in Web adaptation.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-19T01:21:02.000Z
- Last activity: 2026-04-21T02:27:28.241Z
- Heat: 108.9
- Keywords: code generation models, open-source LLMs, React Native, model evaluation, SWE-Bench, model quantization, cross-platform development
- Page link: https://www.zingnex.cn/en/forum/thread/grace-hopper-200-react-native
- Canonical: https://www.zingnex.cn/forum/thread/grace-hopper-200-react-native
- Markdown source: floors_fallback

---


## Evaluation Background and Motivation

As open-source code models become developer tools, existing benchmarks (e.g., SWE-Bench) only evaluate isolated code problems and fail to cover complex challenges in real development such as multi-file coordination and cross-platform compatibility. This study designed React Native application generation tasks to assess model capabilities in scenarios close to real-world practice.

## Evaluation Setup (Hardware, Models, Tasks, Criteria)

- **Hardware Platform**: NVIDIA GH200 (576 GB of combined CPU + GPU memory, LPDDR5X plus HBM3)
- **Participating Models**: Kimi-K2.5 (Q3 and Q4 quantization), GLM-5.1, Qwen3-Coder-480B, DeepSeek-V3.2
- **Evaluation Tasks**: generate a multi-file React Native application with user authentication, a daily counter, and Web compatibility
- **Evaluation Criteria**: out-of-the-box usability (runs without manual fixes) and functional correctness

## Key Findings: Limitations of SWE-Bench and Unexpected Performance of Kimi-K2.5

1. **SWE-Bench Rankings Disconnect from Task Performance**: models that rank highly on standard benchmarks did not necessarily perform well on the actual task. Existing benchmarks over-weight isolated problems, so model selection should not rely on a single leaderboard.
2. **Unexpected Win for Kimi-K2.5**: its output under 3-bit quantization (UD-Q3_K_XL) was the most complete and well-structured, surpassing models with higher SWE-Bench scores. This suggests aggressive quantization need not degrade output quality, and underscores the limits of judging models by benchmark score alone.
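
As a rough sanity check on why the GH200's large memory pool accommodates these quantization levels, a quantized model's weight footprint can be estimated from its parameter count and average bits per weight. A minimal sketch (illustrative only; real GGUF quants such as UD-Q3_K_XL mix bit widths per tensor and add metadata overhead):

```typescript
// Approximate in-memory weight size for a quantized model.
// totalParamsB: total parameter count in billions.
// bitsPerWeight: average quantization width in bits.
function approxWeightGiB(totalParamsB: number, bitsPerWeight: number): number {
  const bytes = totalParamsB * 1e9 * (bitsPerWeight / 8);
  return bytes / 2 ** 30; // bytes -> GiB
}

// Qwen3-Coder-480B at an assumed ~3.5 bits/weight comes to roughly 196 GiB
// of weights, which fits in a single GH200's memory pool.
console.log(approxWeightGiB(480, 3.5).toFixed(1)); // → "195.6"
```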

## Three Deployment Issues: Sampling Suspension, Thought Trace Leakage, and Web Adaptation Gaps

**Finding 1: temperature=0 Causes Sampling Stalls**
- Reasoning models tend to get stuck in repetition loops under fully deterministic sampling (temperature=0); values between 0.1 and 0.2 are recommended instead.
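
To make the guard concrete, a sketch of how a client might clamp the sampling temperature before calling an OpenAI-compatible endpoint (the model name and the 0.1–0.2 band are assumptions taken from the finding above, not a fixed API):

```typescript
// Build a chat request for an OpenAI-compatible server (e.g. a local
// llama.cpp `llama-server`), clamping temperature away from 0, which
// stalled reasoning models in this evaluation.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };
type ChatRequest = { model: string; messages: ChatMessage[]; temperature: number };

function buildRequest(prompt: string, temperature = 0.15): ChatRequest {
  // Clamp into the recommended 0.1–0.2 band; temperature=0 risks loops.
  const t = Math.min(0.2, Math.max(0.1, temperature));
  return {
    model: "kimi-k2.5-q3", // hypothetical local model name
    messages: [{ role: "user", content: prompt }],
    temperature: t,
  };
}
```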
**Finding 2: Thought-Trace Leakage**
- Model reasoning traces can leak sensitive content through tool-call parsers; the toolchain should filter traces before parsing.
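
One way to apply such filtering is to strip reasoning spans before output reaches the tool-call parser. A minimal sketch, assuming the model wraps its traces in `<think>…</think>` tags (the tag convention varies by model):

```typescript
// Remove <think>...</think> reasoning spans from raw model output before
// any tool-call parsing, so thought traces cannot leak through tool arguments.
// The <think> tag convention is an assumption; adapt it to your model's format.
function stripThoughtTraces(raw: string): string {
  return raw.replace(/<think>[\s\S]*?<\/think>/g, "").trim();
}
```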
**Finding 3: Web Adaptation Gap**
- All models under-handled React Native Web compatibility and tended to generate native-only code, reflecting a shortage of cross-platform examples in training data.
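
The missing branch typically looks like a platform dispatch. A minimal sketch using a local stand-in for React Native's `Platform.select` (in a real app you would import `Platform` from `react-native`; the storage-backend names here are illustrative):

```typescript
// Local stand-in for React Native's Platform.select, showing the Web
// branch the evaluated models tended to omit.
type OS = "ios" | "android" | "web";

function select<T>(os: OS, spec: Partial<Record<OS, T>> & { default?: T }): T | undefined {
  return spec[os] ?? spec.default;
}

// Example: the daily-counter task needs a Web-aware storage backend,
// not just the native one.
function storageBackend(os: OS): string | undefined {
  return select(os, {
    web: "localStorage",     // react-native-web target
    default: "AsyncStorage", // iOS / Android
  });
}
```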

## Hardware Hierarchy: Efficiency School vs. Scale School

In April 2026, the hardware hierarchy of open-source coding models is divided into two schools:
- **Efficiency School**: 10–15B active parameters, low hardware cost, SWE-Bench results comparable to the Scale School.
- **Scale School**: 32–40B active parameters, roughly 7× the hardware cost of the Efficiency School for similar SWE-Bench scores.
- **Cost-Effectiveness**: the Efficiency School delivers comparable benchmark results at about 1/7 the cost, which is sufficient for most scenarios.

## Implications for Development Practice: Model Selection, Deployment, and Training Data Improvement

- **Model Selection Strategy**: go beyond any single benchmark, test on the actual target task, and explore aggressive quantization configurations.
- **Deployment Notes**: avoid temperature=0, filter thought traces, and verify cross-platform code.
- **Training Data Improvement**: add cross-platform and Web-compatibility examples to balance multi-platform coverage.

## Conclusions and Future Research Directions

- **Conclusions**: by evaluating a practical application task, this study surfaced the limitations of SWE-Bench, the strong quantized performance of Kimi-K2.5, and three deployment issues, offering concrete guidance for development practice.
- **Limitations**: a single task in one domain; results will age quickly as models update.
- **Future Directions**: expand the task set, track model capabilities over time, and collect developer feedback.
