MCIBench: A Multilingual Code Intelligence Evaluation Benchmark for Systematically Assessing Large Models' Cross-Language Programming Capabilities

The ICTT team from Xidian University released the MCIBench benchmark, covering multiple programming languages, comprehensively evaluating large language models' multilingual code understanding, generation, and reasoning capabilities, and revealing the deep mechanisms of cross-language transfer learning.

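Tags: code intelligence, multilingual evaluation, large language models, benchmark, cross-language transfer, software engineering, code generation, Xidian University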
Published 2026-05-20 15:42 | Recent activity 2026-05-20 15:48 | Estimated read 7 min

Section 01

MCIBench: Introduction to the Multilingual Code Intelligence Evaluation Benchmark

The ICTT team from Xidian University has released MCIBench (Multilingual Code Intelligence Benchmark), an evaluation benchmark covering multiple programming languages. It assesses large language models' code understanding, generation, and reasoning across those languages. Its stated goals are to fill the standardization gap in multilingual code evaluation, shed light on the mechanisms of cross-language transfer learning, and support model optimization, tool selection, and academic research.

Section 02

Practical Challenges and Evaluation Needs of Multilingual Programming

In global software development, multiple languages (e.g., Python, JavaScript, Go, Rust, Java) routinely coexist, which places high demands on the cross-language skills of both developers and AI programming assistants. Current mainstream large models perform well on Python tasks, but their performance degrades significantly in other languages. This points to unevenly distributed training data and imperfect cross-language transfer, and it motivates a systematic, standardized evaluation framework for multilingual code intelligence.

Section 03

Overview of the MCIBench Project

MCIBench is developed by the ICTT-GZ team of Xidian University. It is a comprehensive evaluation benchmark that balances breadth (covering the ecosystems of multiple languages) with depth (breaking code intelligence down into several dimensions). Its core value lies in filling the standardization gap in multilingual code evaluation, pointing model developers toward areas to optimize, and giving users data to support their choice of AI programming tools.

Section 04

Evaluation Dimensions and Methodology of MCIBench

The evaluation dimensions include:

1. Code understanding (semantic analysis, variable tracking, etc.)
2. Code generation (functional correctness, style consistency, etc.)
3. Cross-language transfer (comparing performance on language-agnostic algorithm tasks)
4. Reasoning and debugging (code review, defect localization, etc.)

The methodology combines automated testing for objective verification with manual evaluation for subjective factors.
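The article does not say which metric MCIBench uses to score functional correctness. As one common option, the sketch below implements the unbiased pass@k estimator popularized by HumanEval, which the benchmark is described as compatible with. The function names `pass_at_k` and `mean_pass_at_k` are illustrative assumptions, not part of MCIBench.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of samples generated for the task
    c: number of those samples that passed the automated tests
    k: number of samples the user is assumed to draw

    Returns the probability that at least one of k samples, drawn
    without replacement from the n generated, passes the tests.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; each entry is an (n, c) pair (hypothetical layout)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Computing this separately per language would give one comparable score per language on the same task set, which is the kind of comparison the cross-language transfer dimension calls for.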

Section 05

Technical Implementation and Dataset Construction

MCIBench adopts a modular architecture in which components such as data loading and model interfaces are decoupled. The dataset draws on three sources: samples from open-source code repositories (high-quality GitHub code that has passed a copyright review), manually annotated tasks (with reference answers written by professional developers), and existing benchmarks (compatibility with HumanEval, MBPP, etc.). Preprocessing includes deduplication, removal of sensitive information, and syntax verification, and a continuous update mechanism keeps the data current.
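The article gives no implementation details for these preprocessing steps. As a minimal sketch of what deduplication and syntax verification could look like, the example below handles Python samples only, using the standard-library `ast` and `hashlib` modules. The helper names (`normalize`, `is_valid_python`, `preprocess`) and the whitespace-based normalization are assumptions for illustration; other languages would need their own parsers.

```python
import ast
import hashlib


def normalize(source: str) -> str:
    """Drop trailing whitespace and blank lines so trivial copies hash alike."""
    lines = [line.rstrip() for line in source.splitlines()]
    return "\n".join(line for line in lines if line)


def is_valid_python(source: str) -> bool:
    """Return True if the sample parses as Python source."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return True


def preprocess(samples: list[str]) -> list[str]:
    """Keep the first copy of each distinct sample that also parses."""
    seen: set[str] = set()
    kept: list[str] = []
    for sample in samples:
        digest = hashlib.sha256(normalize(sample).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # near-identical copy of an earlier sample
        seen.add(digest)
        if is_valid_python(sample):
            kept.append(sample)
    return kept
```

Exact-hash deduplication only catches copies that differ in whitespace; a real pipeline would likely need fuzzier matching, which is outside the scope of this sketch.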

Section 06

Experimental Findings and Key Insights

Preliminary experiments reveal three patterns:

1. Language proficiency follows a power-law distribution: models do best on high-frequency languages such as Python and show clear gaps on less common languages such as Rust.
2. Cross-language transfer is asymmetric: performance decays significantly when moving from high-frequency to low-frequency languages, while the reverse direction yields only limited improvement.
3. Language sensitivity varies by task type: code completion is relatively insensitive to the language, whereas complex algorithm generation depends strongly on it.
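The article reports these findings qualitatively and does not give a formula for the gap between languages. One simple way to quantify it, sketched below under that assumption, is the relative drop in pass rate against a reference language. The function names and the input layout (a list of pass/fail outcomes per language on the same tasks) are hypothetical.

```python
def pass_rate(outcomes: list[bool]) -> float:
    """Fraction of tasks whose generated solution passed its tests."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(outcomes) / len(outcomes)


def relative_gap(results: dict[str, list[bool]], reference: str) -> dict[str, float]:
    """Relative drop in pass rate of each language versus the reference.

    0.0 means parity with the reference language; 0.5 means the pass
    rate is half of the reference's; negative values mean the language
    outperforms the reference.
    """
    ref_rate = pass_rate(results[reference])
    if ref_rate == 0:
        raise ValueError("reference pass rate is zero; the gap is undefined")
    return {
        lang: (ref_rate - pass_rate(outcomes)) / ref_rate
        for lang, outcomes in results.items()
        if lang != reference
    }


# Toy placeholder outcomes for illustration only, not MCIBench results.
toy = {
    "python": [True, True, True, True],
    "rust": [True, False, False, True],
}
gaps = relative_gap(toy, reference="python")
```

Running the same calculation per task type (for example, completion versus algorithm generation) would expose the differences in language sensitivity described in the third finding.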

Section 07

Application Scenarios and Ecosystem Value

For model developers, MCIBench offers fine-grained capability diagnosis to guide training data collection and fine-tuning. For teams selecting tools, it serves as a reference when choosing AI programming assistants for multilingual projects. For academic research, it provides a public experimental platform that supports cross-institutional comparison and methodological progress.

Section 08

Future Outlook and Community Collaboration

The roadmap has three stages. In the short term, the team plans to expand language coverage; in the mid term, to introduce project-level evaluation tasks; and in the long term, to establish a cross-modal code intelligence evaluation system. As open infrastructure, MCIBench welcomes community contributions and collaboration to push the boundaries of AI programming capabilities.