# From Data to Code: A Systematic Study on Quality Issues of Code Large Language Models

> The Software Engineering Laboratory of Sun Yat-sen University reviewed 114 papers, established a causal mapping framework between training data quality and generated code quality, and revealed how data defects propagate into code defects.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-05-10T13:53:11.000Z
- Last activity: 2026-05-10T14:03:05.576Z
- Popularity: 150.8
- Keywords: Code LLMs, data quality, code quality, systematic review, machine learning, causal mapping, software engineering, Sun Yat-sen University
- Page link: https://www.zingnex.cn/en/forum/thread/geo-github-sysuselab-from-data-to-code
- Canonical: https://www.zingnex.cn/forum/thread/geo-github-sysuselab-from-data-to-code

---

## Introduction: Core Insights from the Systematic Study on Quality Issues of Code Large Language Models

The Software Engineering Laboratory of Sun Yat-sen University reviewed 114 papers, established a causal mapping framework between training data quality and generated code quality, and revealed how data defects propagate into code defects. The study proposes nine code quality dimensions, a classification system for training data quality issues, and 18 propagation mapping mechanisms, providing a systematic framework for improving the quality of code large language models.

## Research Background: The Neglected Upstream Data Issues of Code Large Language Models

Defects in modern Code Large Language Models (Code LLMs) often originate from upstream issues in training or fine-tuning data: low-quality training signals such as vulnerable snippets, noisy text, duplicate samples, distribution gaps, privacy leaks, and benchmark contamination. Through a systematic review of 114 related papers, the research team established the first complete causal chain from "problematic data" to "problematic code".

## Core Contributions: Code Quality Dimensions and Data Issue Classification System

### Nine Code Quality Dimensions
1. Correctness (syntax errors, logical flaws, API misuse)
2. Security (design flaws, external vulnerabilities)
3. Compliance (copyright infringement, privacy leaks, malicious code)
4. Robustness (insufficient error handling, boundary condition failures)
5. Maintainability (disorganized structure, low reusability)
6. Understandability (poor naming conventions, lack of documentation)
7. Efficiency (suboptimal time complexity, improper memory management)
8. Output Conciseness (redundant logic, useless loops)
9. Others (failure to follow instructions)
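
As a concrete, non-authoritative illustration of how these dimensions might be used in a review workflow, the Python sketch below tags findings on generated code with one of the nine dimensions and counts them per dimension. The enum values and the `Finding`/`summarize` names are illustrative, not from the paper.

```python
from dataclasses import dataclass
from enum import Enum

class QualityDimension(Enum):
    """The nine quality dimensions, paraphrased as tags for review findings."""
    CORRECTNESS = "correctness"
    SECURITY = "security"
    COMPLIANCE = "compliance"
    ROBUSTNESS = "robustness"
    MAINTAINABILITY = "maintainability"
    UNDERSTANDABILITY = "understandability"
    EFFICIENCY = "efficiency"
    OUTPUT_CONCISENESS = "output_conciseness"
    OTHER = "other"

@dataclass
class Finding:
    """One quality issue observed in a generated snippet."""
    dimension: QualityDimension
    detail: str

def summarize(findings: list) -> dict:
    """Count findings per dimension so recurring weaknesses stand out."""
    counts = {}
    for f in findings:
        counts[f.dimension.value] = counts.get(f.dimension.value, 0) + 1
    return counts

# Example: two findings on one generated function
report = summarize([
    Finding(QualityDimension.CORRECTNESS, "off-by-one in loop bound"),
    Finding(QualityDimension.ROBUSTNESS, "no handling for empty input"),
])
print(report)  # {'correctness': 1, 'robustness': 1}
```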

### Classification of Training Data Quality Issues
- Code attribute issues: Vulnerable code, duplicate code, low-quality code, API misuse examples
- Non-code attribute issues: Natural language noise, distribution bias, privacy leaks, benchmark contamination

## Propagation Mapping Mechanisms: Paths from Data Defects to Code Defects

The study identifies 18 typical propagation mapping mechanisms that reveal how data defects transform into code defects; representative examples include:
- Memory effect: The model remembers and reproduces vulnerability patterns in training data
- Distribution shift: Differences between training data and target scenario distribution lead to generated code being unsuitable
- Noise amplification: Minor noise in training data is amplified into obvious errors
- Context contamination: Benchmark test data mixed into the training set leads to inflated evaluation results

These mechanisms provide a theoretical framework for understanding code quality issues.
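
One way to make the "context contamination" mechanism concrete is a token n-gram overlap check between benchmark items and the training corpus. The sketch below is only an illustration of the idea, not the study's method; the 13-token window and the `contamination_score` name are assumptions.

```python
def ngrams(text: str, n: int = 13) -> set:
    """Token-level n-grams; a window of roughly a dozen tokens is a common
    choice for contamination screening (the exact value is an assumption)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_score(benchmark_sample: str, training_corpus: list, n: int = 13) -> float:
    """Fraction of the benchmark sample's n-grams that also occur in the corpus.

    A high score suggests the sample (or a near copy) leaked into the training
    data, which would inflate evaluation results.
    """
    bench = ngrams(benchmark_sample, n)
    if not bench:
        return 0.0
    corpus_grams = set()
    for doc in training_corpus:
        corpus_grams |= ngrams(doc, n)
    return len(bench & corpus_grams) / len(bench)
```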

## Detection and Governance Strategies: Full-Lifecycle Quality Assurance

### Code-level Detection
Static analysis tools, dynamic execution testing, security vulnerability scanning
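
A minimal sketch of the first two layers for Python output, assuming generated code arrives as a string and its tests as a separate string; the function names are illustrative, and security vulnerability scanning would be delegated to a dedicated scanner rather than reimplemented here.

```python
import ast
import subprocess
import sys
import tempfile

def static_syntax_check(code: str) -> list:
    """Cheapest static gate: reject generated Python that does not even parse."""
    try:
        ast.parse(code)
        return []
    except SyntaxError as exc:
        return [f"syntax error at line {exc.lineno}: {exc.msg}"]

def dynamic_test_check(code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Dynamic gate: write the snippet plus its tests to a temp file and run them.

    In practice this should execute inside a sandbox or container; a bare
    subprocess is only a sketch.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```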

### Data-level Detection
Data deduplication algorithms, quality scoring models, privacy leak detection
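
A minimal sketch of two of these checks, assuming training samples are plain strings: exact deduplication by content hash and a regex screen for obvious credential leaks. The patterns are illustrative placeholders; a real pipeline would add near-duplicate detection (e.g. MinHash) and a dedicated secret scanner.

```python
import hashlib
import re

# Illustrative secret patterns only; a production pipeline would use a dedicated detector.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                       # AWS access-key-like token
    re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"),  # embedded private key header
]

def exact_dedup(samples: list) -> list:
    """Drop byte-identical duplicates by content hash."""
    seen = set()
    unique = []
    for s in samples:
        digest = hashlib.sha256(s.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(s)
    return unique

def has_privacy_leak(sample: str) -> bool:
    """Flag samples that appear to embed credentials or private keys."""
    return any(p.search(sample) for p in SECRET_PATTERNS)
```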

### Code-level Mitigation
Post-generation filtering, iterative refinement, Reinforcement Learning from Human Feedback (RLHF)
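
A minimal sketch of post-generation filtering combined with a small iterative-refinement loop, assuming hypothetical `generate` and `check` callables supplied by the caller. RLHF is out of scope here because it changes how the model is trained rather than how its outputs are filtered.

```python
from typing import Callable, Optional

def filter_and_refine(
    generate: Callable[[str], str],     # prompt -> candidate code (hypothetical model wrapper)
    check: Callable[[str], list],       # candidate -> list of quality findings
    prompt: str,
    max_rounds: int = 3,
) -> Optional[str]:
    """Post-generation filtering with a small iterative-refinement loop.

    Each round feeds the previous findings back into the prompt; a candidate is
    returned only if it passes every check within the round budget.
    """
    current_prompt = prompt
    for _ in range(max_rounds):
        candidate = generate(current_prompt)
        findings = check(candidate)
        if not findings:
            return candidate
        current_prompt = (
            prompt + "\n# Fix the following issues:\n# " + "\n# ".join(findings)
        )
    return None  # reject rather than ship a candidate that still fails checks
```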

### Data-level Mitigation
Data cleaning, curriculum learning, adversarial data augmentation
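
A minimal sketch combining data cleaning with a simple curriculum, assuming each sample already carries a quality score and a difficulty estimate from upstream scoring models; the 0.5 quality threshold and the easy-to-hard ordering are assumptions, not the paper's recipe.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    code: str
    quality_score: float  # e.g. from a quality scoring model, higher is better
    difficulty: float     # e.g. a length or complexity proxy

def clean_and_order(samples: list, min_quality: float = 0.5) -> list:
    """Data cleaning followed by a simple curriculum: drop low-quality samples,
    then order the rest from easy to hard so early training sees simpler code."""
    kept = [s for s in samples if s.quality_score >= min_quality]
    return sorted(kept, key=lambda s: s.difficulty)
```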

## Methodological Shifts and Future Challenge Directions

### Methodological Shifts
Quality assurance shifts from reactive post-generation filtering to proactive data-centric governance and closed-loop repair:
- Prioritize problem prevention at the training data stage
- Establish full-lifecycle quality monitoring
- Focus on data quality rather than just model scale
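
As an illustration of what proactive, data-centric governance can look like in a pipeline, the sketch below gates a training run on data-quality metrics instead of filtering bad outputs afterwards; the metric names and the thresholds are assumptions.

```python
def data_quality_gate(metrics: dict, thresholds: dict) -> bool:
    """Proactive gate: refuse to launch a training run when any data-quality
    metric (duplicate rate, secret rate, contamination rate, ...) exceeds its
    budget, instead of relying on post-generation filtering later."""
    violations = {
        name: value
        for name, value in metrics.items()
        if value > thresholds.get(name, float("inf"))
    }
    if violations:
        print(f"Blocking training run; thresholds exceeded: {violations}")
        return False
    return True

# Example with assumed budgets: contamination above 1% blocks the run
ok = data_quality_gate(
    {"duplicate_rate": 0.02, "contamination_rate": 0.03},
    {"duplicate_rate": 0.05, "contamination_rate": 0.01},
)
```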

### Open Challenges
1. Complexity of causal inference: It is difficult to accurately quantify the causal relationship from data to code
2. Intertwined multiple factors: It is hard to separate multiple coexisting data issues
3. Dynamic evolution: Codebases and vulnerability patterns continue to change
4. Evaluation dilemma: It is hard to evaluate models reliably while avoiding test set contamination

### Future Directions
Reliable Code LLM development that integrates data management, real-time data quality monitoring systems, automated data repair pipelines, and cross-language / cross-domain generalization research

## Research Significance and Supporting Resources

### Research Significance
- For users: explains why the same prompt can yield code of uneven quality
- For developers: provides a systematic framework for quality improvement
- For the AI community: emphasizes that data quality is the cornerstone of model quality

### Supporting Resources
- Paper: arXiv:2605.05267
- Official documentation: SYSUSELab.github.io/From-Data-to-Code
- List of 114 selected papers
- Visualized classification system and propagation mapping diagrams
