Zing Forum

From Data to Code: A Systematic Study on Quality Issues of Code Large Language Models

The Software Engineering Laboratory of Sun Yat-sen University reviewed 114 papers, established a causal mapping framework between training data quality and generated code quality, and revealed how data defects propagate into code defects.

Tags: Code LLM · Data Quality · Code Quality · Systematic Review · Machine Learning · Causal Mapping · Software Engineering · Sun Yat-sen University
Published 2026-05-10 21:53 · Estimated read 7 min

Section 01

Introduction: Core Insights from the Systematic Study on Quality Issues of Code Large Language Models

Reviewing 114 papers, the Software Engineering Laboratory of Sun Yat-sen University established a causal mapping framework between training data quality and generated code quality, revealing how data defects propagate into code defects. The study proposes nine code quality dimensions, a classification system for training data quality issues, and 18 propagation mapping mechanisms, providing a systematic framework for improving the quality of code large language models.


Section 02

Research Background: The Neglected Upstream Data Issues of Code Large Language Models

Defects in modern Code Large Language Models (Code LLMs) often originate from upstream issues in training or fine-tuning data: low-quality training signals such as vulnerable snippets, noisy text, duplicate samples, distribution gaps, privacy leaks, and benchmark contamination. Through a systematic review of 114 related papers, the research team established the first complete causal chain from "problematic data" to "problematic code".


Section 03

Core Contributions: Code Quality Dimensions and Data Issue Classification System

Nine Code Quality Dimensions

  1. Correctness (syntax errors, logical flaws, API misuse)
  2. Security (design flaws, external vulnerabilities)
  3. Compliance (copyright infringement, privacy leaks, malicious code)
  4. Robustness (insufficient error handling, boundary condition failures)
  5. Maintainability (disorganized structure, low reusability)
  6. Understandability (poor naming conventions, lack of documentation)
  7. Efficiency (suboptimal time complexity, improper memory management)
  8. Output Conciseness (redundant logic, useless loops)
  9. Others (failure to follow instructions)

Classification of Training Data Quality Issues

  • Code attribute issues: Vulnerable code, duplicate code, low-quality code, API misuse examples
  • Non-code attribute issues: Natural language noise, distribution bias, privacy leaks, benchmark contamination
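Both issue families above can be illustrated with a minimal pre-training data filter. This is a hedged sketch, not the paper's pipeline: the secret-matching regex, the exact-hash deduplication, and the name `filter_samples` are all illustrative assumptions.

```python
import hashlib
import re

# Illustrative pattern for embedded credentials (a non-code-attribute
# privacy issue). Real pipelines use far richer detectors.
SECRET_RE = re.compile(r"(api[_-]?key|password|secret)\s*=\s*['\"][^'\"]+['\"]", re.I)

def filter_samples(samples):
    """Drop exact duplicates (a code-attribute issue) and samples that
    appear to embed credentials (a non-code-attribute privacy issue)."""
    seen, kept = set(), []
    for code in samples:
        digest = hashlib.sha256(code.strip().encode()).hexdigest()
        if digest in seen:          # duplicate code
            continue
        if SECRET_RE.search(code):  # potential privacy leak
            continue
        seen.add(digest)
        kept.append(code)
    return kept
```

A real filter would add quality scoring and vulnerability checks per sample; the point here is only that both issue families can be screened in one pass over the corpus.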

Section 04

Propagation Mapping Mechanisms: Paths from Data Defects to Code Defects

The study distills 18 typical propagation mapping mechanisms that explain how data defects turn into code defects. Representative examples:

  • Memorization effect: The model memorizes and reproduces vulnerability patterns from its training data
  • Distribution shift: Mismatch between the training data and the target scenario's distribution yields code that fits the deployment context poorly
  • Noise amplification: Minor noise in the training data is amplified into visible errors in generated code
  • Context contamination: Benchmark test data leaking into the training set inflates evaluation results

These mechanisms provide a theoretical framework for understanding code quality issues.
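The context-contamination mechanism can be made concrete with a small overlap estimator: count how many of a benchmark sample's n-grams already appear verbatim in the training corpus. A sketch under stated assumptions: the 13-gram default follows common contamination-audit practice, and `contamination_ratio` is a hypothetical name, not the paper's tooling.

```python
def ngrams(text, n=13):
    """All whitespace-tokenized n-grams of a text, as a set of strings."""
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_ratio(benchmark_sample, training_corpus, n=13):
    """Fraction of the benchmark sample's n-grams found verbatim
    in any training document. 1.0 suggests the sample leaked."""
    bench = ngrams(benchmark_sample, n)
    if not bench:
        return 0.0
    train = set()
    for doc in training_corpus:
        train |= ngrams(doc, n)
    return len(bench & train) / len(bench)
```

A high ratio on many benchmark items would explain inflated evaluation results without any genuine capability gain.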

Section 05

Detection and Governance Strategies: Full-Lifecycle Quality Assurance

Code-level Detection

Static analysis tools, dynamic execution testing, security vulnerability scanning
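As a toy stand-in for code-level static detection, the sketch below parses a generated Python snippet and flags syntax errors plus calls to `eval`/`exec` (a crude proxy for a real security scanner); the name `static_check` and the specific checks are illustrative assumptions.

```python
import ast

def static_check(code):
    """Return a list of issue strings for one generated snippet."""
    try:
        tree = ast.parse(code)
    except SyntaxError as e:
        return [f"syntax error: line {e.lineno}"]
    issues = []
    for node in ast.walk(tree):
        # Flag direct calls to dangerous builtins by name.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in {"eval", "exec"}:
                issues.append(f"dangerous call: {node.func.id}")
    return issues
```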

Data-level Detection

Data deduplication algorithms, quality scoring models, privacy leak detection
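Data-level deduplication beyond exact hashing is often approximated with set similarity; the sketch below uses token-set Jaccard similarity as a lightweight stand-in for MinHash/LSH deduplication. The 0.8 default threshold and the greedy quadratic pass are illustrative assumptions.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two documents."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def dedup(corpus, threshold=0.8):
    """Greedy pass: keep a document only if it is not too similar
    to any document already kept. O(n^2), fine for a sketch."""
    kept = []
    for doc in corpus:
        if all(jaccard(doc, k) < threshold for k in kept):
            kept.append(doc)
    return kept
```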

Code-level Mitigation

Post-generation filtering, iterative refinement, Reinforcement Learning from Human Feedback (RLHF)
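Post-generation filtering can be sketched as: take several candidate implementations sampled from the model, run each against a small test suite, and keep only the passing ones. Names are hypothetical, and real systems sandbox the execution rather than calling `exec` directly.

```python
def passes_tests(code, tests):
    """Execute a candidate and its assertion strings in a scratch namespace."""
    ns = {}
    try:
        exec(code, ns)
        for t in tests:
            exec(t, ns)
        return True
    except Exception:
        return False

def filter_candidates(candidates, tests):
    """Keep only model outputs that pass every test."""
    return [c for c in candidates if passes_tests(c, tests)]
```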

Data-level Mitigation

Data cleaning, curriculum learning, adversarial data augmentation
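As one hedged illustration of data-level curriculum learning, training samples can be ordered from easy to hard by a difficulty proxy; here the proxy is simply line count, an intentionally crude assumption standing in for richer signals such as complexity metrics or per-sample model loss.

```python
def curriculum_order(samples):
    """Sort code samples by a crude difficulty proxy (line count),
    so training sees shorter, simpler snippets first."""
    return sorted(samples, key=lambda code: len(code.splitlines()))
```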


Section 06

Methodological Shifts and Future Challenge Directions

Methodological Shifts

Quality assurance shifts from reactive post-generation filtering to proactive data-centric governance and closed-loop repair:

  • Prioritize problem prevention at the training data stage
  • Establish full-lifecycle quality monitoring
  • Focus on data quality rather than just model scale

Open Challenges

  1. Complexity of causal inference: It is difficult to accurately quantify the causal relationship from data to code
  2. Intertwined multiple factors: It is hard to separate multiple coexisting data issues
  3. Dynamic evolution: Codebases and vulnerability patterns continue to change
  4. Evaluation dilemma: Evaluating models while avoiding test set contamination remains an open problem

Future Directions

  • Reliable Code LLM development that integrates data management
  • Real-time data quality monitoring systems
  • Automated data repair pipelines
  • Cross-language and cross-domain generalization research


Section 07

Research Significance and Supporting Resources

Research Significance

  • For users: Explain why results from the same prompt can be unstable
  • For developers: Provide a systematic quality improvement framework
  • For the AI community: Emphasize that data quality is the cornerstone of model quality

Supporting Resources

  • Paper: arXiv:2605.05267
  • Official documentation: SYSUSELab.github.io/From-Data-to-Code
  • List of 114 selected papers
  • Visualized classification system and propagation mapping diagrams