Zing Forum

From Data to Code: A Systematic Study on Quality Issues of Code Large Language Models

The Software Engineering Laboratory of Sun Yat-sen University reviewed 114 papers, established a causal mapping framework between training data quality and generated code quality, and revealed how data defects propagate into code defects.

Tags: Code LLM · Data Quality · Code Quality · Systematic Review · Machine Learning · Causal Mapping · Software Engineering · Sun Yat-sen University
Published 2026-05-10 21:53 · Estimated read 7 min

Section 01

Introduction: Core Insights from the Systematic Study on Quality Issues of Code Large Language Models

Reviewing 114 papers, the Software Engineering Laboratory of Sun Yat-sen University established a causal mapping framework between training data quality and generated code quality, revealing how data defects propagate into code defects. The study proposes nine code quality dimensions, a classification system for training data quality issues, and 18 propagation mapping mechanisms, providing a systematic framework for improving the quality of code large language models.


Section 02

Research Background: The Neglected Upstream Data Issues of Code Large Language Models

Defects in modern Code Large Language Models (Code LLMs) often originate from upstream issues in training or fine-tuning data: low-quality training signals such as vulnerable snippets, noisy text, duplicate samples, distribution gaps, privacy leaks, and benchmark contamination. Through a systematic review of 114 related papers, the research team established the first complete causal chain from "problematic data" to "problematic code".


Section 03

Core Contributions: Code Quality Dimensions and Data Issue Classification System

Nine Code Quality Dimensions

  1. Correctness (syntax errors, logical flaws, API misuse)
  2. Security (design flaws, external vulnerabilities)
  3. Compliance (copyright infringement, privacy leaks, malicious code)
  4. Robustness (insufficient error handling, boundary condition failures)
  5. Maintainability (disorganized structure, low reusability)
  6. Understandability (poor naming conventions, lack of documentation)
  7. Efficiency (suboptimal time complexity, improper memory management)
  8. Output Conciseness (redundant logic, useless loops)
  9. Others (failure to follow instructions)

Classification of Training Data Quality Issues

  • Code attribute issues: Vulnerable code, duplicate code, low-quality code, API misuse examples
  • Non-code attribute issues: Natural language noise, distribution bias, privacy leaks, benchmark contamination
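Both issue families above can be illustrated with a minimal pre-training data filter. This is a hedged sketch, not the paper's pipeline: the secret-matching regex, the exact-hash deduplication, and the name `filter_samples` are all illustrative assumptions.

```python
import hashlib
import re

# Illustrative pattern for embedded credentials (a non-code-attribute
# privacy issue). Real pipelines use far richer detectors.
SECRET_RE = re.compile(r"(api[_-]?key|password|secret)\s*=\s*['\"][^'\"]+['\"]", re.I)

def filter_samples(samples):
    """Drop exact duplicates (a code-attribute issue) and samples that
    appear to embed credentials (a non-code-attribute privacy issue)."""
    seen, kept = set(), []
    for code in samples:
        digest = hashlib.sha256(code.strip().encode()).hexdigest()
        if digest in seen:          # duplicate code
            continue
        if SECRET_RE.search(code):  # potential privacy leak
            continue
        seen.add(digest)
        kept.append(code)
    return kept
```

A real filter would add quality scoring and vulnerability checks per sample; the point here is only that both issue families can be screened in one pass over the corpus.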

Section 04

Propagation Mapping Mechanisms: Paths from Data Defects to Code Defects

The study distills 18 typical propagation mapping mechanisms that explain how data defects turn into code defects. Representative examples:

  • Memorization effect: The model memorizes and reproduces vulnerability patterns from its training data
  • Distribution shift: Mismatch between the training data and the target scenario's distribution yields code that fits the deployment context poorly
  • Noise amplification: Minor noise in the training data is amplified into visible errors in generated code
  • Context contamination: Benchmark test data leaking into the training set inflates evaluation results

These mechanisms provide a theoretical framework for understanding code quality issues.
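The context-contamination mechanism can be made concrete with a small overlap estimator: count how many of a benchmark sample's n-grams already appear verbatim in the training corpus. A sketch under stated assumptions: the 13-gram default follows common contamination-audit practice, and `contamination_ratio` is a hypothetical name, not the paper's tooling.

```python
def ngrams(text, n=13):
    """All whitespace-tokenized n-grams of a text, as a set of strings."""
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_ratio(benchmark_sample, training_corpus, n=13):
    """Fraction of the benchmark sample's n-grams found verbatim
    in any training document. 1.0 suggests the sample leaked."""
    bench = ngrams(benchmark_sample, n)
    if not bench:
        return 0.0
    train = set()
    for doc in training_corpus:
        train |= ngrams(doc, n)
    return len(bench & train) / len(bench)
```

A high ratio on many benchmark items would explain inflated evaluation results without any genuine capability gain.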

Section 05

Detection and Governance Strategies: Full-Lifecycle Quality Assurance

Code-level Detection

Static analysis tools, dynamic execution testing, security vulnerability scanning
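As a toy stand-in for code-level static detection, the sketch below parses a generated Python snippet and flags syntax errors plus calls to `eval`/`exec` (a crude proxy for a real security scanner); the name `static_check` and the specific checks are illustrative assumptions.

```python
import ast

def static_check(code):
    """Return a list of issue strings for one generated snippet."""
    try:
        tree = ast.parse(code)
    except SyntaxError as e:
        return [f"syntax error: line {e.lineno}"]
    issues = []
    for node in ast.walk(tree):
        # Flag direct calls to dangerous builtins by name.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in {"eval", "exec"}:
                issues.append(f"dangerous call: {node.func.id}")
    return issues
```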

Data-level Detection

Data deduplication algorithms, quality scoring models, privacy leak detection
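Data-level deduplication beyond exact hashing is often approximated with set similarity; the sketch below uses token-set Jaccard similarity as a lightweight stand-in for MinHash/LSH deduplication. The 0.8 default threshold and the greedy quadratic pass are illustrative assumptions.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two documents."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def dedup(corpus, threshold=0.8):
    """Greedy pass: keep a document only if it is not too similar
    to any document already kept. O(n^2), fine for a sketch."""
    kept = []
    for doc in corpus:
        if all(jaccard(doc, k) < threshold for k in kept):
            kept.append(doc)
    return kept
```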

Code-level Mitigation

Post-generation filtering, iterative refinement, Reinforcement Learning from Human Feedback (RLHF)
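Post-generation filtering can be sketched as: take several candidate implementations sampled from the model, run each against a small test suite, and keep only the passing ones. Names are hypothetical, and real systems sandbox the execution rather than calling `exec` directly.

```python
def passes_tests(code, tests):
    """Execute a candidate and its assertion strings in a scratch namespace."""
    ns = {}
    try:
        exec(code, ns)
        for t in tests:
            exec(t, ns)
        return True
    except Exception:
        return False

def filter_candidates(candidates, tests):
    """Keep only model outputs that pass every test."""
    return [c for c in candidates if passes_tests(c, tests)]
```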

Data-level Mitigation

Data cleaning, curriculum learning, adversarial data augmentation
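As one hedged illustration of data-level curriculum learning, training samples can be ordered from easy to hard by a difficulty proxy; here the proxy is simply line count, an intentionally crude assumption standing in for richer signals such as complexity metrics or per-sample model loss.

```python
def curriculum_order(samples):
    """Sort code samples by a crude difficulty proxy (line count),
    so training sees shorter, simpler snippets first."""
    return sorted(samples, key=lambda code: len(code.splitlines()))
```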


Section 06

Methodological Shifts and Future Challenge Directions

Methodological Shifts

Quality assurance shifts from reactive post-generation filtering to proactive data-centric governance and closed-loop repair:

  • Prioritize problem prevention at the training data stage
  • Establish full-lifecycle quality monitoring
  • Focus on data quality rather than just model scale

Open Challenges

  1. Complexity of causal inference: It is difficult to accurately quantify the causal relationship from data to code
  2. Intertwined multiple factors: It is hard to separate multiple coexisting data issues
  3. Dynamic evolution: Codebases and vulnerability patterns continue to change
  4. Evaluation dilemma: Evaluating models while avoiding test set contamination remains an open problem

Future Directions

  • Reliable Code LLM development that integrates data management
  • Real-time data quality monitoring systems
  • Automated data repair pipelines
  • Cross-language and cross-domain generalization research


Section 07

Research Significance and Supporting Resources

Research Significance

  • For users: Explain why results from the same prompt can be unstable
  • For developers: Provide a systematic quality improvement framework
  • For the AI community: Emphasize that data quality is the cornerstone of model quality

Supporting Resources

  • Paper: arXiv:2605.05267
  • Official documentation: SYSUSELab.github.io/From-Data-to-Code
  • List of 114 selected papers
  • Visualized classification system and propagation mapping diagrams