Beyond Summaries: LLM-Driven Structured Annotation of Code Changes

The study proposes a two-stage pipeline for structured annotation of code patches that identifies change types such as renaming, moving, and logic modification. The best configuration reaches 84% recall and 81% precision, suggesting a new direction for code review automation.

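Code review, code change analysis, structured annotation, LLM, few-shot learning, software engineering, diff analysis, code classification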
Published 2026-05-26 01:56 | Recent activity 2026-05-26 12:57 | Estimated read 5 min

Section 01

[Introduction] LLM-Driven Structured Annotation of Code Changes: Addressing the Scalability Challenges of Code Reviews

This paper proposes a two-stage pipeline that uses large language models (LLMs) to annotate code patches in a structured way, identifying change types (e.g., renaming, moving, logic modification) and their relational attributes. The best configuration (GPT-4 with optimized prompts) achieves 84% recall and 81% precision. The structured output opens up code review automation scenarios such as intelligent routing and priority sorting.

Section 02

Background: Scalability Challenges of Code Reviews and Limitations of Existing Methods

Code review is a key practice in software engineering, but it is hard to scale: patch volumes are surging, changes are growing more complex, and AI-assisted programming adds to the review burden. Existing LLM methods have two main limitations. Summaries vary in quality, which hinders automated decision-making, and generated review comments are prone to false positives and fail to capture the overall intent of a change. This paper proposes structured annotation as a new direction.

Section 03

Methodology: Two-Stage Structured Annotation Pipeline and Few-Shot Prompting Strategy

The pipeline has two stages:

1. Hunk-level annotation: split each patch into diff hunks and classify every hunk as Rename, Move, Logic Change, etc.
2. Relationship and attribute refinement: identify structural relationships such as rename propagation and dependencies, along with semantic attributes such as Breaking Change.

Few-shot prompting: no fine-tuning is required. Cross-language adaptation is achieved through context construction (full diff + sliding window), example selection (2-3 boundary examples per category), and structured JSON output.
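
The hunk-level stage can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes git-style unified diffs, and the names `Hunk`, `split_hunks`, `build_prompt`, `parse_annotation`, the `Other` label, and the prompt wording are all hypothetical. No model is called; the sketch only prepares the prompt and validates a JSON reply.

```python
import json
import re
from dataclasses import dataclass

# Hypothetical label set: the source names Rename, Move, and Logic Change "etc.";
# "Other" is an assumed catch-all.
LABELS = ("Rename", "Move", "Logic Change", "Other")

# Matches unified-diff hunk headers such as "@@ -10,2 +10,3 @@".
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@")


@dataclass
class Hunk:
    path: str
    header: str
    lines: list[str]


def split_hunks(patch: str) -> list[Hunk]:
    """Split a git-style unified diff into hunks, tracking each hunk's file."""
    hunks: list[Hunk] = []
    path = ""
    current = None
    for line in patch.splitlines():
        if line.startswith("diff --git "):
            current = None  # a new file section starts
        elif current is None and line.startswith("+++ "):
            path = line[4:].removeprefix("b/")
        elif HUNK_HEADER.match(line):
            current = Hunk(path, line, [])
            hunks.append(current)
        elif current is not None:
            current.lines.append(line)
    return hunks


def build_prompt(hunk: Hunk, examples: list[tuple[str, str]]) -> str:
    """Assemble a few-shot prompt that asks for a single JSON object."""
    parts = [
        "Classify the diff hunk into exactly one of: " + ", ".join(LABELS) + ".",
        'Answer with JSON only: {"label": "<label>", "reason": "<short reason>"}',
    ]
    for text, label in examples:
        parts.append("Hunk:\n" + text + "\nAnswer: " + json.dumps({"label": label}))
    body = "\n".join(hunk.lines)
    parts.append(f"File: {hunk.path}\nHunk:\n{hunk.header}\n{body}\nAnswer:")
    return "\n\n".join(parts)


def parse_annotation(raw: str) -> dict:
    """Parse the model reply and reject labels outside the known set."""
    data = json.loads(raw)
    if data.get("label") not in LABELS:
        raise ValueError(f"unexpected label: {data.get('label')!r}")
    return data


PATCH = """\
diff --git a/util.py b/util.py
--- a/util.py
+++ b/util.py
@@ -1,2 +1,2 @@
-def calc(x):
+def compute_total(x):
     return x * 2
@@ -10,1 +10,1 @@
-print(calc(3))
+print(compute_total(3))
"""

hunks = split_hunks(PATCH)
prompt = build_prompt(hunks[0], [("-old_name = 1\n+new_name = 1", "Rename")])
```

Validating the reply against a fixed label set is one way to keep the structured JSON output usable by downstream automation; a reply with an unknown label raises instead of silently passing through.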

Section 04

Experimental Evidence: Model Performance Against Human-Annotated Benchmarks

A benchmark dataset combining natural and synthetic patches was constructed to test GPT-4, Claude 3, Llama 3, and Code Llama against human annotations. The best configuration (GPT-4) achieved 84% recall, 81% precision, and an F1 score of 82.5%. Fine-grained analysis showed that Rename recognition exceeded 90% F1, Logic Change was prone to confusion with other types, context showed significant benefits at lengths within 8K tokens, and GPT-4 and Claude 3 outperformed the open-source models.
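
As a quick consistency check, the reported F1 follows from the reported precision and recall, since F1 is their harmonic mean. The helper name `f1_score` below is just for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


reported = f1_score(precision=0.81, recall=0.84)
print(f"{reported:.3f}")  # prints 0.825
```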

Section 05

Application Value: Optimization Scenarios for Code Review Workflows

1. Intelligent routing: Assign reviewers based on change type (security changes → security team).
2. Priority sorting: Prioritize high-risk changes (Breaking Change + core module → high priority).
3. Review assistance: Provide structured prompts (e.g., confirm rename completeness).
4. Change analysis: Team-level insights (proportion of refactoring changes this week).
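
The routing, priority, and analysis scenarios above could be driven by simple rules over the structured annotations. The sketch below is an assumption-laden illustration: the `Annotation` shape, the `Security` attribute, the `core/` path prefix, the team names, and the choice to count Rename and Move as refactoring are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    path: str
    change_type: str  # e.g. "Rename", "Move", "Logic Change"
    attributes: set[str] = field(default_factory=set)  # e.g. {"Breaking Change"}


CORE_PREFIXES = ("core/",)            # hypothetical definition of "core module"
REFACTORING_TYPES = {"Rename", "Move"}  # assumed to count as refactoring


def route(ann: Annotation) -> str:
    """Pick a reviewer group; security changes go to the security team."""
    if "Security" in ann.attributes:
        return "security-team"
    return "default-reviewers"


def priority(ann: Annotation) -> str:
    """Breaking Change in a core module is high priority, otherwise normal."""
    is_core = ann.path.startswith(CORE_PREFIXES)
    if "Breaking Change" in ann.attributes and is_core:
        return "high"
    return "normal"


def refactoring_share(annotations: list[Annotation]) -> float:
    """Fraction of annotated changes that are refactorings."""
    if not annotations:
        return 0.0
    refactors = sum(a.change_type in REFACTORING_TYPES for a in annotations)
    return refactors / len(annotations)


week = [
    Annotation("core/api.py", "Logic Change", {"Breaking Change"}),
    Annotation("auth/token.py", "Logic Change", {"Security"}),
    Annotation("util/names.py", "Rename"),
]
```

Here `priority(week[0])` is `"high"`, `route(week[1])` is `"security-team"`, and `refactoring_share(week)` is one third.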

Section 06

Limitations and Future Work: Current Challenges and Improvement Directions

Limitations: annotation consistency varies, boundaries between change types are ambiguous, context is hard to manage for large patches, and LLM invocation costs are high. Future directions: combining static analysis with LLMs, active learning to improve annotations, temporal modeling that accounts for historical patterns, and multi-modal expansion that integrates CI results and other information.