# Beyond Summaries: LLM-Driven Structured Annotation of Code Changes

> The study proposes a two-stage pipeline for structured annotation of code patches that identifies change types such as renaming, moving, and logic modification. The best configuration achieves 84% recall and 81% precision, suggesting a new direction for code review automation.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-25T17:56:46.000Z
- Last activity: 2026-05-26T04:57:02.778Z
- Popularity: 131.0
- Keywords: code review, code change analysis, structured annotation, LLM, few-shot learning, software engineering, diff analysis, code classification
- Page link: https://www.zingnex.cn/en/forum/thread/llm-arxiv-2605-26100v1
- Canonical: https://www.zingnex.cn/forum/thread/llm-arxiv-2605-26100v1
- Markdown source: floors_fallback

---

## [Introduction] LLM-Driven Structured Annotation of Code Changes: Addressing the Scalability Challenges of Code Reviews

This paper proposes a two-stage pipeline that uses large language models (LLMs) to annotate code patches in a structured way, identifying change types (e.g., renaming, moving, logic modification) and their relational attributes. The best configuration (GPT-4 + optimized prompts) achieves 84% recall and 81% precision. Structured annotations of this kind open up code review automation scenarios such as intelligent routing and priority sorting.

## Background: Scalability Challenges of Code Reviews and Limitations of Existing Methods

Code review is a key practice in software engineering, but it is hard to scale: patch volumes are surging, changes are growing more complex, and AI-assisted programming adds to the review burden. Existing LLM-based methods have two main limitations. Generated summaries vary in quality, which hinders automated decision-making, and generated review comments are prone to false positives and fail to capture the overall intent of a change. This paper proposes structured annotation as a new direction.

## Methodology: Two-Stage Structured Annotation Pipeline and Few-Shot Prompting Strategy

**Two-Stage Pipeline**:

1. Hunk-level annotation: split each patch into diff hunks and classify every hunk as Rename, Move, Logic Change, etc.
2. Relationship and attribute refinement: identify structural relationships such as rename propagation and dependencies, along with semantic attributes such as Breaking Change.

**Few-Shot Prompting**: No fine-tuning is required. Cross-language adaptation is achieved through context construction (full diff + sliding window), example selection (2-3 boundary examples per category), and structured JSON output.
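The first stage described above can be sketched roughly as follows. This is a minimal illustration rather than the paper's implementation: it assumes git-style unified diffs, and the label set (including the `Other` fallback), the function names `split_hunks`, `build_prompt`, and `parse_annotation`, the prompt wording, and the `{"label": ...}` JSON shape are all hypothetical. No model is actually called; the reply is treated as a plain string.

```python
import json
import re
from dataclasses import dataclass

# Hypothetical label set: the source names Rename, Move, and Logic Change "etc.";
# "Other" is an assumed fallback category.
LABELS = ["Rename", "Move", "Logic Change", "Other"]

# Matches unified-diff hunk headers such as "@@ -1,2 +1,2 @@".
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


@dataclass
class Hunk:
    path: str
    header: str
    lines: list[str]


def split_hunks(patch: str) -> list[Hunk]:
    """Split a git-style unified diff into hunks, remembering each hunk's file."""
    hunks: list[Hunk] = []
    path = ""
    current = None
    for line in patch.splitlines():
        if HUNK_HEADER.match(line):
            current = Hunk(path=path, header=line, lines=[])
            hunks.append(current)
        elif current is not None and line[:1] in ("+", "-", " "):
            current.lines.append(line)  # added, removed, or context line
        elif line.startswith("\\"):
            continue  # "\ No newline at end of file" marker
        elif line.startswith("+++ "):
            path = line[4:].removeprefix("b/")  # "+++ b/src/app.py" -> "src/app.py"
        else:
            current = None  # "diff --git", "index", "--- a/..." close the hunk
    return hunks


def build_prompt(hunk: Hunk, examples: list[dict]) -> str:
    """Assemble a few-shot prompt that asks for one JSON object per hunk."""
    shots = []
    for ex in examples:
        answer = json.dumps({"label": ex["label"]})
        shots.append("Hunk:\n" + ex["hunk"] + "\nAnnotation:\n" + answer)
    instruction = (
        "Classify the diff hunk as one of: " + ", ".join(LABELS) + ". "
        'Reply with JSON of the form {"label": "..."}.'
    )
    target = "Hunk:\n" + hunk.header + "\n" + "\n".join(hunk.lines) + "\nAnnotation:\n"
    return "\n\n".join([instruction, *shots, target])


def parse_annotation(reply: str) -> str:
    """Validate a model reply; fall back to "Other" on malformed output."""
    try:
        label = json.loads(reply).get("label")
    except (json.JSONDecodeError, AttributeError):
        return "Other"
    return label if label in LABELS else "Other"
```

Validating the reply against a fixed label set is one way to keep downstream automation from acting on free-form or malformed model output. The second stage (relationships and attributes) would consume these per-hunk labels and is not shown here.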

## Experimental Evidence: Model Performance Against Human-Annotated Benchmarks

A human-annotated benchmark dataset combining natural and synthetic patches was constructed to evaluate GPT-4, Claude 3, Llama 3, and Code Llama. The best configuration (GPT-4) achieved 84% recall, 81% precision, and an F1 score of 82.5%. Fine-grained analysis showed that:

- Rename recognition reached an F1 above 90%.
- Logic Change was the category most prone to confusion.
- Supplying context brought significant gains for context lengths within 8K tokens.
- GPT-4 and Claude 3 outperformed the open-source models.
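The reported F1 score follows from the reported precision and recall, since F1 is their harmonic mean. A small check, where the function name `f1_score` is simply an illustrative choice:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported for the best configuration: 81% precision, 84% recall.
print(round(100 * f1_score(0.81, 0.84), 1))  # 82.5
```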

## Application Value: Optimization Scenarios for Code Review Workflows

1. Intelligent routing: assign reviewers based on change type (security changes → security team).
2. Priority sorting: prioritize high-risk changes (Breaking Change + core module → high priority).
3. Review assistance: provide structured prompts (e.g., confirm rename completeness).
4. Change analysis: offer team-level insights (e.g., the proportion of refactoring changes this week).
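As a rough illustration of how structured annotations could drive the routing, prioritization, and analysis scenarios above, the sketch below applies simple rules to annotation records. Everything here is an assumption made for the example: the `Annotation` fields, the `Security` attribute, the team names, the `core/` path prefix, and the choice to count Rename and Move as refactoring.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    change_type: str  # e.g. "Rename", "Move", "Logic Change"
    path: str         # file touched by the change
    attributes: set[str] = field(default_factory=set)  # e.g. {"Breaking Change"}


CORE_PREFIXES = ("core/",)              # hypothetical location of core modules
REFACTORING_TYPES = {"Rename", "Move"}  # assumed definition of "refactoring"


def route(ann: Annotation) -> str:
    """Pick a reviewer group; security-related changes go to the security team."""
    if "Security" in ann.attributes:
        return "security-team"
    return "default-reviewers"


def priority(ann: Annotation) -> str:
    """Breaking changes in core modules are reviewed first."""
    if "Breaking Change" in ann.attributes and ann.path.startswith(CORE_PREFIXES):
        return "high"
    return "normal"


def refactoring_share(annotations: list[Annotation]) -> float:
    """Fraction of annotated changes whose type counts as refactoring."""
    if not annotations:
        return 0.0
    hits = sum(a.change_type in REFACTORING_TYPES for a in annotations)
    return hits / len(annotations)


week = [
    Annotation("Rename", "utils/strings.py"),
    Annotation("Logic Change", "core/auth.py", {"Breaking Change", "Security"}),
]
print(route(week[1]), priority(week[1]), refactoring_share(week))
# security-team high 0.5
```

Because the rules read typed fields rather than free-text summaries, they stay deterministic and auditable, which is the kind of automated decision-making that inconsistent summaries make difficult.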

## Limitations and Future Work: Current Challenges and Improvement Directions

**Limitations**: Differences in annotation consistency, ambiguous boundaries between change types, difficulty managing context for large-scale patches, and high LLM invocation costs.

**Future Directions**: Combining static analysis with LLMs, using active learning to improve annotations, temporal modeling that accounts for historical patterns, and multi-modal expansion that integrates CI results and other information.