PEFT-Arena: Re-examining Parameter-Efficient Fine-Tuning from the Stability-Plasticity Perspective

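Tags: PEFT, parameter-efficient fine-tuning, LoRA, OFT, large language models, stability-plasticity, model forgetting, orthogonal fine-tuning, LLM fine-tuning, transfer learning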
Published 2026-05-28 01:59 · Recent activity 2026-05-29 10:52 · Estimated read 8 min

Section 01

PEFT-Arena: Re-examining Parameter-Efficient Fine-Tuning from the Stability-Plasticity Perspective

PEFT-Arena is a benchmark jointly proposed by researchers at The Chinese University of Hong Kong, Westlake University, and the Max Planck Institute for Intelligent Systems. It is presented as the first systematic evaluation of how parameter-efficient fine-tuning (PEFT) methods trade off target-task adaptation against retention of pre-trained capabilities, and its results highlight the advantage of Orthogonal Fine-Tuning (OFT) on the stability-plasticity frontier. The study addresses a gap in the prevailing PEFT evaluation paradigm, which ignores retention of pre-trained capabilities, and offers a new perspective for understanding PEFT methods.

Section 02

Background and Motivation: Blind Spots in PEFT Evaluation and the Stability-Plasticity Dilemma

Parameter-efficient fine-tuning (PEFT) methods such as LoRA, adapters, and prompt tuning have become the de facto standard for adapting large language models, promising downstream adaptation with minimal computational overhead. However, current evaluations focus only on target-task accuracy and ignore retention of pre-trained capabilities: a model may forget general abilities, such as instruction following and commonsense reasoning, while adapting to a new task. This is the "stability-plasticity dilemma" known from cognitive science, where plasticity is the ability to learn new domains and stability is the degree to which pre-trained capabilities are retained.

Section 03

PEFT-Arena Benchmark Design

PEFT-Arena, proposed by teams from The Chinese University of Hong Kong, Westlake University, and the Max Planck Institute for Intelligent Systems, is presented as the first comprehensive benchmark to evaluate target-task performance and general-capability retention together. The benchmark covers:

  • Model families: Qwen2.5-7B, Llama3.2-3B-Instruct
  • Training paradigms: Supervised Fine-Tuning (SFT) and GRPO-based reinforcement learning with verifiable rewards (RLVR)
  • Task domains: target tasks (mathematical reasoning, medical QA) and general capabilities (IFEval instruction following, Natural Questions (NQ) question answering, BIG-Bench Hard (BBH))

Each configuration reports the target-task accuracy and the average score across the general-capability benchmarks.
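
The coverage listed above can be read as an evaluation grid. The following is a minimal sketch and not the benchmark's actual code: it assumes every model is crossed with every training paradigm and target task (the article does not confirm a full cross product), and the names `EVAL_GRID` and `summarize`, as well as the example scores, are hypothetical.

```python
from itertools import product
from statistics import mean

# Axes as listed in the article.
MODELS = ["Qwen2.5-7B", "Llama3.2-3B-Instruct"]
PARADIGMS = ["SFT", "RLVR"]
TARGET_TASKS = ["math", "medical_qa"]
GENERAL_BENCHMARKS = ["IFEval", "NQ", "BBH"]

# Assumption: a full cross product of the three axes.
EVAL_GRID = list(product(MODELS, PARADIGMS, TARGET_TASKS))


def summarize(target_accuracy, general_scores):
    """Return the two numbers reported per configuration."""
    missing = set(GENERAL_BENCHMARKS) - set(general_scores)
    if missing:
        raise ValueError(f"missing general benchmarks: {sorted(missing)}")
    general_avg = mean(general_scores[name] for name in GENERAL_BENCHMARKS)
    return {"target": target_accuracy, "general_avg": general_avg}


# Placeholder scores, for illustration only (not results from the paper).
report = summarize(40.0, {"IFEval": 50.0, "NQ": 30.0, "BBH": 40.0})
print(len(EVAL_GRID), report)
```

Reporting the pair (target, general average) rather than a single number is what lets a method that wins on the target task but loses general ability show up as a trade-off instead of a plain improvement.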

Section 04

Key Findings: Stability-Plasticity Performance of PEFT Methods

Experiments reveal key phenomena:

  1. Full Fine-Tuning Cost: Target-task performance improves but general capabilities drop sharply (e.g., Qwen math SFT: target accuracy rises from 35.30% to 50.63%, while the general-capability average falls from 46.97% to 34.22%).
  2. OFT Advantage: At comparable parameter counts, OFT reaches similar target performance with minimal loss of general capabilities (OFT-block32 in Qwen math SFT: target 46.93%, general capabilities down by only 2.6 percentage points).
  3. Catastrophic Failure of PiSSA: In some configurations, target performance does not improve while general capabilities are severely damaged (PiSSA in Llama math SFT: general capabilities fall from 53.03% to 9.74%).
  4. RLVR vs. SFT Differences: RLVR keeps general capabilities relatively intact while still improving target performance.
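
To make the size of these trade-offs explicit, the short sketch below computes percentage-point changes from the figures quoted in the findings above. It uses only numbers that appear in the article; the dictionary layout and the `delta` helper are illustrative choices, and entries the article does not quote are left as `None` rather than filled in.

```python
# Figures quoted in the article (percent). None means "not quoted".
REPORTED = {
    "Qwen math SFT, full fine-tuning": {
        "target": (35.30, 50.63),
        "general": (46.97, 34.22),
    },
    "Llama math SFT, PiSSA": {
        "target": None,
        "general": (53.03, 9.74),
    },
}


def delta(pair):
    """Change in percentage points, or None when the pair was not quoted."""
    if pair is None:
        return None
    before, after = pair
    return round(after - before, 2)


changes = {
    name: {metric: delta(pair) for metric, pair in scores.items()}
    for name, scores in REPORTED.items()
}

for name, change in changes.items():
    print(name, change)
```

Full fine-tuning on Qwen math SFT gains 15.33 points on the target task and loses 12.75 points of general capability, while PiSSA on Llama math SFT loses 43.29 points. Assuming OFT-block32 starts from the same 35.30% baseline, its 46.93% target accuracy is an 11.63-point gain for the quoted 2.6-point general loss.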

Section 05

Mechanism Analysis from a Geometric Perspective

The differences between PEFT methods are explained from two geometric perspectives:

  • Weight Space Structure: OFT updates weights through orthogonal transformations, which avoids interfering with the directions that encode pre-trained knowledge; low-rank methods may introduce destructive perturbations along key singular-vector directions.
  • Activation Space Stability: A "Capability-Conditioned Drift" metric is introduced to measure representation change. The degree of forgetting is found to track non-isometric representation distortion: general capabilities are lost most severely when the geometric structure of the activation space is distorted.
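
The isometry argument can be illustrated on a toy layer. The sketch below is not the paper's Capability-Conditioned Drift metric (its definition is not given here); it only checks, under simplifying assumptions, that a multiplicative orthogonal update W' = R W leaves the distance between two output activations unchanged, whereas an additive rank-1 update W' = W + b aᵀ generally does not. The 2×2 weights, the inputs, and the helper names are all made up for illustration.

```python
import math


def matvec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    """Matrix-matrix product for nested lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def dist(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rotation(theta):
    """A 2x2 rotation matrix, the simplest orthogonal matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


W = [[1.0, 0.5], [-0.3, 2.0]]      # toy "pre-trained" weight
x1, x2 = [1.0, 0.0], [0.2, 1.0]    # two toy inputs

# Orthogonal-style update: W' = R W.
W_orth = matmul(rotation(0.4), W)

# Additive rank-1 update: W' = W + b a^T.
b, a = [1.0, -1.0], [0.5, 0.5]
W_lr = [[W[i][j] + b[i] * a[j] for j in range(2)] for i in range(2)]

d_base = dist(matvec(W, x1), matvec(W, x2))
d_orth = dist(matvec(W_orth, x1), matvec(W_orth, x2))
d_lr = dist(matvec(W_lr, x1), matvec(W_lr, x2))
print(d_base, d_orth, d_lr)
```

In this toy case the rotated layer reproduces the baseline distance up to floating-point error, while the rank-1 update changes it. That is the sense in which an orthogonal update is isometric at the layer output and an unconstrained additive update need not be.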

Section 06

Path Backtracking Strategy: Finding a Better Operating Point

The study found that the final SFT checkpoint often "overshoots" the optimal trade-off point. Interpolating along the fine-tuning path yields intermediate models that balance target tasks and general capabilities better than the final model does. Based on this, the "Path Backtracking" strategy is proposed: instead of using the final model, select a Pareto-optimal checkpoint along the optimization trajectory. The strategy adds no training cost and significantly improves the model's overall performance.
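
One way to picture the selection step is the sketch below. It is an assumption-laden illustration, not the authors' procedure: it assumes linear interpolation between pre-trained and fine-tuned parameters, evaluates each point on (target, general) scores, and picks the Pareto-optimal point with the best target score whose general score stays within a chosen tolerance of the starting point. The function names, the tolerance rule, and all scores are hypothetical placeholders.

```python
def interpolate(theta_pre, theta_ft, alpha):
    """Linear interpolation between pre-trained and fine-tuned parameters.

    alpha=0 returns the pre-trained values, alpha=1 the fine-tuned ones.
    """
    return {
        name: [(1 - alpha) * p + alpha * f
               for p, f in zip(theta_pre[name], theta_ft[name])]
        for name in theta_pre
    }


def pareto_front(points):
    """Indices of points not dominated on (target, general); higher is better."""
    front = []
    for i, (t_i, g_i) in enumerate(points):
        dominated = any(
            t_j >= t_i and g_j >= g_i and (t_j > t_i or g_j > g_i)
            for j, (t_j, g_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


def backtrack(points, max_general_drop):
    """Pick the frontier point with the best target score whose general
    score stays within max_general_drop of the first (pre-trained) point."""
    floor = points[0][1] - max_general_drop
    candidates = [i for i in pareto_front(points) if points[i][1] >= floor]
    return max(candidates, key=lambda i: points[i][0])


# Toy (target, general) scores along a path; placeholders, not paper results.
path_scores = [
    (35.0, 47.0), (42.0, 46.0), (47.0, 44.0),
    (49.0, 40.0), (48.5, 36.0), (50.0, 34.0),
]

front = pareto_front(path_scores)
chosen = backtrack(path_scores, max_general_drop=3.0)
midpoint = interpolate({"w": [0.0, 2.0]}, {"w": [1.0, 4.0]}, alpha=0.5)
print(front, chosen, midpoint)
```

With these toy numbers the final point has the highest target score but the lowest general score, and the selected point is an earlier one on the frontier. Because the candidates are evaluated after training, the only extra cost in this sketch is evaluation, which matches the article's claim that no additional training is needed.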

Section 07

Practical Implications and Future Directions

Implications of the study for AI practice:

  1. PEFT evaluation needs to cover both target performance and general-capability retention; a single metric is easily misleading.
  2. OFT's advantage in the stability-plasticity trade-off makes it a preferred choice for resource-constrained scenarios.
  3. The path backtracking strategy provides a plug-and-play improvement for existing fine-tuning pipelines, improving how efficiently foundation models can be reused.

The team has open-sourced the code and benchmark data, and plans to further explore the theoretical basis of PEFT and topics related to model reliability and safety.