Section 01
[Introduction] Using the Same Optimizer for Pre-training and Fine-tuning Reduces Knowledge Forgetting
Research finds that using the same optimizer for pre-training and fine-tuning achieves a better learning-forgetting trade-off, outperforming parameter-efficient fine-tuning methods such as LoRA. It also reveals that optimizers have a regularizing effect on model activations, and finds that fine-tuning with the Muon optimizer on reasoning tasks tends toward rote memorization.