DAMF: Addressing Fine-tuning Failure of Vision-Language Models Under Extreme Physical Domain Transfer

When vision-language models face extreme physical domain transfer such as underwater imaging, traditional joint fine-tuning is not only ineffective but also actively degrades model performance. This article introduces the two-stage optimization protocol DAMF, which isolates visual realignment and controlled multimodal coupling to nearly triple BLEU-4 scores in underwater image captioning tasks.

Tags: Vision-Language Models, Domain Transfer, Multimodal Learning, BLIP, Underwater Imagery, Fine-tuning Optimization, ECCV 2026
Published 2026-04-25 18:31 · Recent activity 2026-04-25 18:51 · Estimated read 7 min

Section 01

Introduction: DAMF Addresses VLM Fine-tuning Failure Under Extreme Physical Domain Transfer

This article focuses on the fine-tuning failure of vision-language models (e.g., BLIP) in extreme physical domain transfer (such as underwater image captioning) and proposes the two-stage optimization protocol DAMF. By isolating visual realignment and controlled multimodal coupling, this method nearly triples BLEU-4 scores in underwater image captioning tasks, and related results have been accepted by ECCV 2026.


Section 02

Background: Domain Transfer Dilemma of Pre-trained VLMs

Vision-language models (VLMs) like BLIP, pre-trained on natural images, adapt well to similar domains via joint fine-tuning but fail under extreme physical domain transfer such as underwater imagery. The underwater environment has unique optical properties, including wavelength-dependent attenuation, scattering, turbidity, and color distortion, that make its visual statistics fundamentally different from those of terrestrial images. When BLIP is fine-tuned in the standard way, training loss decreases but caption quality stagnates or even deteriorates.


Section 03

Key Finding: Naive Fine-tuning Actively Impairs Performance

The study found that naive joint fine-tuning is not only ineffective but actively degrades model performance. The cause is asymmetric adaptation of the visual and text components driven by high-variance gradients: as the visual encoder adapts to underwater features, high-variance gradients from misaligned visual embeddings propagate into the language decoder, disrupting its pre-trained language structure. This manifests in three unstable modes: early generalization divergence, metric-loss decoupling, and optimization collapse. Experiments show the pre-trained BLIP baseline scores 0.108 BLEU-4, which drops to 0.078 after naive fine-tuning, worse than no adaptation at all.
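The asymmetric gradient flow described above can be made visible by logging per-module gradient norms during a training step. The sketch below uses a toy two-module stand-in, not the actual BLIP architecture; the module names and sizes are illustrative only.

```python
# Sketch: per-module gradient norms after one backward pass.
# High, uneven norms on the visual branch relative to the language branch
# would signal the asymmetric adaptation the paper describes.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins for a VLM's components (not the real BLIP modules).
model = nn.ModuleDict({
    "visual_encoder": nn.Linear(16, 8),
    "language_decoder": nn.Linear(8, 4),
})

x = torch.randn(32, 16)
target = torch.randn(32, 4)
out = model["language_decoder"](model["visual_encoder"](x))
loss = nn.functional.mse_loss(out, target)
loss.backward()

# Aggregate the gradient norm of each module's parameters.
for name, module in model.items():
    norm = torch.norm(torch.stack([p.grad.norm() for p in module.parameters()]))
    print(f"{name}: grad norm = {norm:.4f}")
```

In a real run one would log these norms per step; a sustained gap between the two branches is the symptom that DAMF's staged freezing is designed to prevent.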


Section 04

DAMF Method: Two-Stage Domain-Aware Multimodal Fine-tuning

DAMF is a two-stage optimization protocol that requires no architectural changes or new loss functions:

  1. Visual realignment stage: freeze the language decoder and update only the visual encoder and cross-modal projection layer (2 epochs, learning rate 5e-5), so that high-variance gradients cannot disturb the pre-trained language structure.
  2. Controlled multimodal coupling stage: unfreeze all parameters and jointly optimize at a low learning rate (3 epochs, 1e-5), constraining cross-modal gradient variance and restoring cross-modal grounding.

Key insight: for extreme domain transfer, the structure of the optimization, rather than the learning rate or model capacity, is what matters.
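The two stages above can be sketched as follows. This is a minimal illustration using a toy model: the `visual_encoder`, `projection`, and `text_decoder` attribute names are assumptions standing in for the real BLIP modules, and the training loops are elided.

```python
# Sketch of the DAMF two-stage schedule on a toy stand-in model.
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    """Minimal stand-in for BLIP's three components."""
    def __init__(self):
        super().__init__()
        self.visual_encoder = nn.Linear(16, 8)
        self.projection = nn.Linear(8, 8)    # cross-modal projection layer
        self.text_decoder = nn.Linear(8, 4)  # language decoder

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

model = ToyVLM()

# Stage 1: visual realignment (2 epochs, lr 5e-5).
# Freeze the language decoder so misaligned visual gradients cannot reach it.
set_trainable(model.text_decoder, False)
stage1_params = [p for p in model.parameters() if p.requires_grad]
opt1 = torch.optim.AdamW(stage1_params, lr=5e-5)
# ... run 2 epochs of captioning loss on underwater images here ...

# Stage 2: controlled multimodal coupling (3 epochs, lr 1e-5).
# Unfreeze everything and jointly optimize at a much lower learning rate.
set_trainable(model, True)
opt2 = torch.optim.AdamW(model.parameters(), lr=1e-5)
# ... run 3 epochs of joint optimization here ...
```

Note that nothing architectural changes between stages; only which parameters receive gradients and at what rate, which is the whole of the protocol.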

Section 05

Experimental Evidence: DAMF Outperforms Baselines Significantly

On the UICD underwater image captioning dataset, DAMF delivers substantial gains:

| Method | BLEU-4 | CIDEr |
| --- | --- | --- |
| Pre-trained BLIP | 0.108 | 0.325 |
| Naive full fine-tuning | 0.078 | — (decoding collapse) |
| Low-learning-rate full fine-tuning | 0.269 | 0.834 |
| DAMF | 0.320 | 1.149 |

DAMF nearly triples the BLEU-4 score. Ablation experiments confirm that both stages are necessary: visual realignment alone reaches a BLEU-4 of only 0.050, joint optimization alone reaches 0.078, and only their combination achieves the best results.
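For readers unfamiliar with the headline metric, here is a minimal, unsmoothed BLEU-4 implementation (uniform 4-gram weights, closest-reference brevity penalty). Published results are computed with standard library implementations that include smoothing, so treat this purely as an illustration of what the numbers in the table measure.

```python
# Minimal BLEU-4: clipped n-gram precisions (n = 1..4) times a brevity penalty.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(references, hypothesis):
    """BLEU-4 of one hypothesis against a list of reference token lists."""
    precisions = []
    for n in range(1, 5):
        hyp_counts = ngrams(hypothesis, n)
        # Clip each n-gram count by its maximum count across the references.
        max_ref = Counter()
        for ref in references:
            for gram, c in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], c)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        precisions.append(clipped / max(sum(hyp_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0  # unsmoothed: any empty n-gram level zeroes the score
    # Brevity penalty against the closest reference length.
    ref_len = min((abs(len(r) - len(hypothesis)), len(r)) for r in references)[1]
    bp = 1.0 if len(hypothesis) > ref_len else math.exp(1 - ref_len / len(hypothesis))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)

refs = [["a", "diver", "swims", "near", "a", "coral", "reef"]]
hyp = ["a", "diver", "swims", "near", "a", "coral", "reef"]
print(bleu4(refs, hyp))  # an exact match scores 1.0
```

Because every 4-gram level must be non-zero, BLEU-4 is a demanding metric, which is part of why the jump from 0.108 to 0.320 is notable.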

Section 06

Dataset and Implementation Details

The UICD underwater image captioning dataset is used: 3176 images, each with five human-written captions, split 70/15/15 into train/validation/test sets. Its domain characteristics include wavelength-dependent attenuation, scattering, and related underwater optical effects. The code repository provides implementations of each condition: naive_finetune.py, low_lr_finetune.py, visual_only.py, and damf.py.
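The split arithmetic is straightforward; the sketch below mirrors a 70/15/15 split over 3176 items. The image IDs, seed, and shuffling policy here are assumptions for illustration, not the actual UICD protocol.

```python
# Sketch: a 70/15/15 train/val/test split over 3176 placeholder image IDs.
import random

image_ids = list(range(3176))  # placeholder IDs, not the real UICD files
rng = random.Random(42)        # assumed seed; the paper's seed is not stated
rng.shuffle(image_ids)

n = len(image_ids)
n_train = int(0.70 * n)  # 2223 images
n_val = int(0.15 * n)    # 476 images
train_ids = image_ids[:n_train]
val_ids = image_ids[n_train:n_train + n_val]
test_ids = image_ids[n_train + n_val:]  # the remaining 477 images

print(len(train_ids), len(val_ids), len(test_ids))  # 2223 476 477
```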


Section 07

Implications and Outlook

Implications of this study:

  1. The severity of the domain gap dictates the strategy: for extreme transfer, standard fine-tuning is actively harmful and finer-grained optimization is required.
  2. Gradient flow control is key: staged freezing and unfreezing manages the asymmetric propagation of cross-modal gradients.
  3. Simple optimization structures are effective: DAMF achieves significant improvements without any architectural modification.

The results have been accepted to ECCV 2026, and the code and dataset will be open-sourced upon publication to guide related research.