DAMF: Addressing Fine-tuning Failure of Vision-Language Models Under Extreme Physical Domain Transfer

When vision-language models face extreme physical domain transfer such as underwater imaging, traditional joint fine-tuning is not only ineffective but also actively degrades model performance. This article introduces the two-stage optimization protocol DAMF, which isolates visual realignment and controlled multimodal coupling to nearly triple BLEU-4 scores in underwater image captioning tasks.

Tags: Vision-Language Models, Domain Transfer, Multimodal Learning, BLIP, Underwater Imagery, Fine-tuning Optimization, ECCV 2026
Published 2026-04-25 18:31 · Recent activity 2026-04-25 18:51 · Estimated read 7 min

Section 01

Introduction: DAMF Addresses VLM Fine-tuning Failure Under Extreme Physical Domain Transfer

This article focuses on the fine-tuning failure of vision-language models (e.g., BLIP) in extreme physical domain transfer (such as underwater image captioning) and proposes the two-stage optimization protocol DAMF. By isolating visual realignment and controlled multimodal coupling, this method nearly triples BLEU-4 scores in underwater image captioning tasks, and related results have been accepted by ECCV 2026.


Section 02

Background: Domain Transfer Dilemma of Pre-trained VLMs

Vision-language models (VLMs) like BLIP, pre-trained on natural images, adapt well to similar domains via joint fine-tuning but fail under extreme physical domain transfer such as underwater imagery. The underwater environment has unique optical properties, including wavelength-dependent attenuation, scattering, turbidity, and color distortion, that make its visual statistics fundamentally different from those of terrestrial images. When BLIP is fine-tuned in the standard way, training loss decreases but caption quality stagnates or even deteriorates.


Section 03

Key Finding: Naive Fine-tuning Actively Impairs Performance

The study found that naive joint fine-tuning is not only ineffective but actively degrades model performance. The cause is asymmetric adaptation of the visual and text components driven by high-variance gradients: as the visual encoder adapts to underwater features, high-variance gradients from misaligned visual embeddings propagate into the language decoder, disrupting its pre-trained language structure. This manifests in three unstable modes: early generalization divergence, metric-loss decoupling, and optimization collapse. Experiments show the pre-trained BLIP baseline scores 0.108 BLEU-4, which drops to 0.078 after naive fine-tuning, worse than no adaptation at all.
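The asymmetric gradient flow described above can be made visible by logging per-module gradient norms during a training step. The sketch below uses a toy two-module stand-in, not the actual BLIP architecture; the module names and sizes are illustrative only.

```python
# Sketch: per-module gradient norms after one backward pass.
# High, uneven norms on the visual branch relative to the language branch
# would signal the asymmetric adaptation the paper describes.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins for a VLM's components (not the real BLIP modules).
model = nn.ModuleDict({
    "visual_encoder": nn.Linear(16, 8),
    "language_decoder": nn.Linear(8, 4),
})

x = torch.randn(32, 16)
target = torch.randn(32, 4)
out = model["language_decoder"](model["visual_encoder"](x))
loss = nn.functional.mse_loss(out, target)
loss.backward()

# Aggregate the gradient norm of each module's parameters.
for name, module in model.items():
    norm = torch.norm(torch.stack([p.grad.norm() for p in module.parameters()]))
    print(f"{name}: grad norm = {norm:.4f}")
```

In a real run one would log these norms per step; a sustained gap between the two branches is the symptom that DAMF's staged freezing is designed to prevent.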


Section 04

DAMF Method: Two-Stage Domain-Aware Multimodal Fine-tuning

DAMF is a two-stage optimization protocol that requires no architectural changes or new loss functions:

  1. Visual realignment stage: freeze the language decoder and update only the visual encoder and cross-modal projection layer (2 epochs, learning rate 5e-5), so that high-variance gradients cannot disturb the pre-trained language structure.
  2. Controlled multimodal coupling stage: unfreeze all parameters and jointly optimize at a low learning rate (3 epochs, 1e-5), constraining cross-modal gradient variance and restoring cross-modal grounding.

Key insight: for extreme domain transfer, the structure of the optimization, rather than the learning rate or model capacity, is what matters.
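The two stages above can be sketched as follows. This is a minimal illustration using a toy model: the `visual_encoder`, `projection`, and `text_decoder` attribute names are assumptions standing in for the real BLIP modules, and the training loops are elided.

```python
# Sketch of the DAMF two-stage schedule on a toy stand-in model.
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    """Minimal stand-in for BLIP's three components."""
    def __init__(self):
        super().__init__()
        self.visual_encoder = nn.Linear(16, 8)
        self.projection = nn.Linear(8, 8)    # cross-modal projection layer
        self.text_decoder = nn.Linear(8, 4)  # language decoder

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

model = ToyVLM()

# Stage 1: visual realignment (2 epochs, lr 5e-5).
# Freeze the language decoder so misaligned visual gradients cannot reach it.
set_trainable(model.text_decoder, False)
stage1_params = [p for p in model.parameters() if p.requires_grad]
opt1 = torch.optim.AdamW(stage1_params, lr=5e-5)
# ... run 2 epochs of captioning loss on underwater images here ...

# Stage 2: controlled multimodal coupling (3 epochs, lr 1e-5).
# Unfreeze everything and jointly optimize at a much lower learning rate.
set_trainable(model, True)
opt2 = torch.optim.AdamW(model.parameters(), lr=1e-5)
# ... run 3 epochs of joint optimization here ...
```

Note that nothing architectural changes between stages; only which parameters receive gradients and at what rate, which is the whole of the protocol.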

Section 05

Experimental Evidence: DAMF Outperforms Baselines Significantly

On the UICD underwater image captioning dataset, DAMF delivers substantial gains:

| Method | BLEU-4 | CIDEr |
| --- | --- | --- |
| Pre-trained BLIP | 0.108 | 0.325 |
| Naive full fine-tuning | 0.078 | — (decoding collapse) |
| Low-learning-rate full fine-tuning | 0.269 | 0.834 |
| DAMF | 0.320 | 1.149 |

DAMF nearly triples the BLEU-4 score. Ablation experiments confirm that both stages are necessary: visual realignment alone reaches a BLEU-4 of only 0.050, joint optimization alone reaches 0.078, and only their combination achieves the best results.
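For readers unfamiliar with the headline metric, here is a minimal, unsmoothed BLEU-4 implementation (uniform 4-gram weights, closest-reference brevity penalty). Published results are computed with standard library implementations that include smoothing, so treat this purely as an illustration of what the numbers in the table measure.

```python
# Minimal BLEU-4: clipped n-gram precisions (n = 1..4) times a brevity penalty.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(references, hypothesis):
    """BLEU-4 of one hypothesis against a list of reference token lists."""
    precisions = []
    for n in range(1, 5):
        hyp_counts = ngrams(hypothesis, n)
        # Clip each n-gram count by its maximum count across the references.
        max_ref = Counter()
        for ref in references:
            for gram, c in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], c)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        precisions.append(clipped / max(sum(hyp_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0  # unsmoothed: any empty n-gram level zeroes the score
    # Brevity penalty against the closest reference length.
    ref_len = min((abs(len(r) - len(hypothesis)), len(r)) for r in references)[1]
    bp = 1.0 if len(hypothesis) > ref_len else math.exp(1 - ref_len / len(hypothesis))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)

refs = [["a", "diver", "swims", "near", "a", "coral", "reef"]]
hyp = ["a", "diver", "swims", "near", "a", "coral", "reef"]
print(bleu4(refs, hyp))  # an exact match scores 1.0
```

Because every 4-gram level must be non-zero, BLEU-4 is a demanding metric, which is part of why the jump from 0.108 to 0.320 is notable.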

Section 06

Dataset and Implementation Details

The UICD underwater image captioning dataset is used: 3176 images, each with five human-written captions, split 70/15/15 into train/validation/test sets. Its domain characteristics include wavelength-dependent attenuation, scattering, and related underwater optical effects. The code repository provides implementations of each condition: naive_finetune.py, low_lr_finetune.py, visual_only.py, and damf.py.
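The split arithmetic is straightforward; the sketch below mirrors a 70/15/15 split over 3176 items. The image IDs, seed, and shuffling policy here are assumptions for illustration, not the actual UICD protocol.

```python
# Sketch: a 70/15/15 train/val/test split over 3176 placeholder image IDs.
import random

image_ids = list(range(3176))  # placeholder IDs, not the real UICD files
rng = random.Random(42)        # assumed seed; the paper's seed is not stated
rng.shuffle(image_ids)

n = len(image_ids)
n_train = int(0.70 * n)  # 2223 images
n_val = int(0.15 * n)    # 476 images
train_ids = image_ids[:n_train]
val_ids = image_ids[n_train:n_train + n_val]
test_ids = image_ids[n_train + n_val:]  # the remaining 477 images

print(len(train_ids), len(val_ids), len(test_ids))  # 2223 476 477
```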


Section 07

Implications and Outlook

Implications of this study:

  1. The severity of the domain gap dictates the strategy: for extreme transfer, standard fine-tuning is actively harmful and finer-grained optimization is required.
  2. Gradient flow control is key: staged freezing and unfreezing manages the asymmetric propagation of cross-modal gradients.
  3. Simple optimization structures are effective: DAMF achieves significant improvements without any architectural modification.

The results have been accepted to ECCV 2026, and the code and dataset will be open-sourced upon publication to guide related research.