Zing Forum


AMD-Proj: An Adaptive Memory-Driven Selective Gradient Projection Method for Continual Learning in Document Understanding

This article introduces AMD-Proj, a novel framework for continual learning in the field of document understanding. Through an adaptive memory-driven selective gradient projection mechanism, this method prevents catastrophic forgetting while maintaining model plasticity, effectively addressing the stability-plasticity dilemma faced by multimodal document understanding models when sequentially learning new tasks.

Tags: Continual Learning, Document Understanding, Gradient Projection, Catastrophic Forgetting, Multimodal Learning, LayoutLM, Adaptive Memory, Parameter-Efficient Fine-Tuning, Visual Document Understanding, Transformer Models
Published 2026-04-23 08:00 · Recent activity 2026-04-25 18:24 · Estimated read: 7 min

Section 01

AMD-Proj: Introduction to the New Framework for Continual Learning in Document Understanding

AMD-Proj is a continual learning framework for document understanding. Its adaptive memory-driven selective gradient projection mechanism prevents catastrophic forgetting while preserving model plasticity, addressing the stability-plasticity dilemma that multimodal document understanding models face when learning new tasks sequentially.


Section 02

Background and Challenges of Continual Learning in Document Understanding

Document understanding sits at the intersection of computer vision and natural language processing, with applications such as invoice parsing and form recognition. Continual learning in this setting suffers from catastrophic forgetting: naively fine-tuning on a new task degrades performance on previously learned tasks. Existing continual learning methods (e.g., EWC, LwF) perform well on general vision tasks, but document understanding tightly couples visual layout with text semantics and demands stronger multimodal fusion, so these methods face unique challenges here.


Section 03

Core Ideas of the AMD-Proj Method

AMD-Proj combines 'memory' with 'gradient projection'; its core innovation is the adaptive memory-driven selective gradient projection mechanism. Whereas traditional gradient projection methods apply a fixed strategy, AMD-Proj maintains a memory representation for each learned task (recording important parameter directions, task importance, and inter-task relationships) and adaptively selects which parameter subspaces to protect based on factors such as the similarity between the current task and historical tasks. This improves parameter utilization and balances stability against plasticity.
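The mechanism can be illustrated with a minimal NumPy sketch: a memory entry stores an orthonormal basis of directions important to a past task, the overlap between the new task's gradient and that subspace drives the protect-or-not decision, and protection means projecting the gradient onto the subspace's orthogonal complement. The function names and the 0.5 threshold below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def project_gradient(grad, basis, protect):
    """If protected, remove the gradient component lying in span(basis columns)."""
    if not protect:
        return grad  # plastic update: leave the gradient untouched
    return grad - basis @ (basis.T @ grad)

def task_similarity(grad, basis):
    """Fraction of gradient energy inside the memorized subspace (0 = orthogonal)."""
    proj = basis @ (basis.T @ grad)
    return float(np.linalg.norm(proj) / (np.linalg.norm(grad) + 1e-12))

# Toy 3-D example: an old task's memory protects the e1 direction.
basis = np.array([[1.0], [0.0], [0.0]])   # orthonormal columns, shape (3, 1)
grad = np.array([2.0, 1.0, -1.0])

sim = task_similarity(grad, basis)        # high overlap triggers protection
projected = project_gradient(grad, basis, protect=sim > 0.5)
```

Here the gradient overlaps heavily with the protected direction, so its e1 component is zeroed while the remaining components pass through unchanged.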


Section 04

In-depth Analysis of the AMD-Proj Technical Mechanism

Hierarchical Gradient Projection Strategy

AMD-Proj maintains independent parameter subspaces for different layers of Transformer-based document understanding models (e.g., LayoutLMv2/v3). Shallow layers, which capture low-level features, retain high plasticity, while deep layers, which encode high-level semantics, receive stronger protection, enabling fine-grained control.
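A hedged sketch of depth-dependent protection: a per-layer coefficient interpolates between mostly-plastic shallow layers and strongly protected deep layers, and only that fraction of the in-subspace gradient component is removed. The linear schedule and the alpha_min/alpha_max values are assumptions for illustration, not the paper's schedule.

```python
import numpy as np

def layer_projection_strength(layer_idx, num_layers, alpha_min=0.2, alpha_max=1.0):
    """Linearly increase protection with depth: shallow layers stay plastic,
    deep layers are (almost) fully protected."""
    frac = layer_idx / max(num_layers - 1, 1)
    return alpha_min + (alpha_max - alpha_min) * frac

def soft_project(grad, basis, alpha):
    """Remove only a fraction `alpha` of the in-subspace gradient component."""
    return grad - alpha * (basis @ (basis.T @ grad))

# 12 encoder layers, as in LayoutLMv3-base.
strengths = [layer_projection_strength(i, 12) for i in range(12)]

# Half-strength projection of a 2-D gradient against the e1 direction.
g = soft_project(np.array([1.0, 1.0]), np.array([[1.0], [0.0]]), alpha=0.5)
```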

Truncated SVD and Spectral Analysis

Truncated SVD approximates each task's parameter subspace, reducing storage and filtering out noise; spectral analysis of the singular values gauges task complexity and specificity, informing the gradient projection decisions.
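How a truncated-SVD memory might be built, as a sketch: stack gradient (or activation) samples from a task, keep the fewest left-singular vectors that capture a target fraction of the spectral energy, and read the retained rank as a crude complexity signal. The 95% energy threshold is an illustrative assumption.

```python
import numpy as np

def build_memory_basis(samples, energy=0.95):
    """Keep the fewest left-singular vectors whose cumulative squared
    singular values reach `energy`; the rank k doubles as a rough
    task-complexity signal (more directions -> more complex task)."""
    U, S, _ = np.linalg.svd(samples, full_matrices=False)
    cum = np.cumsum(S**2) / np.sum(S**2)
    k = int(np.searchsorted(cum, energy) + 1)
    return U[:, :k], k

# Toy data: 20 gradient samples in 50-D that alternate between two directions.
d, n = 50, 20
samples = np.zeros((d, n))
samples[0, 0::2] = 1.0   # even columns along e1
samples[1, 1::2] = 1.0   # odd columns along e2
basis, rank = build_memory_basis(samples)
```

The recovered basis has two orthonormal columns, matching the true rank of the toy data, while discarding the other 18 noise-free but redundant dimensions.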

Task Incremental Learning Setup

AMD-Proj is optimized for the task-incremental scenario: the model learns clearly delineated tasks in sequence (e.g., different document types) and uses the task-identity signal to drive its adaptive decisions.
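A toy task-incremental loop showing how the pieces interact: after each task, its protected basis is added to memory, and every later gradient step first projects away all memorized directions. The quadratic toy objectives below stand in for real document-understanding losses; the loop structure is a hypothetical sketch, not the paper's training code.

```python
import numpy as np

w = np.zeros(3)
memories = {}  # task_id -> orthonormal basis (d x k) of protected directions

def sgd_step(grad, lr=0.1):
    """One update; the gradient is first projected away from every memorized subspace."""
    global w
    for basis in memories.values():
        grad = grad - basis @ (basis.T @ grad)
    w = w - lr * grad

# Task A: quadratic loss pulling w[0] toward 1 (gradient lies along e1).
for _ in range(100):
    sgd_step(np.array([w[0] - 1.0, 0.0, 0.0]))
memories["A"] = np.array([[1.0], [0.0], [0.0]])  # remember e1 as protected
w_after_A = w.copy()

# Task B: quadratic loss pulling every coordinate toward 2.
# Projection shields w[0], so task A's solution survives.
for _ in range(100):
    sgd_step(w - 2.0)
```

After task B, the unprotected coordinates have moved to the new optimum while the protected coordinate is untouched, which is exactly the anti-forgetting behavior the projection is meant to provide.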


Section 05

Experimental Validation and Result Analysis of AMD-Proj

Evaluation Datasets and Benchmarks

Evaluated on four datasets: FUNSD (forms), CORD (receipts), SROIE (invoices), and BuDDIE (business documents). Baselines include classic regularization methods (EWC, LwF), a document-understanding-specific method (CUBER), and gradient projection methods (GPM, TRGP).

Key Findings

AMD-Proj outperforms existing methods in F1 score on all four datasets, with an average improvement of 3-5 percentage points, and shows strong resistance to forgetting, with minimal performance decay even on the earliest tasks.

Ablation Experiments

Removing the adaptive selection strategy leads to decreased parameter efficiency; removing the memory mechanism causes severe forgetting; hierarchical projection is superior to the global strategy.


Section 06

Practical Application Value and Deployment Considerations of AMD-Proj

Enterprise Document Automation

Supports incremental learning of new document types, avoiding both model fragmentation and the high cost of full retraining, which lowers system maintenance overhead.

Parameter Efficiency and Computational Overhead

Thanks to truncated SVD and selective projection, the additional storage requirement is small; projection applies only during training, so inference incurs no extra computation and latency is unchanged.
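Back-of-envelope arithmetic for the storage claim, under assumed numbers (rank-32 bases, a 12-layer model with hidden size 768 as in LayoutLMv3-base, four tasks): truncated bases cost a small fraction of what full per-layer subspace matrices would.

```python
def memory_footprint_bytes(d_model, rank, num_layers, num_tasks, bytes_per_float=4):
    """Storage for one rank-`rank` orthonormal basis per layer per task,
    versus storing a full d_model x d_model subspace matrix instead."""
    truncated = num_tasks * num_layers * d_model * rank * bytes_per_float
    full = num_tasks * num_layers * d_model * d_model * bytes_per_float
    return truncated, full

# Assumed numbers: hidden size 768, 12 layers, rank-32 bases, 4 tasks, float32.
trunc, full = memory_footprint_bytes(d_model=768, rank=32, num_layers=12, num_tasks=4)
# trunc is ~4.5 MiB; the full matrices would be d_model/rank = 24x larger.
```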

Interpretability and Controllability

The structure of the memory subspaces makes task representations inspectable, and manual intervention interfaces (e.g., adjusting per-task protection weights) support the controllability needs of high-stakes scenarios.


Section 07

Limitations and Future Outlook of AMD-Proj

Limitations

AMD-Proj currently targets task-incremental learning; its effectiveness in class- and domain-incremental scenarios remains to be verified. It also assumes tasks are of similar importance and provides no explicit priority control.

Future Directions

Combine AMD-Proj with parameter-efficient fine-tuning techniques (e.g., LoRA, Adapters); extend it to continual learning for multimodal large models (e.g., GPT-4V, Gemini); and explore explicit task-priority control mechanisms.