Section 01
LinguDistill: Restoring Language Capabilities of Vision-Language Models via Cross-Modal Distillation (Introduction)
When a pre-trained language model is adapted into a vision-language model (VLM), its language capabilities often degrade due to representation shift and cross-modal interference. LinguDistill proposes an adapter-free distillation method: it shares the inter-layer KV cache with the frozen original language model, which serves as the teacher, and performs selective distillation on language-dense data. This recovers approximately 10% of the lost language performance without affecting visual capabilities.
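To make the selective-distillation idea concrete, the sketch below shows a minimal token-level distillation loss: the frozen teacher's softened output distribution supervises the student only on tokens flagged as language-dense, while other tokens are skipped. This is an illustrative sketch in pure Python, not LinguDistill's actual implementation; the function names, the binary `mask`, and the temperature value are assumptions for exposition.

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax over one token's vocabulary logits.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def selective_distill_loss(teacher_logits, student_logits, mask, temperature=2.0):
    """KL(teacher || student), averaged over tokens where mask == 1.

    teacher_logits, student_logits: per-token lists of vocab logits.
    mask: 1 for language-dense tokens to distill, 0 for tokens to skip
          (hypothetical selection criterion; the paper's own selection
          rule may differ).
    """
    total, count = 0.0, 0
    for t, s, m in zip(teacher_logits, student_logits, mask):
        if not m:
            continue  # skip tokens not selected for distillation
        p = softmax(t, temperature)  # frozen teacher's soft targets
        q = softmax(s, temperature)  # student's current distribution
        kl = sum(pi * (math.log(pi) - math.log(qi))
                 for pi, qi in zip(p, q) if pi > 0)
        total += kl * temperature ** 2  # standard temperature scaling
        count += 1
    return total / count if count else 0.0
```

When the student matches the teacher exactly, the loss is zero; it grows as the student's distribution drifts, so minimizing it pulls the student's language head back toward the frozen original model only on the selected tokens.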