Section 01
FusionVLM Overview: A RAG-Augmented Visual Language Model for Image Captioning
FusionVLM is a visual language model for image captioning that combines multimodal retrieval with a custom bidirectional fusion block. For each input image, it retrieves visually or semantically similar images and their descriptions from a dataset. The goal is to improve caption quality, reduce hallucination, and generalize better to unseen images.
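The retrieval step can be illustrated with a minimal sketch. This is not the repository's implementation: the function names (`cosine`, `retrieve_top_k`), the toy 3-dimensional embeddings, and the example captions are all hypothetical, and a real system would use embeddings from an image or text encoder plus a proper vector index. It only shows the general idea of ranking stored captions by cosine similarity to a query embedding.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_emb, index, k=2):
    """Return the k captions whose embeddings are most similar to the query.

    `index` is a list of (embedding, caption) pairs (hypothetical layout).
    """
    scored = [(cosine(query_emb, emb), caption) for emb, caption in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [caption for _, caption in scored[:k]]

# Toy index: hand-written embeddings stand in for real encoder outputs.
index = [
    ([1.0, 0.0, 0.0], "a dog running on grass"),
    ([0.9, 0.1, 0.0], "a puppy playing in a park"),
    ([0.0, 1.0, 0.0], "a plate of pasta"),
    ([0.0, 0.0, 1.0], "a city skyline at night"),
]

query_emb = [1.0, 0.05, 0.0]  # assumed embedding of the input image
retrieved = retrieve_top_k(query_emb, index, k=2)
print(retrieved)
```

In a retrieval-augmented captioner, captions retrieved this way would be passed to the model as extra context alongside the input image; how FusionVLM's fusion block consumes them is not covered by this sketch.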
Basic Info:
- Author/Maintainer: Mahan-M47
- Source: GitHub (link: https://github.com/Mahan-M47/FusionVLM-Retrieval-Augmented-Image-Captioning)
- Release Date: 2026-05-28