Section 01
MINOS Introduction: Core Overview of the Multimodal Evaluation Model for Bidirectional Image-Text Generation
MINOS (Multimodal Evaluation Model for Bidirectional Generation) is a multimodal evaluation model designed for bidirectional image-text generation tasks. It targets the limitations of traditional evaluation methods on such tasks: the semantic gap between modalities, alignment challenges, and the lack of bidirectional consistency. Its design follows three principles: semantics first, bidirectional alignment, and human perception. Architecturally, it combines a dual-tower structure (a vision tower and a language tower), a cross-modal alignment module, and multiple evaluation heads to provide unified, reliable, and fine-grained evaluation. It supports the assessment of quality, faithfulness, and consistency for tasks such as image captioning and text-to-image generation, and it is intended for scenarios such as model development and content quality control.
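To make the data flow through these components concrete, here is a minimal sketch in Python. It is not the MINOS implementation. It assumes the two towers have already produced fixed-length embeddings, stands in cosine similarity for the cross-modal alignment module, and models each evaluation head as a hypothetical (scale, bias) pair followed by a sigmoid. The names `align`, `evaluate`, and `HEADS`, along with all numeric values, are illustrative assumptions.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]

# Hypothetical head parameters (scale, bias). A real model would learn
# these; the values here are placeholders chosen only for illustration.
HEADS: Dict[str, Tuple[float, float]] = {
    "quality": (4.0, -1.0),
    "faithfulness": (6.0, -2.0),
    "consistency": (5.0, -1.5),
}


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0] * len(v)
    return [x / norm for x in v]


def align(image_emb: Vector, text_emb: Vector) -> float:
    """Stand-in for the cross-modal alignment module: cosine similarity."""
    if len(image_emb) != len(text_emb):
        raise ValueError("embeddings must share one dimensionality")
    a, b = l2_normalize(image_emb), l2_normalize(text_emb)
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def evaluate(
    image_emb: Vector,
    text_emb: Vector,
    heads: Dict[str, Tuple[float, float]] = HEADS,
) -> Dict[str, float]:
    """Map one shared alignment score to a score per evaluation head."""
    score = align(image_emb, text_emb)
    return {
        name: sigmoid(scale * score + bias)
        for name, (scale, bias) in heads.items()
    }


if __name__ == "__main__":
    # Embeddings assumed to come from the vision and language towers.
    print(evaluate([1.0, 0.0, 1.0], [2.0, 0.0, 2.0]))
```

In this sketch, the same function serves both directions: for image captioning the text embedding comes from the generated caption, and for text-to-image generation the image embedding comes from the generated image. Every head reads the same alignment score here, which is a simplification; separate heads would normally draw on richer shared features.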