Section 01
GRAMformer: A New Transformer Architecture Breaking the Limits of Multimodal Interactions
Key Highlights of GRAMformer
GRAMformer introduces the Volumetric Multimodal Cross-Attention (VMA) mechanism, which removes a limitation of traditional Transformers: their attention can only model pairwise interactions between modalities. VMA instead computes the volume formed by a query vector together with the key vectors from multiple modalities, which lets it model joint modality dependencies of any order and offers a new direction for multimodal learning.
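To make the volume idea concrete, here is a minimal NumPy sketch. It is not the paper's implementation, and this summary does not give the exact formulation. The sketch assumes that the volume is that of the parallelotope spanned by the unit-normalized query and one key per modality, computed as the square root of the Gram determinant, and that a smaller volume (vectors that are more closely aligned) should receive a larger attention weight through a softmax over negative volumes. The function names `gram_volume` and `volumetric_attention_weights` are hypothetical.

```python
import numpy as np


def gram_volume(vectors: np.ndarray) -> float:
    """Volume of the parallelotope spanned by the rows of `vectors` (k x d).

    Uses the identity volume = sqrt(det(G)), where G = V V^T is the Gram matrix.
    """
    gram = vectors @ vectors.T
    det = np.linalg.det(gram)
    # Guard against tiny negative values caused by floating-point round-off.
    return float(np.sqrt(max(det, 0.0)))


def volumetric_attention_weights(query, keys_per_modality):
    """Attention weights over n positions for one query (illustrative only).

    query: array of shape (d,)
    keys_per_modality: list of arrays of shape (n, d), one per modality,
        where row i of each array refers to the same position.
    Assumption: weights = softmax(-volume), so aligned vectors score higher.
    """
    n = keys_per_modality[0].shape[0]
    q = query / np.linalg.norm(query)
    volumes = np.empty(n)
    for i in range(n):
        rows = [q] + [k[i] / np.linalg.norm(k[i]) for k in keys_per_modality]
        volumes[i] = gram_volume(np.stack(rows))
    scores = -volumes
    scores = scores - scores.max()  # numerical stability for the softmax
    weights = np.exp(scores)
    return weights / weights.sum()


# Toy example with two key modalities in a 3-dimensional space.
e1, e2, e3 = np.eye(3)
query = e1
keys_a = np.stack([e1, e2])  # modality A: position 0 aligned with the query
keys_b = np.stack([e1, e3])  # modality B: position 0 aligned with the query
weights = volumetric_attention_weights(query, [keys_a, keys_b])
print(weights)
```

In this toy setup, position 0 (query and both keys pointing the same way, volume 0) receives more weight than position 1 (three mutually orthogonal vectors, volume 1). Because the Gram matrix accepts any number of rows, the same score extends from two modalities to three or more without changing the code, which is the property the "any-order" claim refers to.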
Basic Information
- Original Authors: arXiv Team
- Source Platform: arXiv
- Original Paper Title: GRAMformer: Any-Order Modality Interactions via Volumetric Multimodal Cross-Attention
- Original Link: http://arxiv.org/abs/2606.06249v1
- Publication Date: June 4, 2026