Section 01
[Introduction] CAVG: A New Scheme for Autonomous Driving Visual Grounding Integrating GPT-4 and a Cross-Modal Attention Mechanism
This article introduces the CAVG (Context-Aware Visual Grounding) model, which combines the GPT-4 large language model with a five-encoder architecture to perform high-precision multimodal visual grounding in autonomous driving scenarios, achieving state-of-the-art performance on the Talk2Car dataset. Its core innovation is pairing GPT-4's semantic understanding with a cross-modal attention mechanism to solve the key problem of mapping natural-language commands to target objects in the visual scene.
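To make the cross-modal attention idea concrete, below is a minimal, illustrative sketch of scaled dot-product attention where text-token embeddings query a set of visual-region embeddings. This is not CAVG's actual implementation; the function name, dimensions, and random inputs are all assumptions for demonstration.

```python
import numpy as np

def cross_modal_attention(text_emb, region_emb):
    """Text queries attend over visual region keys/values
    via scaled dot-product attention (illustrative only)."""
    d_k = text_emb.shape[-1]
    # (T, R) similarity between each text token and each region
    scores = text_emb @ region_emb.T / np.sqrt(d_k)
    # softmax over regions (numerically stabilized)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # attended visual features per text token, plus the attention map
    return weights @ region_emb, weights

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 64))      # e.g. 4 command tokens
regions = rng.normal(size=(8, 64))   # e.g. 8 candidate regions
fused, attn = cross_modal_attention(text, regions)
print(fused.shape, attn.shape)       # (4, 64) (4, 8)
```

Each row of the attention map sums to 1, so a grounding head can read it as a soft distribution over candidate regions for each token of the command.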