Section 01
[Overview] A Multimodal Prediction Model for Handling Missing Modalities: Robust Representation Learning Based on Attention Mechanisms
This paper proposes a multimodal prediction model that handles missing modalities in both the training and inference phases. Built on the conditional variational autoencoder (CVAE) and Transformer architectures, it uses attention mechanisms to learn a unified, robust representation, and it outperforms previous methods on human trajectory prediction and robot manipulation prediction tasks. The model addresses the sharp performance degradation that conventional multimodal models suffer when modalities are missing, which makes it a practical option for deployment on real robot systems.
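To make the idea of attention-based handling of missing modalities concrete, here is a minimal sketch of one common technique: masking absent modalities out of a softmax attention pool, so the fused representation depends only on the modalities that were actually observed. This is not the paper's architecture (the overview does not specify it). The function name `masked_attention_fusion`, the single `query` vector, and the boolean `present` flags are all assumptions made for illustration.

```python
import math


def masked_attention_fusion(tokens, present, query):
    """Attention-pool per-modality embeddings, skipping missing modalities.

    Hypothetical helper for illustration only.
    tokens:  list of M embeddings, each a list of D floats
    present: list of M booleans, True where the modality was observed
    query:   list of D floats (stands in for a learned query vector)
    Returns (fused_embedding, attention_weights).
    """
    if not any(present):
        raise ValueError("at least one modality must be present")
    dim = len(query)
    scale = math.sqrt(dim)
    # Scaled dot-product score per modality; missing ones get -inf,
    # so their softmax weight is exactly zero.
    scores = [
        sum(t * q for t, q in zip(tok, query)) / scale if ok else -math.inf
        for tok, ok in zip(tokens, present)
    ]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum over observed modalities only; placeholder rows are never read.
    fused = [
        sum(w * tok[d] for w, tok, ok in zip(weights, tokens, present) if ok)
        for d in range(dim)
    ]
    return fused, weights


# Three modalities, the third one missing (its row is just a placeholder).
tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
fused, weights = masked_attention_fusion(tokens, [True, True, False], [0.0, 0.0])
```

Because the missing modality receives zero weight, whatever sits in its placeholder row cannot leak into the fused vector. A full Transformer would apply the same kind of mask inside every attention layer rather than in a single pooling step.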