Section 01
MoTVLA: An Architecture for Enhancing Spatial Reasoning in VLA Models
MoTVLA is a Vision-Language-Action (VLA) model built on the Mamba architecture. It addresses a gap in traditional VLA models, which lack an explicit spatial verification mechanism, through two components: a Gaussian Spatial Tokenizer (GST) and Depth-Aware Chain-of-Thought (DA-CoT). With these, it achieves an average success rate of 90% on the LIBERO benchmark while maintaining real-time inference speed on a single GPU.
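To make the two components more concrete, here is a minimal Python sketch of what a Gaussian spatial tokenizer and a depth-based verification step could look like. The summary above names GST and DA-CoT but gives no internals, so everything here is an assumption for illustration: the per-region Gaussian parameterization, the token_dim size, and the depth_consistent helper are hypothetical, not MoTVLA's published design.

```python
import torch
import torch.nn as nn


class GaussianSpatialTokenizer(nn.Module):
    """Hypothetical sketch: each candidate region becomes one token built
    from a 2D Gaussian (mean, log-std) over the image plane plus a depth
    value, giving downstream layers an explicit spatial representation."""

    def __init__(self, token_dim: int = 256):
        super().__init__()
        # 5 features per region: mu_x, mu_y, log_sigma_x, log_sigma_y, depth
        self.proj = nn.Linear(5, token_dim)

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (batch, num_regions, 5) -> (batch, num_regions, token_dim)
        return self.proj(regions)


def depth_consistent(pred_xy, depth_map, expected_depth, tol=0.05):
    """Hypothetical verification step in the spirit of DA-CoT: reject an
    action target whose observed depth disagrees with the reasoned depth."""
    x, y = pred_xy
    return abs(depth_map[y, x].item() - expected_depth) < tol


tokenizer = GaussianSpatialTokenizer()
regions = torch.rand(1, 4, 5)             # 4 candidate regions, one image
tokens = tokenizer(regions)               # shape: (1, 4, 256)
depth_map = torch.full((224, 224), 0.42)  # dummy depth image (meters)
print(tokens.shape, depth_consistent((112, 112), depth_map, 0.40))
```

The key idea this sketch tries to capture is that spatial location is fed to the model as structured tokens rather than left implicit in image features, and that a depth check can explicitly veto a spatially inconsistent action; how MoTVLA actually implements either step would need to be confirmed against the original paper.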