Zing Forum

TempoVLA: A Vision-Language-Action Model for Robots to Execute Tasks with Controllable Speed

Researchers propose a speed-controllable VLA model that enables robots to move quickly in low-risk phases and slow down for precise operations in high-risk contact phases.

Tags: Vision-Language-Action Model, Robot Control, Speed Control, Trajectory Augmentation, Dynamic Execution
Published 2026-06-05 01:59 | Recent activity 2026-06-05 18:19 | Estimated read 7 min

Section 01

TempoVLA: Guide to the Speed-Controllable Vision-Language-Action Model

Key Highlights of TempoVLA

The research team proposes TempoVLA to address a limitation of existing Vision-Language-Action (VLA) models: they execute at a single fixed speed. TempoVLA lets a robot move quickly through low-risk phases and slow down for precise operations in high-risk contact phases. Its core insight is that motion amplitude determines execution speed. Flexible speed control comes from two components: Variable-Speed Trajectory Augmentation (VSTA) on the data side and a speed conditioning mechanism on the model side. The approach is validated in both simulation and real-world tasks, providing a new foundation for robot manipulation systems.

Original Authors/Source

  • Author Team: Paper author team
  • Source: arXiv
  • Original Title: TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies
  • Link: http://arxiv.org/abs/2606.06491v1
  • Publication Date: June 4, 2026

Section 02

Problem Background: Limitations of Fixed-Speed VLA

Robot manipulation includes low-risk transition phases (e.g., moving to a target) and high-risk contact phases (e.g., grasping and assembly). Humans adjust their speed dynamically between these phases, but existing VLA models simply inherit the single fixed speed of their training demonstrations.

Shortcomings of Existing Solutions

Previous methods for accelerating VLA models (model compression, KV cache reuse, reinforcement learning fine-tuning) can only switch between fixed speeds and cannot adjust speed dynamically. Deceleration, moreover, has not been fully explored, which makes precise slow execution in high-risk phases difficult.

Section 03

TempoVLA Architecture: Dual Components for Speed Control

Core Insight

Motion amplitude (the amount of pose change of the joints or end-effector per action) determines the robot's movement speed. With the control rate held fixed, larger amplitudes cover the same motion in fewer steps (faster), while smaller amplitudes need more steps and take longer (slower).

1. Data Side: Variable-Speed Trajectory Augmentation (VSTA)

  • Acceleration: Merge adjacent actions to increase amplitude and complete movement quickly
  • Deceleration: Split actions to reduce amplitude and execute slowly
  • Effect: Preserves motion semantics, accurately reaches target speed, and improves default performance at 1x speed
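
The merge and split operations above can be illustrated with a minimal sketch. It assumes each action is an additive delta (for example, an end-effector translation), so that summing adjacent actions or dividing one action into equal parts preserves the total displacement. The function names `merge_actions` and `split_actions` are hypothetical and are not taken from the paper; rotations, gripper commands, and the re-pairing of observations with the resampled actions are left out.

```python
import numpy as np


def merge_actions(deltas: np.ndarray, factor: int) -> np.ndarray:
    """Speed up by an integer factor: sum each group of `factor` adjacent deltas.

    Assumes `deltas` has shape (num_steps, action_dim) and that deltas are additive.
    """
    num_steps, action_dim = deltas.shape
    # Pad with zero-motion steps so the length is a multiple of `factor`.
    pad = (-num_steps) % factor
    padded = np.vstack([deltas, np.zeros((pad, action_dim))])
    return padded.reshape(-1, factor, action_dim).sum(axis=1)


def split_actions(deltas: np.ndarray, factor: int) -> np.ndarray:
    """Slow down by an integer factor: replace each delta with `factor` equal parts."""
    return np.repeat(deltas / factor, factor, axis=0)


# Illustrative 5-step trajectory of 2-D deltas (made-up numbers).
demo = np.array([[1.0, 0.0], [1.0, 0.0], [2.0, 2.0], [0.0, 2.0], [1.0, 1.0]])
fast = merge_actions(demo, 2)   # 3 steps, larger amplitude per step
slow = split_actions(demo, 2)   # 10 steps, smaller amplitude per step
```

Both variants end at the same pose as the original trajectory; only the number of steps, and therefore the per-step amplitude, changes.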

2. Model Side: Speed Conditioning Mechanism

Feed the target speed as an explicit input to the policy network to generate actions with corresponding amplitudes, enabling flexible speed control.
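
One way to picture speed conditioning is as an extra input feature. The sketch below is an assumption for illustration, not the paper's architecture: it appends the base-2 logarithm of the target speed to an observation feature vector, so that 1x maps to 0 and 0.5x and 2x are symmetric around it. The names `make_policy_input` and `build_training_pairs` are hypothetical.

```python
import numpy as np


def make_policy_input(obs_features: np.ndarray, speed: float) -> np.ndarray:
    """Append a speed feature to a 1-D observation feature vector.

    Using log2(speed) is an illustrative choice, not taken from the source.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    return np.concatenate([obs_features, [np.log2(speed)]])


def build_training_pairs(obs_features: np.ndarray, actions: np.ndarray, speed: float):
    """Label every observation of one augmented trajectory with its speed factor.

    `obs_features` has shape (num_steps, feature_dim); `actions` holds the
    target actions of the trajectory that was resampled to `speed`.
    """
    inputs = np.stack([make_policy_input(o, speed) for o in obs_features])
    return inputs, actions


# Example: the same observation conditioned on three target speeds.
obs = np.array([0.1, 0.2])
x_slow = make_policy_input(obs, 0.5)
x_default = make_policy_input(obs, 1.0)
x_fast = make_policy_input(obs, 2.0)
```

A policy trained on such pairs sees the speed label alongside each observation, which is what allows one set of weights to produce different action amplitudes on request.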

Section 04

Experimental Validation: Results from Simulation to Real World

Bidirectional Speed Control

  • Low-risk transition phase: Fast movement saves time
  • High-risk contact phase: Slow execution improves success rate

Dynamic Speed Adjustment

Cooperation with Large Multimodal Models (LMM):

  • LMM analyzes the scene to determine risk level and sends speed commands (e.g., slow down when approaching the target, speed up when moving away from obstacles)
  • This hierarchical design pairs high-level scene understanding with low-level motion control and suggests a path toward end-to-end systems.
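
The hierarchical loop described above can be sketched as follows. Everything here is a stand-in: `assess_risk` plays the role of the LMM, `policy` plays the role of the speed-conditioned VLA, and the risk labels, speed values, and `replan_every` interval are hypothetical. Querying the risk estimator only every few steps is one simple way to limit how often the slower high-level model is called.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical mapping from a risk label to a target speed multiplier.
SPEED_BY_RISK: Dict[str, float] = {"low": 2.0, "high": 0.5}


def choose_speed(risk: str, default: float = 1.0) -> float:
    """Return the speed for a risk label, falling back to the default speed."""
    return SPEED_BY_RISK.get(risk, default)


def run_episode(
    observations: Sequence[dict],
    assess_risk: Callable[[dict], str],
    policy: Callable[[dict, float], float],
    replan_every: int = 5,
) -> List[Tuple[float, float]]:
    """Run the low-level policy, refreshing the speed command every `replan_every` steps."""
    speed = 1.0
    log = []
    for step, obs in enumerate(observations):
        if step % replan_every == 0:
            speed = choose_speed(assess_risk(obs))
        log.append((speed, policy(obs, speed)))
    return log


# Stubs standing in for the LMM and the speed-conditioned policy.
def stub_risk(obs: dict) -> str:
    return "high" if obs["distance_to_target"] < 0.05 else "low"


def stub_policy(obs: dict, speed: float) -> float:
    return 0.01 * speed  # per-step amplitude scales with the requested speed


observations = [{"distance_to_target": d} for d in (0.5, 0.4, 0.3, 0.04, 0.02, 0.01)]
episode = run_episode(observations, stub_risk, stub_policy, replan_every=3)
```

In this toy run the robot moves at the fast setting while far from the target and switches to the slow setting once the stub estimator reports a high-risk phase.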

Section 05

Technical Contributions and Engineering Significance

Theoretical Aspect

  • Reveals the essential relationship between motion amplitude and execution speed
  • Proposes a new paradigm for variable-speed learning (data augmentation instead of modifying model structure)

Engineering Aspect

  • A single model supports multiple speeds without training multiple models
  • Speed conditioning is plug-and-play, easy to integrate into existing VLA architectures
  • VSTA improves data utilization and raises baseline performance at the default 1x speed

Application Scenarios

  • Industrial assembly: Fast approach + slow assembly
  • Service robots: Dynamically adjust speed based on environmental complexity
  • Medical robots: Extremely slow execution for high-risk operations, fast movement in transition phases

Section 06

Limitations and Future Research Directions

Current Limitations

  1. Speed range is limited by the coverage of training data
  2. Poor generalization for extreme speeds (beyond training distribution)
  3. Dynamic control relies on LMM scene analysis, which may increase inference latency

Future Research

  • Combine reinforcement learning to optimize speed strategies
  • Explore self-supervised variable-speed learning without speed labels
  • Extend to complex robot forms such as humanoid and soft robots