Zing Forum

Authomated-Assistant: Mapless Visual Navigation Robot Enabling Autonomous Pathfinding for Office Assistants

An indoor navigation system built on Comma Body v2 and a Vision Language Model (VLM). It navigates autonomously by recognizing visual landmarks, with no pre-built map and no lidar, and shows how a VLM can be applied to robotics.

Robot navigation, Vision language model, VLM, Comma Body, Mapless navigation, Embodied intelligence, Indoor robots
Published 2026-03-28 15:06 | Recent activity 2026-03-28 15:20 | Estimated read 5 min

Section 01

Introduction: Authomated-Assistant, a Mapless Visual Navigation Robot

Authomated-Assistant is an indoor navigation system built on the Comma Body v2 robot platform and a Vision Language Model (VLM). It navigates autonomously by recognizing visual landmarks, without a pre-built map or lidar. Because it needs only cameras and a GPU for inference, it lowers the hardware barrier for robot navigation and illustrates what large vision language models can contribute to embodied intelligence.

Section 02

Project Background and Core Challenges

Comma Body v2 is an open-source, self-balancing two-wheeled robot platform developed by the Comma.ai community and originally used for autonomous driving research. Using it for indoor navigation raises several challenges: the environment changes frequently, lighting conditions are complex, and no GPS signal is available. Traditional solutions handle this with expensive sensors and tedious map construction. The project instead relies on the scene understanding capability of a VLM, so that the robot navigates by recognizing landmarks.

Section 03

System Architecture and Technical Approach

Authomated-Assistant uses a layered design. The visual perception layer runs a VLM (e.g., Moondream2) on an eGPU to recognize landmarks described in natural language; detection is zero-shot, so no additional training data is required. The motion control layer uses a PID controller to adjust steering and speed. On top of these, a search strategy makes the robot rotate automatically to look for the target whenever it is lost, which improves robustness.
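
The interaction between the motion control layer and the search strategy can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the `PID` class, the `control_step` function, the gains, the speeds, and the convention that a positive turn rate means "turn right" are all assumptions made for the example. It assumes the perception layer reports the landmark's horizontal position as a normalized image coordinate, or `None` when nothing is detected.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PID:
    """Textbook PID controller. Gains are illustrative, not the project's."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: Optional[float] = None

    def step(self, error: float, dt: float) -> float:
        self.integral += error * dt
        # No derivative term on the first sample after a reset.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

    def reset(self) -> None:
        self.integral = 0.0
        self.prev_error = None


def control_step(
    target_x: Optional[float],
    steer_pid: PID,
    dt: float = 0.1,
    cruise_speed: float = 0.3,      # hypothetical value
    search_turn_rate: float = 0.5,  # hypothetical value
) -> Tuple[float, float]:
    """Return (forward_speed, turn_rate) for one control tick.

    target_x is the landmark's horizontal centre in normalised image
    coordinates (0.0 = left edge, 1.0 = right edge), or None when the
    detector did not find the landmark in the current frame.
    """
    if target_x is None:
        # Target lost: stop driving forward and rotate in place to search.
        steer_pid.reset()
        return 0.0, search_turn_rate

    error = target_x - 0.5  # positive when the landmark is right of centre
    turn = steer_pid.step(error, dt)
    # Slow down while the landmark is far from the image centre.
    speed = cruise_speed * max(0.0, 1.0 - 2.0 * abs(error))
    return speed, turn
```

Calling `control_step` once per camera frame with the detector's output yields the two behaviours described above: steering toward the landmark while it is visible, and rotating in place once it is lost. Resetting the controller on loss keeps a stale integral term from skewing the first correction after the target is found again.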

Section 04

Advanced Features and Technology Stack

The project integrates the Google Gemini API for scene analysis and supports voice interaction through text-to-speech (TTS). The hardware consists of a Comma Four (three cameras and the main controller), the Comma Body v2 chassis, and an eGPU. On the software side, Python handles the control logic and TypeScript/JavaScript the web services, with React, Vite, and Tailwind CSS on the frontend and Express on the backend. Cereal and BodyJim serve as middleware.

Section 05

Application Scenarios and Demo Features

The web dashboard provides task selection (e.g., navigating to a colleague's location), real-time AI logs, Gemini scene analysis, a voice mode switch, and hardware telemetry (battery, balance, and camera status). Together these make the robot a practical prototype of an office assistant.

Section 06

Open Source Value and Future Outlook

This project was developed for the comma_hack 6 hackathon and is fully open source, with clear code and complete documentation. Its value lies in three areas: as a technical demonstration of VLM and robot integration, as an architecture reference for layered reasoning and control, and as a community project that welcomes contributions such as improvements to VLM accuracy and PID parameter tuning. In the future, the approach could be applied to warehouse logistics, service robots, and home scenarios, making robot technology more widely accessible.