# Authomated-Assistant: Mapless Visual Navigation Robot Enabling Autonomous Pathfinding for Office Assistants

> An indoor navigation system built on Comma Body v2 and a Vision Language Model (VLM). It navigates autonomously by recognizing visual landmarks, with no pre-built map and no lidar, and shows how a VLM can be applied in robotics.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-03-28T07:06:22.000Z
- Last activity: 2026-03-28T07:20:57.790Z
- Popularity: 139.8
- Keywords: robot navigation, vision language model, VLM, Comma Body, mapless navigation, embodied intelligence, indoor robot
- Page link: https://www.zingnex.cn/en/forum/thread/authomated-assistant
- Canonical: https://www.zingnex.cn/forum/thread/authomated-assistant
- Markdown source: floors_fallback

---

## Introduction: Authomated-Assistant, a Mapless Visual Navigation Robot

Authomated-Assistant is an indoor navigation system built on the Comma Body v2 robot platform and a Vision Language Model (VLM). It navigates autonomously by recognizing visual landmarks, with no pre-built map and no lidar. This lowers the hardware barrier for robot navigation and shows what vision language models can contribute to embodied intelligence.

## Project Background and Core Challenges

Comma Body v2 is an open-source, self-balancing two-wheeled robot platform developed by the Comma.ai community, originally used for autonomous driving technology research. Applying it to indoor navigation raises several challenges: the environment changes frequently, lighting conditions are complex, and there is no GPS signal. Traditional solutions also depend on expensive sensors and tedious map construction. To address these issues, the project uses the scene understanding capability of a VLM so that the robot can navigate autonomously by recognizing landmarks.

## System Architecture and Technical Approach

Authomated-Assistant adopts a layered design. The visual perception layer runs a VLM (e.g., Moondream2) on an eGPU to locate landmarks described in natural language; detection is zero-shot, so no additional training data is required. The motion control layer uses a PID controller to adjust steering and speed. An intelligent search strategy makes the robot rotate in place to look for the target whenever it is lost, which improves robustness.
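The interaction between the motion control layer and the search strategy described above can be sketched as follows. This is a minimal illustration and not the project's actual code: the `PID` class, `navigation_step`, the gains, the normalized landmark position, and the sign convention (positive steering turns toward larger `x`) are all assumptions made for the example. The VLM is assumed to report the horizontal center of the detected landmark as a value in `[0, 1]`, or `None` when nothing is found.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PID:
    """Textbook PID controller (hypothetical helper, not from the project)."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: Optional[float] = None

    def update(self, error: float, dt: float) -> float:
        self.integral += error * dt
        # No derivative term on the first sample, since there is no history yet.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

    def reset(self) -> None:
        self.integral = 0.0
        self.prev_error = None


def clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def navigation_step(
    landmark_x: Optional[float],
    pid: PID,
    dt: float = 0.1,
    cruise_speed: float = 0.3,
    search_turn: float = 0.4,
) -> Tuple[float, float]:
    """Return (speed, steering), each in [-1, 1], for one control tick.

    landmark_x is the assumed normalized horizontal center of the landmark,
    or None when the detector did not find it in the current frame.
    """
    if landmark_x is None:
        # Search behavior: stop driving and rotate in place until the target reappears.
        pid.reset()
        return 0.0, search_turn

    # Error is the offset of the landmark from the image center.
    error = landmark_x - 0.5
    steering = clamp(pid.update(error, dt), -1.0, 1.0)
    # Slow down while turning sharply so the robot does not overshoot.
    speed = cruise_speed * (1.0 - abs(steering))
    return speed, steering


if __name__ == "__main__":
    controller = PID(kp=2.0, ki=0.0, kd=0.0)
    print(navigation_step(0.75, controller))  # landmark right of center
    print(navigation_step(None, controller))  # landmark lost: rotate to search
```

Resetting the controller when the target is lost is one possible design choice: it keeps stale integral and derivative state from producing a steering jump when the landmark is found again.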

## Advanced Features and Technology Stack

The project integrates the Google Gemini API for scene analysis and supports voice interaction through text-to-speech (TTS). The hardware consists of a Comma Four (three cameras and the main controller), the Comma Body v2 chassis, and an eGPU. On the software side, Python handles the control logic, TypeScript/JavaScript handles the web services, the frontend is built with React, Vite, and Tailwind CSS, and the backend uses Express. Cereal and BodyJim serve as middleware.

## Application Scenarios and Demo Features

The web dashboard provides task selection (e.g., navigating to a colleague's location), real-time AI logs, Gemini scene analysis, a voice mode toggle, and hardware telemetry (battery, balance, and camera status). Together these make the robot a practical office assistant prototype.
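As a rough illustration of the telemetry the dashboard displays, the sketch below packs battery, balance, and camera status into a JSON message on the Python side. The `Telemetry` class, its field names, the `"type"` key, and the camera names `cam0` to `cam2` are hypothetical; the source does not describe the project's actual message format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass
class Telemetry:
    """Hypothetical telemetry snapshot; field names are assumptions."""
    battery_percent: float
    balanced: bool
    cameras_ok: Dict[str, bool]  # camera name -> whether frames are arriving


def to_dashboard_message(snapshot: Telemetry) -> str:
    """Serialize a snapshot to a JSON string a web backend could forward."""
    return json.dumps({"type": "telemetry", **asdict(snapshot)}, sort_keys=True)


if __name__ == "__main__":
    snapshot = Telemetry(
        battery_percent=87.5,
        balanced=True,
        cameras_ok={"cam0": True, "cam1": True, "cam2": False},
    )
    print(to_dashboard_message(snapshot))
```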

## Open Source Value and Future Outlook

This project was developed for the comma_hack 6 hackathon and is fully open-source, with clear code and complete documentation. Its value lies in three areas: technical demonstration (integrating a VLM with a robot), architecture reference (layered reasoning and control), and community contribution (improvements to VLM accuracy, PID parameter tuning, and similar work are welcome). In the future, the approach could be applied to warehouse logistics, service robots, and home scenarios, helping to make robot technology more widely accessible.
