
Pepper Robot Real-Time Multimodal Dialogue Framework: A Fusion Practice of End-to-End Voice Interaction and Agent Control

This article introduces an open-source Android framework that deeply integrates modern end-to-end voice large models with the Pepper humanoid robot, enabling natural language control of robot navigation, visual analysis, and interactive entertainment, and providing a complete open-source solution for human-robot interaction research.

Tags: Pepper Robot · Multimodal Interaction · End-to-End Voice Model · Realtime API · Gemini Live · Function Calling · Autonomous Navigation · Human-Robot Interaction · Open-Source Framework · Agent Control
Published 2026-04-29 21:05 · Recent activity 2026-04-29 21:20 · Estimated read 7 min

Section 01

[Introduction] Overview of the Open-Source Pepper Real-Time Multimodal Dialogue Framework

This article introduces pepper-android-realtime-chat, an open-source Android framework that deeply integrates end-to-end voice large models such as the OpenAI Realtime API and Google Gemini Live with the Pepper humanoid robot, enabling natural-language-controlled navigation, visual analysis, and interactive entertainment. The project supports deployment on both Pepper hardware and ordinary Android devices, was presented at the 2026 HRI Conference, and provides a complete open-source solution for human-robot interaction research.


Section 02

[Background] The Convergence of Humanoid Robots and Large Models, and the Project's Positioning

The combination of humanoid robots and large language models is redefining the boundaries of human-robot interaction. As a classic HRI platform, Pepper gains strong interactive capabilities when paired with modern AI technologies. This project brings end-to-end voice large models to Pepper, builds a multimodal interaction system on top of them, supports standalone Android deployment, and gives developers and researchers flexibility.


Section 03

[Technical Architecture] Dual-Mode Construction Strategy and Modern Android Tech Stack

The project adopts a dual-construction strategy:

  • Pepper Mode: integrates with NAOqi OS via QiSDK, supporting hardware features such as navigation, gestures, and sensors;
  • Standalone Mode: adapts to ordinary Android devices, simulating robot functions to lower the barrier to entry (see the build sketch after this list).

The tech stack includes Kotlin, Jetpack Compose, Hilt, and Gradle 8.13, and stays compatible with Pepper's Android 6.0 tablet (API 23).
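A minimal sketch of how such a dual-mode build could be wired up with Gradle product flavors; the flavor names and the QiSDK artifact version are assumptions, not necessarily what the project uses:

```kotlin
// app/build.gradle.kts — illustrative sketch of a dual-mode build.
// Flavor names and the QiSDK version are assumptions.
android {
    flavorDimensions += "target"
    productFlavors {
        create("pepper") {
            dimension = "target"
            minSdk = 23  // Pepper's tablet runs Android 6.0 (API 23)
        }
        create("standalone") {
            dimension = "target"  // ordinary devices; robot calls are stubbed
        }
    }
}

dependencies {
    // QiSDK only ships with the Pepper flavor.
    "pepperImplementation"("com.aldebaran:qisdk:1.7.5")
}
```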

Section 04

[Core Capabilities] Key Features of the Multimodal Interaction System

Voice Interaction

Supports models such as the OpenAI Realtime API, Azure OpenAI, xAI Grok, and Google Gemini Live, providing low-latency dialogue, multilingual support, and instant language switching.
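As a rough illustration of what such an end-to-end voice session looks like at the transport level, here is a hedged sketch of opening an OpenAI Realtime API WebSocket with OkHttp; the model name and session options are illustrative and not taken from the project:

```kotlin
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.Response
import okhttp3.WebSocket
import okhttp3.WebSocketListener

// Opens a Realtime API session; call off the Android main thread.
fun openRealtimeSession(apiKey: String): WebSocket {
    val client = OkHttpClient()
    val request = Request.Builder()
        .url("wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview")
        .addHeader("Authorization", "Bearer $apiKey")
        .addHeader("OpenAI-Beta", "realtime=v1")
        .build()
    return client.newWebSocket(request, object : WebSocketListener() {
        override fun onOpen(webSocket: WebSocket, response: Response) {
            // Configure the session for speech in/out (illustrative options).
            webSocket.send(
                """{"type":"session.update","session":{"modalities":["audio","text"],"voice":"alloy"}}"""
            )
        }

        override fun onMessage(webSocket: WebSocket, text: String) {
            // Server events (audio deltas, transcripts, tool calls) arrive here.
        }
    })
}
```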

Autonomous Navigation

Integrates room mapping and autonomous navigation, supporting natural language commands (e.g., "Move forward 2 meters") and intelligent approach toward detected targets.
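Agent control of this kind presumably rests on function calling (one of the project's tags): the voice model emits a structured tool call, and the app maps it to a robot action. A hypothetical sketch, in which the tool name, schema, and the moveForward helper are all illustrative:

```kotlin
// Hypothetical tool schema to register with the voice model; not the
// project's actual definition.
val moveTool = """
{
  "type": "function",
  "name": "move_forward",
  "description": "Move the robot forward by a given distance in meters",
  "parameters": {
    "type": "object",
    "properties": { "meters": { "type": "number" } },
    "required": ["meters"]
  }
}
""".trimIndent()

// Stub standing in for the QiSDK motion code (hypothetical helper).
fun moveForward(meters: Double) = println("Robot moves forward $meters m")

// When the model emits a function call, dispatch it to a robot action.
fun handleToolCall(name: String, args: Map<String, Any?>) {
    when (name) {
        "move_forward" -> moveForward((args["meters"] as Number).toDouble())
        else -> Unit  // ignore tools we don't recognize
    }
}
```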

Visual Analysis

The robot can adjust its head posture to capture images and analyze the environment via vision-capable large models; with Gemini Live, it also supports dynamic perception over a real-time video stream.
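A hedged sketch of sending a captured frame to a vision-capable model; the endpoint, model name, and prompt are illustrative, not the project's exact flow:

```kotlin
import android.util.Base64
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody

// Asks a vision model to describe a JPEG frame; call off the main thread.
fun describeImage(apiKey: String, jpegBytes: ByteArray): String {
    val b64 = Base64.encodeToString(jpegBytes, Base64.NO_WRAP)
    val body = """
    {
      "model": "gpt-4o-mini",
      "messages": [{
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe what the robot sees."},
          {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,$b64"}}
        ]
      }]
    }
    """.trimIndent().toRequestBody("application/json".toMediaType())

    val request = Request.Builder()
        .url("https://api.openai.com/v1/chat/completions")
        .addHeader("Authorization", "Bearer $apiKey")
        .post(body)
        .build()
    OkHttpClient().newCall(request).execute().use { resp ->
        return resp.body?.string() ?: ""  // parse choices[0].message.content in practice
    }
}
```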

Tactile Interaction

Responds to touch events from sensors on the head, hands, etc., triggering natural dialogue responses.
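On the QiSDK side, touch events come from named sensors. A minimal sketch, assuming a valid QiContext obtained from the robot lifecycle callbacks (the callback wiring is omitted):

```kotlin
import com.aldebaran.qi.sdk.QiContext
import com.aldebaran.qi.sdk.`object`.touch.TouchSensor

// Reacts to Pepper's head touch sensor.
fun listenForHeadTouch(qiContext: QiContext, onTouched: () -> Unit) {
    val touch = qiContext.touch                         // Touch service
    val head: TouchSensor = touch.getSensor("Head/Touch")
    head.addOnStateChangedListener { state ->
        // Trigger a dialogue response on contact, not on release.
        if (state.touched) onTouched()
    }
}
```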


Section 05

[Agent Control] From Dialogue to Action

Navigation and Mapping

Supports room mapping, saving named locations (e.g., "Save as kitchen"), and fuzzy-matching error correction for transcribed names (e.g., correcting "dormitory" to "doorway").
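A hedged sketch of how such error correction could work, using plain Levenshtein distance against the saved location names; the project's actual strategy may differ:

```kotlin
// Classic dynamic-programming edit distance between two strings.
fun levenshtein(a: String, b: String): Int {
    val dp = Array(a.length + 1) { IntArray(b.length + 1) }
    for (i in 0..a.length) dp[i][0] = i
    for (j in 0..b.length) dp[0][j] = j
    for (i in 1..a.length) for (j in 1..b.length) {
        val cost = if (a[i - 1] == b[j - 1]) 0 else 1
        dp[i][j] = minOf(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    }
    return dp[a.length][b.length]
}

// Pick the closest saved location if it is within an edit-distance budget.
fun matchLocation(heard: String, saved: Set<String>, maxDistance: Int = 2): String? =
    saved.minByOrNull { levenshtein(heard.lowercase(), it.lowercase()) }
        ?.takeIf { levenshtein(heard.lowercase(), it.lowercase()) <= maxDistance }
```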

Gaze Control

Precisely controls head posture via natural language commands (e.g., "Look 2 meters to the left and 1 meter up").
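One plausible way to turn such a command into head angles, assuming a simple forward/left/up frame; the actual project presumably works with QiSDK frames, so this is only a geometric sketch:

```kotlin
import kotlin.math.atan2

data class HeadAngles(val yawRad: Double, val pitchRad: Double)

// Convert a Cartesian gaze target (meters) into yaw/pitch angles.
fun gazeAngles(forward: Double, left: Double, up: Double): HeadAngles {
    val yaw = atan2(left, forward)   // rotate toward the lateral offset
    val pitch = atan2(up, forward)   // tilt toward the vertical offset
    return HeadAngles(yaw, pitch)
}

// "Look 2 meters to the left and 1 meter up", referenced 1 m ahead:
// gazeAngles(forward = 1.0, left = 2.0, up = 1.0)
```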

Event Rule Engine

Configures perception events to trigger interactions (e.g., greeting when a person approaches), supporting conditional filtering and dynamic template variables.
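A minimal sketch of what such a rule could look like; the event names, payload fields, and {placeholder} template syntax are assumptions:

```kotlin
// One event→action rule with a conditional filter and a response template.
data class Rule(
    val event: String,                            // e.g. "person.approached"
    val condition: (Map<String, Any>) -> Boolean, // filter on the event payload
    val template: String                          // text with {key} placeholders
)

// Returns the rendered response, or null if the rule does not fire.
fun fire(rule: Rule, event: String, payload: Map<String, Any>): String? {
    if (event != rule.event || !rule.condition(payload)) return null
    // Substitute {key} placeholders with values from the event payload.
    return Regex("""\{(\w+)}""").replace(rule.template) { m ->
        payload[m.groupValues[1]]?.toString() ?: m.value
    }
}

// Example: greet only when a person is within 1.5 meters.
val greetRule = Rule(
    event = "person.approached",
    condition = { (it["distance"] as? Double ?: Double.MAX_VALUE) < 1.5 },
    template = "Hello! I see you {distance} meters away."
)
```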

Interactive Applications

Includes built-in voice-controlled games such as Tic-Tac-Toe and Memory Flip Card, a dynamic quiz generator, and practical tools such as real-time search and weather queries.


Section 06

[Perception System] Human Perception Dashboard and Privacy Compliance

The project ships a custom human perception system that detects people in the field of view in real time, providing tracking IDs, distance estimation, gaze detection, and facial recognition (processed locally, compliant with GDPR/CCPA). The visual dashboard includes a personnel list, a radar view, and a facial database management interface.
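An illustrative data model for what the dashboard tracks per person; all field names here are assumptions, not the project's API:

```kotlin
import kotlin.math.cos
import kotlin.math.sin

// One detected person, as the dashboard might represent them.
data class PerceivedHuman(
    val trackingId: Int,           // stable ID while the person stays in view
    val distanceMeters: Double,    // estimated distance from the robot
    val isLookingAtRobot: Boolean, // gaze judgment
    val faceId: String?            // locally stored face match, if any
)

// Radar view: project a person onto a top-down 2D plane around the robot.
fun radarPosition(distanceMeters: Double, bearingRad: Double): Pair<Double, Double> =
    Pair(distanceMeters * cos(bearingRad), distanceMeters * sin(bearingRad))
```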


Section 07

[Development and Deployment] Convenient Development Experience and Multi-Scenario Support

Deployment is straightforward: clone the repository → configure API keys → select a build mode → deploy to a Pepper or an ordinary Android device via ADB. The framework supports multiple API providers, including OpenAI Direct, Azure OpenAI, xAI Grok, and Google Gemini, and offers a one-click Docker + SSH deployment solution for the local facial recognition server.


Section 08

[Conclusion] How the Open-Source Ecosystem Advances Human-Robot Interaction Innovation

The pepper-android-realtime-chat project offers an experimental platform for HRI researchers, a multimodal AI development case study for developers, and an innovative tool for educators. Its open-source foundation will enable more applications that integrate AI and robotics, advancing the field of human-robot interaction.