# Hazelnut-Vox: A Fully Local STT-LLM-TTS Voice Conversation Agent

> Hazelnut-Vox is a fully locally-run interactive voice agent that implements a complete STT-LLM-TTS pipeline, integrating Whisper speech recognition, Ollama large language model, and Coqui TTS speech synthesis. It supports real-time audio spectrum analysis and Polish language interaction.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-05-23T11:40:52.000Z
- Last activity: 2026-05-23T11:52:33.408Z
- Popularity: 159.8
- Keywords: speech recognition, speech synthesis, large language model, Whisper, Ollama, TTS, local AI, voice assistant
- Page link: https://www.zingnex.cn/en/forum/thread/hazelnut-vox-stt-llm-tts
- Canonical: https://www.zingnex.cn/forum/thread/hazelnut-vox-stt-llm-tts

---

## Hazelnut-Vox: Introduction to the Fully Local STT-LLM-TTS Voice Conversation Agent

Because every stage of the STT-LLM-TTS pipeline runs on the local machine, Hazelnut-Vox keeps voice data private, works offline with low latency, and can use CUDA acceleration for better performance. These properties make it useful in both educational and privacy-sensitive scenarios.

## Project Background and Origin

The project started as a module for the 'Orzech' robot and later became an independent project for the course 'Artificial Intelligence in Physical Signal Processing'. The name 'Hazelnut-Vox' (Hazelnut Voice) refers back to 'Orzech' (Polish for 'nut') and to the project's focus on voice interaction. Its academic origin explains the emphasis on signal processing visualization: the project serves both as an integrated demonstration of AI voice technology and as a practical platform for signal processing theory.

## Details of the STT-LLM-TTS Pipeline Architecture

### Speech Recognition (STT)
Speech recognition is based on the turbo variant of OpenAI Whisper, which balances accuracy and speed. This stage handles audio preprocessing and log-Mel spectrogram generation, and converts multilingual speech to text.
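
A minimal sketch of this stage, assuming the `openai-whisper` Python package and its `load_model` / `transcribe` calls. The function names, the `"pl"` default, and the injectable `model` argument are illustrative assumptions, not taken from the project's code.

```python
def load_whisper(name="turbo"):
    """Load a Whisper checkpoint (requires the openai-whisper package)."""
    # Imported lazily so the rest of the sketch has no hard dependency.
    import whisper
    return whisper.load_model(name)


def transcribe(audio_path, model=None, language="pl"):
    """Return the recognized text for one audio file.

    `model` is any object with a Whisper-style `transcribe` method;
    when omitted, the turbo checkpoint is loaded (assumed default).
    """
    if model is None:
        model = load_whisper()
    # Whisper resamples the audio and builds the log-Mel spectrogram internally.
    result = model.transcribe(audio_path, language=language)
    return result["text"].strip()
```

Passing the model in as an argument lets the checkpoint be loaded once and reused across turns, which matters because loading is far slower than transcribing a short utterance.
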
### Large Language Model (LLM)
The llama3.2:1b model runs locally via Ollama. A custom parser removes reasoning tags from the model output so that only clean natural language is passed on to speech synthesis.
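
The parsing step could look roughly like the sketch below. It assumes the labels to remove are `<think>...</think>` blocks, which is a guess about the tag format; the function names are hypothetical. The request goes to Ollama's local `/api/generate` endpoint on the default port.

```python
import json
import re
import urllib.request

# Assumed tag format: the project's actual labels may differ.
_REASONING = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)


def strip_reasoning(text):
    """Drop reasoning blocks and surrounding whitespace from model output."""
    return _REASONING.sub("", text).strip()


def ask_ollama(prompt, model="llama3.2:1b", host="http://localhost:11434"):
    """Send one prompt to a local Ollama server and return the cleaned reply."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return strip_reasoning(body["response"])
```

Cleaning the text before synthesis matters here because anything left in the string would otherwise be read aloud by the TTS stage.
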
### Speech Synthesis (TTS)
Speech synthesis uses the Coqui TTS framework with the Polish VITS model (`tts_models/pl/mai_female/vits`), which enables interaction in a language other than English.
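
As a rough illustration, the synthesis step can be wrapped as follows. The sketch assumes the Coqui `TTS` package and its `TTS.api.TTS` class with `tts_to_file`; the wrapper names and the empty-text check are assumptions added for the example.

```python
def load_tts(model_name="tts_models/pl/mai_female/vits"):
    """Load the Polish VITS voice (requires the Coqui TTS package)."""
    # Imported lazily so the wrapper below can be used with any engine.
    from TTS.api import TTS
    return TTS(model_name=model_name)


def synthesize(text, out_path, engine=None):
    """Write `text` as speech to `out_path` and return the path.

    `engine` is any object with a Coqui-style `tts_to_file` method;
    when omitted, the Polish VITS model is loaded (assumed default).
    """
    text = text.strip()
    if not text:
        raise ValueError("nothing to synthesize")
    if engine is None:
        engine = load_tts()
    engine.tts_to_file(text=text, file_path=out_path)
    return out_path
```
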

## Privacy and Performance Advantages of Local Operation

- **Privacy Protection**: All components run locally; no need to upload voice data to the cloud, preventing third parties from accessing conversation content.
- **Offline Availability**: No network dependency, low latency, suitable for scenarios with poor network conditions or high reliability requirements.
- **GPU Acceleration**: Supports CUDA; the Whisper and TTS models can run on NVIDIA GPUs to speed up processing, which can bring responsiveness close to that of cloud services.
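
Device selection for the GPU path is typically a one-line check. The helper below is a sketch that assumes PyTorch is the backend (both Whisper and Coqui TTS are built on it); the name `pick_device` is hypothetical, and the CPU fallback when PyTorch is absent is a choice made for the example.

```python
def pick_device():
    """Return "cuda" when an NVIDIA GPU is usable, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to accelerate; fall back to CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

The result can be passed to `whisper.load_model(name, device=pick_device())`, so the same script runs on machines with or without a GPU.
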

## Real-Time Audio Processing and Noise Adaptation Capabilities

- Captures audio using the speech_recognition library, dynamically adjusts energy thresholds to adapt to environmental noise, and works stably from quiet offices to noisy spaces.
- The process includes noise detection, voice activity detection, and audio buffer management, automatically establishing a noise baseline and detecting voice activity exceeding the threshold.
- Enables a natural conversation experience where users can interact without pressing buttons or waiting for a prompt tone.
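
The idea behind the adaptive threshold can be shown with a simplified, pure-Python sketch. It is not the `speech_recognition` library's internal algorithm (that library exposes this behavior through `Recognizer.adjust_for_ambient_noise` and `dynamic_energy_threshold`); the function names and the parameters `calibration_frames`, `ratio`, and `damping` are assumptions chosen for illustration.

```python
import math


def rms(frame):
    """Root-mean-square energy of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_voice(frames, calibration_frames=5, ratio=1.5, damping=0.9):
    """Return one boolean per frame: True where energy exceeds the threshold.

    The first `calibration_frames` frames establish the noise baseline.
    Afterwards the baseline keeps tracking the energy of non-speech frames,
    so the threshold follows slow changes in background noise.
    """
    if len(frames) <= calibration_frames:
        raise ValueError("need more frames than the calibration window")
    baseline = sum(rms(f) for f in frames[:calibration_frames]) / calibration_frames
    flags = [False] * calibration_frames
    for frame in frames[calibration_frames:]:
        energy = rms(frame)
        is_voice = energy > baseline * ratio
        if not is_voice:
            # Update the baseline only on noise, never on speech.
            baseline = damping * baseline + (1 - damping) * energy
        flags.append(is_voice)
    return flags
```

For example, six quiet frames followed by two loud ones and two more quiet ones yield `False` for every quiet frame and `True` for the two loud frames.
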

## Signal Analysis and Visualization Features

- Generates time-domain waveforms (showing signal changes over time) and spectrograms (revealing frequency domain distribution) for both user voice and AI-synthesized voice.
- Uses matplotlib and scipy to generate visualization reports, intuitively displaying signal characteristics.
- Has significant educational value, helping students understand the principles of voice synthesis and signal processing theory.
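
A report of this kind could be produced along the following lines. This is a sketch assuming `scipy.signal.spectrogram`, `scipy.io.wavfile`, NumPy, and matplotlib; the function names, the synthetic test tone, and the figure layout are illustrative and not taken from the project.

```python
import math
import struct
import wave


def write_tone(path, freq_hz=440.0, seconds=1.0, rate=16000):
    """Write a mono 16-bit sine tone to `path`; return the sample count."""
    n = int(seconds * rate)
    samples = (
        int(0.5 * 32767 * math.sin(2 * math.pi * freq_hz * i / rate))
        for i in range(n)
    )
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"".join(struct.pack("<h", s) for s in samples))
    return n


def plot_report(wav_path, png_path):
    """Save a waveform plot and a spectrogram of `wav_path` to `png_path`."""
    # Third-party imports stay local so write_tone works without them.
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt
    import numpy as np
    from scipy import signal
    from scipy.io import wavfile

    rate, data = wavfile.read(wav_path)
    if data.ndim > 1:
        data = data[:, 0]  # keep the first channel of multi-channel audio
    data = data.astype(np.float64)
    times = np.arange(len(data)) / rate
    freqs, seg_times, power = signal.spectrogram(data, fs=rate)

    fig, (ax_wave, ax_spec) = plt.subplots(2, 1, figsize=(8, 6))
    ax_wave.plot(times, data)
    ax_wave.set(xlabel="Time [s]", ylabel="Amplitude", title="Waveform")
    # Power in decibels; the small offset avoids log(0) on silent segments.
    ax_spec.pcolormesh(seg_times, freqs, 10 * np.log10(power + 1e-12), shading="auto")
    ax_spec.set(xlabel="Time [s]", ylabel="Frequency [Hz]", title="Spectrogram")
    fig.tight_layout()
    fig.savefig(png_path)
    plt.close(fig)
```

Running `write_tone("tone.wav")` followed by `plot_report("tone.wav", "tone.png")` gives a quick check: a pure tone should show up as a single horizontal band in the spectrogram at the chosen frequency.
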

## Application Scenarios and Expansion Potential

- **Application Scenarios**: Smart home control, information query, voice notes; privacy-sensitive scenarios (medical consultation, legal dialogue); educational practice platform (voice technology course experiments).
- **Expansion Directions**: Replace models to support multiple languages; upgrade LLM to improve conversation intelligence; add conversation state management; optimize long-text synthesis, etc.

## Technical Limitations and Improvement Suggestions

- **Limitations**: Mainly supports Polish; simple conversation management (no long-term memory); synthesis naturalness lags behind commercial services.
- **Improvement Directions**: Add multi-language model switching; introduce conversation state management; optimize long-text synthesis effects; add voice wake-up function, etc.