Voice-First Interface
🎙️ Multi-room voice interaction with wake word detection, speaker identification, and privacy controls
This document covers the architecture, data flow, configuration, and deployment of PiSovereign's voice-first interface.
Table of Contents
- Overview
- Architecture
- Domain Model
- Application Services
- Port Traits
- MQTT Topic Structure
- Docker Deployment
- Configuration Reference
- REST API Endpoints
- Privacy & Whisper Mode
- Troubleshooting
Overview
The voice-first interface turns PiSovereign into a hands-free AI assistant. Audio streams from satellite devices (Raspberry Pi Zero 2W, ESP32-S3, etc.) over MQTT. The server handles wake word detection, speech-to-text, LLM inference, text-to-speech, and speaker identification, all locally with no cloud dependency.
Key capabilities:
- Multi-room audio – Independent rooms with individual volume, mute, and online status via MQTT
- Wake word detection – "Sovereign" (customizable) via openWakeWord with configurable sensitivity
- Continuous conversation – Follow-up window after a response, no re-triggering needed
- Speaker identification – ECAPA-TDNN embeddings match voice to enrolled user profiles
- Whisper mode – Automatic volume reduction during quiet hours
- Privacy LED – GPIO-controlled indicator when microphone is active (Raspberry Pi only)
Architecture
Component Diagram
graph TB
subgraph Satellites["Room Satellites"]
SAT1["🎙️ Kitchen<br/>RPi Zero 2W"]
SAT2["🎙️ Office<br/>ESP32-S3"]
SAT3["🎙️ Bedroom<br/>RPi Zero 2W"]
end
subgraph Docker["Docker Stack (voice profile)"]
MQTT["Mosquitto<br/>MQTT Broker"]
OWW["openWakeWord<br/>Wake Word Detection"]
SID["Speaker-ID<br/>ECAPA-TDNN"]
APP["PiSovereign<br/>Voice Pipeline"]
STT["Whisper<br/>Speech-to-Text"]
TTS["Piper<br/>Text-to-Speech"]
LLM["Ollama<br/>LLM Inference"]
end
SAT1 & SAT2 & SAT3 -->|"Audio PCM<br/>QoS 0"| MQTT
MQTT -->|"Subscribe"| APP
APP -->|"Audio chunks"| OWW
APP -->|"Voice sample"| SID
APP -->|"Speech audio"| STT
APP -->|"Response text"| TTS
APP -->|"User query"| LLM
APP -->|"TTS audio<br/>QoS 1"| MQTT
MQTT -->|"Playback"| SAT1 & SAT2 & SAT3
Audio Pipeline Flow
sequenceDiagram
participant Sat as Room Satellite
participant MQTT as Mosquitto
participant VP as Voice Pipeline
participant WW as openWakeWord
participant STT as Whisper STT
participant LLM as Ollama
participant TTS as Piper TTS
participant SID as Speaker-ID
Sat->>MQTT: Publish audio (PCM 16-bit, 16kHz)
MQTT->>VP: Deliver audio chunk
VP->>WW: Check for wake word
WW-->>VP: Detection (word, confidence)
Note over VP: Wake word detected → start session
VP->>SID: Identify speaker (audio sample)
SID-->>VP: SpeakerMatch {id, confidence}
VP->>VP: Activate privacy LED (GPIO)
VP->>STT: Transcribe speech
STT-->>VP: Transcription text
VP->>LLM: Generate response
LLM-->>VP: Response text
VP->>TTS: Synthesize speech
TTS-->>VP: Audio PCM
VP->>MQTT: Publish response audio
MQTT->>Sat: Play response
Note over VP: Enter follow-up window (10s default)
Voice Session State Machine
stateDiagram-v2
[*] --> Listening: Wake word detected
Listening --> Processing: Speech captured
Processing --> Responding: LLM response ready
Responding --> FollowUp: TTS playback complete
FollowUp --> Listening: Follow-up speech detected
FollowUp --> Ended: Timeout (10s default)
Listening --> Ended: Max duration (5 min)
Processing --> Ended: Error / timeout
Responding --> Ended: Error
| State | Description | Accepts Input |
|---|---|---|
| Listening | Microphone active, capturing speech | ✅ |
| Processing | STT → LLM pipeline running | ❌ |
| Responding | TTS playing response audio | ❌ |
| FollowUp | Waiting for follow-up within time window | ✅ |
| Ended | Session terminated | ❌ |
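The transitions in the diagram and table above can be sketched as a small Rust state machine. This is an illustrative model, not the crate's actual types; the enum and event names are assumptions.

```rust
// Sketch of the voice session FSM described above.
// Invalid transitions leave the state unchanged.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum VoiceSessionState {
    Listening,
    Processing,
    Responding,
    FollowUp,
    Ended,
}

#[derive(Debug, Clone, Copy)]
enum VoiceEvent {
    SpeechCaptured,
    ResponseReady,
    PlaybackComplete,
    FollowUpSpeech,
    Timeout,
    Error,
}

impl VoiceSessionState {
    fn transition(self, event: VoiceEvent) -> VoiceSessionState {
        use VoiceEvent::*;
        use VoiceSessionState::*;
        match (self, event) {
            (Listening, SpeechCaptured) => Processing,
            (Processing, ResponseReady) => Responding,
            (Responding, PlaybackComplete) => FollowUp,
            (FollowUp, FollowUpSpeech) => Listening,
            // Follow-up window expired, or max session duration hit.
            (FollowUp, Timeout) | (Listening, Timeout) => Ended,
            (Processing, Error) | (Processing, Timeout) => Ended,
            (Responding, Error) => Ended,
            // Anything else is a no-op.
            (state, _) => state,
        }
    }
}
```

Modeling the FSM as a pure `(state, event) -> state` function keeps every legal transition in one place and makes illegal ones impossible to reach.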
Domain Model
Voice Room
A VoiceRoom represents a physical location with a satellite audio device.
| Field | Type | Description |
|---|---|---|
| id | RoomId (UUID) | Auto-generated unique identifier |
| name | String | Human-readable name (e.g., "Kitchen") |
| is_online | bool | Whether the satellite is sending heartbeats |
| volume | u8 | Playback volume (0–100) |
| muted | bool | Whether audio output is suppressed |
| last_seen | DateTime<Utc> | Last heartbeat timestamp |
| created_at | DateTime<Utc> | Registration timestamp |
Rooms go offline automatically when heartbeats stop arriving (configurable timeout, default 30s).
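The staleness rule is simple enough to show directly. A minimal sketch, assuming a `last_seen` timestamp per room and the default 30 s timeout (the struct and method names here are illustrative, not the crate's API):

```rust
use std::time::{Duration, Instant};

/// Illustrative room-health record: a room is stale once its last
/// heartbeat is older than the configured timeout (default 30 s).
struct RoomHealth {
    last_seen: Instant,
}

impl RoomHealth {
    fn is_stale(&self, timeout: Duration) -> bool {
        self.last_seen.elapsed() > timeout
    }
}
```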
Voice Session
A VoiceSession tracks a single voice interaction within a room.
| Field | Type | Description |
|---|---|---|
| id | VoiceSessionId (UUID) | Auto-generated session identifier |
| room_id | RoomId | Room where the session is active |
| speaker_id | Option<SpeakerId> | Identified speaker (if enrolled) |
| conversation_id | ConversationId | Links to LLM conversation context |
| state | VoiceSessionState | Current FSM state (see diagram above) |
| follow_up_window_ms | u64 | How long to wait for follow-up (default 10s) |
| exchange_count | u32 | Number of user–assistant exchanges |
Voice Profile & Speaker Identification
A VoiceProfile stores enrolled speaker embeddings for recognition.
- Embedding model: ECAPA-TDNN (SpeechBrain), 192-dimensional vectors
- Minimum enrollment: 3 audio samples for reliable identification
- Matching: Cosine similarity between embeddings, configurable threshold (default 0.75)
- Audio format: PCM 16-bit little-endian, 16 kHz, mono
graph LR
subgraph Enrollment["Speaker Enrollment (3+ samples)"]
A1["🎤 Sample 1"] --> E1["Embedding"]
A2["🎤 Sample 2"] --> E2["Embedding"]
A3["🎤 Sample 3"] --> E3["Embedding"]
end
subgraph Recognition["Runtime Recognition"]
AX["🎤 Live Audio"] --> EX["Live Embedding"]
EX --> COS["Cosine Similarity"]
E1 & E2 & E3 --> COS
COS -->|"> 0.75"| MATCH["✅ Speaker Identified"]
COS -->|"≤ 0.75"| UNKNOWN["❌ Unknown Speaker"]
end
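The matching rule above reduces to cosine similarity against the enrolled embeddings with a threshold check. A minimal sketch in Rust, assuming embeddings arrive as `f32` slices (the function names are illustrative; the real comparison runs inside the speaker-id service):

```rust
/// Cosine similarity between two speaker embeddings
/// (192-dimensional for ECAPA-TDNN).
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// A live embedding matches a profile when its similarity to any
/// enrolled sample exceeds the threshold (default 0.75).
fn is_match(live: &[f32], enrolled: &[Vec<f32>], threshold: f32) -> bool {
    enrolled.iter().any(|e| cosine_similarity(live, e) > threshold)
}
```

Requiring three or more enrollment samples gives the similarity check several reference points, which is why enrollment below the minimum is rejected.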
Application Services
VoiceRoomService
Manages room lifecycle and health monitoring.
| Method | Description |
|---|---|
| register_room(name) | Register a new room satellite |
| heartbeat(room_id) | Update last-seen timestamp, mark online |
| check_stale_rooms() | Mark rooms offline if heartbeat timeout exceeded |
| set_volume(room_id, volume) | Adjust playback volume (0–100) |
| toggle_mute(room_id) | Toggle mute state |
| list_rooms() | List all registered rooms |
| remove_room(room_id) | Unregister a room |
A background task calls check_stale_rooms() every 30 seconds.
VoiceSessionService
Manages voice session state transitions.
| Method | Description |
|---|---|
| start_or_resume_session(room_id) | Start new session or resume from FollowUp |
| mark_processing(session_id) | Transition to Processing state |
| mark_responding(session_id) | Transition to Responding state |
| enter_follow_up(session_id) | Transition to FollowUp state |
| end_session(session_id) | Terminate session |
| get_active_session(room_id) | Query current session for a room |
SpeakerEnrollmentService
Manages speaker profile enrollment and identification data.
| Method | Description |
|---|---|
| add_enrollment_sample(user_id, name, audio) | Add voice sample (creates profile if needed) |
| enrollment_status(user_id) | Check enrollment progress |
| delete_profile(speaker_id) | Remove all speaker data |
| list_profiles() | List all enrolled speakers |
| rename_profile(speaker_id, name) | Update display name |
VoicePipelineService
Orchestrates the full voice interaction pipeline (wake word → STT → LLM → TTS).
| Method | Description |
|---|---|
| process_voice_command(room_id, audio) | Full pipeline: detect speaker, transcribe, infer, synthesize |
| is_wake_word_available() | Health check for wake word service |
| is_speaker_id_available() | Health check for speaker-id service |
Port Traits
The voice subsystem defines 7 port traits in crates/application/src/ports/:
| Port | Purpose |
|---|---|
| WakeWordPort | Detect wake words in audio chunks |
| SpeakerIdentificationPort | Identify/enroll speakers from audio |
| MqttPort | Publish/subscribe to MQTT topics |
| VoiceRoomPort | Persist room entities |
| VoiceSessionPort | Persist session state |
| VoiceProfileStore | Persist speaker profiles and embeddings |
| GpioPort | Control privacy LED (RPi only) |
All ports use #[async_trait] and support mockall via #[cfg_attr(test, automock)].
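To make the ports-and-adapters shape concrete, here is a simplified, synchronous sketch of one port and a trivial test double. The real traits are async (`#[async_trait]`) and mocked via `mockall`; the struct, method, and field names below are assumptions for illustration only.

```rust
/// Result of a wake word check, per the pipeline description above.
#[derive(Debug, PartialEq)]
struct WakeDetection {
    word: String,
    confidence: f32,
}

/// Simplified, synchronous sketch of a port trait. The actual trait
/// uses `#[async_trait]` and `#[cfg_attr(test, automock)]`.
trait WakeWordPort {
    /// Returns a detection when a wake word is found in the PCM chunk.
    fn detect(&self, pcm: &[i16]) -> Option<WakeDetection>;
}

/// A hand-written stub, the kind of double `automock` would generate.
struct AlwaysDetects;

impl WakeWordPort for AlwaysDetects {
    fn detect(&self, _pcm: &[i16]) -> Option<WakeDetection> {
        Some(WakeDetection {
            word: "sovereign".into(),
            confidence: 0.9,
        })
    }
}
```

Because services depend only on the trait, the HTTP adapter talking to openWakeWord can be swapped for a stub like this in unit tests.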
MQTT Topic Structure
All topics use a configurable prefix (default: pisovereign).
| Topic Pattern | QoS | Direction | Description |
|---|---|---|---|
| {prefix}/audio/{room_id}/input | 0 | Satellite → Server | Raw PCM audio stream |
| {prefix}/audio/{room_id}/output | 1 | Server → Satellite | TTS response audio |
| {prefix}/wake/{room_id} | 1 | Server → Satellite | Wake word detection event |
| {prefix}/control/{room_id} | 1 | Bidirectional | Volume, mute, and room control commands |
| {prefix}/status/{room_id} | 0 | Satellite → Server | Heartbeat and status updates |
Audio format: PCM 16-bit little-endian, 16 kHz, mono.
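The topic scheme above is purely positional, so building and parsing topics is mechanical. A hedged sketch (the helper names are illustrative, not the crate's API):

```rust
/// Build the raw-audio input topic for a room, per the table above.
/// `prefix` defaults to "pisovereign"; `room_id` is the room's UUID string.
fn audio_input_topic(prefix: &str, room_id: &str) -> String {
    format!("{prefix}/audio/{room_id}/input")
}

/// Recover the room id from a status topic, e.g. when handling a
/// heartbeat received on `{prefix}/status/{room_id}`.
fn room_id_from_status_topic<'a>(prefix: &str, topic: &'a str) -> Option<&'a str> {
    topic.strip_prefix(prefix)?.strip_prefix("/status/")
}
```

Subscribing with a wildcard such as `{prefix}/status/+` and parsing the room id out of each incoming topic is the usual pattern for the server side.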
Docker Deployment
Services
The voice stack consists of three services, activated via the voice Docker Compose profile:
| Service | Image | Port | Memory | Purpose |
|---|---|---|---|---|
| mosquitto | eclipse-mosquitto:2 | 1883 (internal) | 128 MB | MQTT message broker |
| openwakeword | Custom (Python) | 8083 (internal) | 512 MB | Wake word detection via openWakeWord |
| speaker-id | Custom (Python) | 8084 (internal) | 512 MB | Speaker identification via SpeechBrain ECAPA-TDNN |
All services run on the internal pisovereign-network and are never exposed externally.
Starting the Voice Stack
# Start core + voice services
just docker-up # if voice profile is in COMPOSE_PROFILES
# Or explicitly with the voice profile
docker compose --profile voice up -d
# Verify services are healthy
docker compose --profile voice ps
Resource Requirements
| Platform | RAM (voice stack) | Notes |
|---|---|---|
| Raspberry Pi 5 (8 GB) | ~1.2 GB | Recommended minimum for voice + core |
| x86_64 Desktop | ~1.0 GB | Faster model loading |
The speaker-id service downloads ECAPA-TDNN models (~80 MB) on first start. Models are persisted in the speaker-id-models Docker volume.
Configuration Reference
Add to config.toml to enable the voice interface:
[voice]
enabled = true
[voice.mqtt]
broker_url = "mqtt://mosquitto:1883"
client_id = "pisovereign-voice"
keep_alive_secs = 30
max_inflight = 100
topic_prefix = "pisovereign"
[voice.wake_word]
service_url = "http://openwakeword:8083"
words = ["sovereign"]
sensitivity = 0.5 # 0.0–1.0, higher = more sensitive (more detections, more false positives)
timeout_ms = 2000
[voice.speaker_id]
service_url = "http://speaker-id:8084"
min_enrollment_samples = 3
match_threshold = 0.75 # Cosine similarity threshold
timeout_ms = 5000
[voice.conversation]
follow_up_window_ms = 10000 # 10s follow-up after response
max_session_duration_ms = 300000 # 5 minutes max
[voice.whisper_mode]
enabled = true
quiet_start = "22:00"
quiet_end = "07:00"
quiet_volume = 30 # 0–100
[voice.gpio]
enabled = false # Only on Raspberry Pi (aarch64 Linux)
privacy_led_pin = 17 # BCM GPIO pin number
[voice.rooms]
default_volume = 80 # 0–100
heartbeat_timeout_ms = 30000 # Mark offline after 30s silence
All values shown are defaults. The [voice] section is optional; when omitted, the voice subsystem is disabled.
REST API Endpoints
All voice endpoints require API key authentication.
Room Management
| Method | Path | Description |
|---|---|---|
| GET | /v1/voice/rooms | List all registered rooms |
| POST | /v1/voice/rooms | Register a new room |
| DELETE | /v1/voice/rooms/{room_id} | Remove a room |
| GET | /v1/voice/rooms/{room_id}/session | Get active session for a room |
Speaker Management
| Method | Path | Description |
|---|---|---|
| GET | /v1/voice/speakers | List enrolled speaker profiles |
| DELETE | /v1/voice/speakers/{speaker_id} | Delete a speaker profile |
Status
| Method | Path | Description |
|---|---|---|
| GET | /v1/voice/status | Voice subsystem health check |
Example: Register a Room
curl -X POST http://localhost:3000/v1/voice/rooms \
-H "Authorization: Bearer sk-your-api-key" \
-H "Content-Type: application/json" \
-d '{"name": "Kitchen"}'
Response:
{
"id": "550e8400-e29b-41d4-a716-446655440000",
"name": "Kitchen",
"is_online": false,
"volume": 80,
"muted": false,
"last_seen": "2025-01-15T10:30:00Z",
"created_at": "2025-01-15T10:30:00Z"
}
Example: Voice Status
curl http://localhost:3000/v1/voice/status \
-H "Authorization: Bearer sk-your-api-key"
Response (voice disabled):
{
"enabled": false,
"wake_word_available": false,
"speaker_id_available": false,
"mqtt_connected": false,
"active_rooms": 0,
"active_sessions": 0
}
Privacy & Whisper Mode
Privacy LED
On Raspberry Pi, a GPIO-connected LED indicates when the microphone is active:
- LED on: Audio is being captured and processed
- LED off: No active voice session
Configure the BCM GPIO pin in [voice.gpio]. Requires aarch64 Linux; on other platforms the feature compiles to a no-op.
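The compile-time gating can be sketched with `cfg` attributes: a real GPIO implementation on aarch64 Linux and a no-op elsewhere. The function name is an assumption; on a Pi the first branch would drive the configured BCM pin through a GPIO crate.

```rust
/// Drive the privacy LED on supported hardware (aarch64 Linux).
#[cfg(all(target_os = "linux", target_arch = "aarch64"))]
fn set_privacy_led(on: bool) -> Result<(), String> {
    // Placeholder: a real build would toggle the configured BCM pin here.
    let _ = on;
    Ok(())
}

/// On every other platform the same call compiles to a no-op,
/// so callers never need their own platform checks.
#[cfg(not(all(target_os = "linux", target_arch = "aarch64")))]
fn set_privacy_led(_on: bool) -> Result<(), String> {
    Ok(())
}
```

Because both variants share one signature, the voice pipeline can call `set_privacy_led` unconditionally at session start and end.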
Whisper Mode
During quiet hours (default 22:00–07:00), whisper mode automatically:
- Reduces TTS playback volume to the configured level (default 30%)
- Uses softer TTS voice parameters when available
- Restores normal volume outside quiet hours
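The subtle part of the quiet-hours check is that the default window wraps past midnight. A minimal sketch with times as minutes since midnight (function names are illustrative, not the crate's API):

```rust
/// True when `now_min` falls inside the quiet window. The window may
/// wrap past midnight (e.g. 22:00–07:00, as in the default config).
fn in_quiet_hours(now_min: u16, start_min: u16, end_min: u16) -> bool {
    if start_min <= end_min {
        // Same-day window, e.g. 13:00–15:00.
        (start_min..end_min).contains(&now_min)
    } else {
        // Wrapping window: quiet after start OR before end.
        now_min >= start_min || now_min < end_min
    }
}

/// Effective playback volume: capped at `quiet_volume` during quiet hours.
fn effective_volume(room_volume: u8, quiet_volume: u8, quiet_now: bool) -> u8 {
    if quiet_now {
        room_volume.min(quiet_volume)
    } else {
        room_volume
    }
}
```

Capping with `min` (rather than overwriting) means a room already set quieter than `quiet_volume` is left alone.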
Troubleshooting
Voice services not starting
# Check if voice profile is enabled
docker compose --profile voice ps
# Check individual service logs
docker compose --profile voice logs mosquitto
docker compose --profile voice logs openwakeword
docker compose --profile voice logs speaker-id
Wake word not detected
- Verify openWakeWord is healthy: curl http://localhost:8083/health
- Increase sensitivity (closer to 1.0) in [voice.wake_word]
- Check audio format: must be PCM 16-bit LE, 16 kHz, mono
- Check MQTT connectivity: mosquitto_sub -t 'pisovereign/audio/#' -v
Speaker not recognized
- Ensure at least 3 enrollment samples are recorded
- Lower match_threshold (default 0.75) if false negatives are high
- Re-enroll in a quiet environment for better embeddings
- Check speaker-id service health: curl http://localhost:8084/health
Room shows offline
- Check satellite heartbeat interval (must be shorter than heartbeat_timeout_ms)
- Verify MQTT topic: pisovereign/status/{room_id}
- Check Mosquitto logs for connection issues
GPIO LED not working
- Only supported on Raspberry Pi (aarch64 Linux)
- Verify BCM pin number matches physical wiring
- Check GPIO permissions (user must be in the gpio group or run as root)