# Voice-First Interface

πŸŽ™οΈ Multi-room voice interaction with wake word detection, speaker identification, and privacy controls

This document covers the architecture, data flow, configuration, and deployment of PiSovereign’s voice-first interface.

## Overview

The voice-first interface turns PiSovereign into a hands-free AI assistant. Audio streams from satellite devices (Raspberry Pi Zero 2W, ESP32-S3, etc.) over MQTT. The server handles wake word detection, speech-to-text, LLM inference, text-to-speech, and speaker identification β€” all locally, with no cloud dependency.

Key capabilities:

- **Multi-room audio** — Independent rooms with individual volume, mute, and online status via MQTT
- **Wake word detection** — “Sovereign” (customizable) via openWakeWord with configurable sensitivity
- **Continuous conversation** — Follow-up window after a response, no re-triggering needed
- **Speaker identification** — ECAPA-TDNN embeddings match voice to enrolled user profiles
- **Whisper mode** — Automatic volume reduction during quiet hours
- **Privacy LED** — GPIO-controlled indicator when microphone is active (Raspberry Pi only)

## Architecture

### Component Diagram

```mermaid
graph TB
    subgraph Satellites["Room Satellites"]
        SAT1["🎙️ Kitchen<br/>RPi Zero 2W"]
        SAT2["🎙️ Office<br/>ESP32-S3"]
        SAT3["🎙️ Bedroom<br/>RPi Zero 2W"]
    end

    subgraph Docker["Docker Stack (voice profile)"]
        MQTT["Mosquitto<br/>MQTT Broker"]
        OWW["openWakeWord<br/>Wake Word Detection"]
        SID["Speaker-ID<br/>ECAPA-TDNN"]
        APP["PiSovereign<br/>Voice Pipeline"]
        STT["Whisper<br/>Speech-to-Text"]
        TTS["Piper<br/>Text-to-Speech"]
        LLM["Ollama<br/>LLM Inference"]
    end

    SAT1 & SAT2 & SAT3 -->|"Audio PCM<br/>QoS 0"| MQTT
    MQTT -->|"Subscribe"| APP
    APP -->|"Audio chunks"| OWW
    APP -->|"Voice sample"| SID
    APP -->|"Speech audio"| STT
    APP -->|"Response text"| TTS
    APP -->|"User query"| LLM
    APP -->|"TTS audio<br/>QoS 1"| MQTT
    MQTT -->|"Playback"| SAT1 & SAT2 & SAT3
```

### Audio Pipeline Flow

```mermaid
sequenceDiagram
    participant Sat as Room Satellite
    participant MQTT as Mosquitto
    participant VP as Voice Pipeline
    participant WW as openWakeWord
    participant STT as Whisper STT
    participant LLM as Ollama
    participant TTS as Piper TTS
    participant SID as Speaker-ID

    Sat->>MQTT: Publish audio (PCM 16-bit, 16kHz)
    MQTT->>VP: Deliver audio chunk
    VP->>WW: Check for wake word
    WW-->>VP: Detection (word, confidence)

    Note over VP: Wake word detected → start session

    VP->>SID: Identify speaker (audio sample)
    SID-->>VP: SpeakerMatch {id, confidence}
    VP->>VP: Activate privacy LED (GPIO)
    VP->>STT: Transcribe speech
    STT-->>VP: Transcription text
    VP->>LLM: Generate response
    LLM-->>VP: Response text
    VP->>TTS: Synthesize speech
    TTS-->>VP: Audio PCM
    VP->>MQTT: Publish response audio
    MQTT->>Sat: Play response

    Note over VP: Enter follow-up window (10s default)
```

### Voice Session State Machine

```mermaid
stateDiagram-v2
    [*] --> Listening: Wake word detected
    Listening --> Processing: Speech captured
    Processing --> Responding: LLM response ready
    Responding --> FollowUp: TTS playback complete
    FollowUp --> Listening: Follow-up speech detected
    FollowUp --> Ended: Timeout (10s default)
    Listening --> Ended: Max duration (5 min)
    Processing --> Ended: Error / timeout
    Responding --> Ended: Error
```

| State | Description | Accepts Input |
|---|---|---|
| Listening | Microphone active, capturing speech | ✅ |
| Processing | STT → LLM pipeline running | ❌ |
| Responding | TTS playing response audio | ❌ |
| FollowUp | Waiting for follow-up within time window | ✅ |
| Ended | Session terminated | ❌ |
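
The transition table above can be sketched as a small state machine. This is an illustrative model, not the crate's actual types: the real `VoiceSessionState` lives in the domain layer, and the event names here are assumptions.

```rust
// Illustrative FSM for a voice session; event names are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum VoiceSessionState {
    Listening,
    Processing,
    Responding,
    FollowUp,
    Ended,
}

#[derive(Debug, Clone, Copy)]
enum SessionEvent {
    SpeechCaptured,
    ResponseReady,
    PlaybackComplete,
    FollowUpSpeech,
    Timeout,
    Error,
}

fn transition(state: VoiceSessionState, event: SessionEvent) -> VoiceSessionState {
    use SessionEvent::*;
    use VoiceSessionState::*;
    match (state, event) {
        (Listening, SpeechCaptured) => Processing,
        (Processing, ResponseReady) => Responding,
        (Responding, PlaybackComplete) => FollowUp,
        (FollowUp, FollowUpSpeech) => Listening,
        // Any timeout or error terminates the session.
        (_, Timeout) | (_, Error) => Ended,
        // All other event/state combinations leave the state unchanged.
        (s, _) => s,
    }
}
```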

## Domain Model

### Voice Room

A `VoiceRoom` represents a physical location with a satellite audio device.

| Field | Type | Description |
|---|---|---|
| `id` | `RoomId` (UUID) | Auto-generated unique identifier |
| `name` | `String` | Human-readable name (e.g., “Kitchen”) |
| `is_online` | `bool` | Whether the satellite is sending heartbeats |
| `volume` | `u8` | Playback volume (0–100) |
| `muted` | `bool` | Whether audio output is suppressed |
| `last_seen` | `DateTime<Utc>` | Last heartbeat timestamp |
| `created_at` | `DateTime<Utc>` | Registration timestamp |

Rooms go offline automatically when heartbeats stop arriving (configurable timeout, default 30s).
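
The offline rule can be sketched as a pure function over heartbeat timestamps. Field and function names below are assumptions; the real `VoiceRoomService` is async and persists rooms.

```rust
// Illustrative staleness check; timestamps are epoch milliseconds to mirror
// the heartbeat_timeout_ms config value.
struct Room {
    name: String,
    is_online: bool,
    last_seen_ms: u64,
}

/// Mark rooms offline when their last heartbeat is older than the timeout
/// (default 30 s). Returns how many rooms were marked offline.
fn check_stale_rooms(rooms: &mut [Room], timeout_ms: u64, now_ms: u64) -> usize {
    let mut marked = 0;
    for room in rooms.iter_mut() {
        if room.is_online && now_ms.saturating_sub(room.last_seen_ms) > timeout_ms {
            room.is_online = false;
            marked += 1;
        }
    }
    marked
}
```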

### Voice Session

A `VoiceSession` tracks a single voice interaction within a room.

| Field | Type | Description |
|---|---|---|
| `id` | `VoiceSessionId` (UUID) | Auto-generated session identifier |
| `room_id` | `RoomId` | Room where the session is active |
| `speaker_id` | `Option<SpeakerId>` | Identified speaker (if enrolled) |
| `conversation_id` | `ConversationId` | Links to LLM conversation context |
| `state` | `VoiceSessionState` | Current FSM state (see diagram above) |
| `follow_up_window_ms` | `u64` | How long to wait for follow-up (default 10s) |
| `exchange_count` | `u32` | Number of user↔assistant exchanges |

### Voice Profile & Speaker Identification

A `VoiceProfile` stores enrolled speaker embeddings for recognition.

- Embedding model: ECAPA-TDNN (SpeechBrain), 192-dimensional vectors
- Minimum enrollment: 3 audio samples for reliable identification
- Matching: Cosine similarity between embeddings, configurable threshold (default 0.75)
- Audio format: PCM 16-bit little-endian, 16 kHz, mono

```mermaid
graph LR
    subgraph Enrollment["Speaker Enrollment (3+ samples)"]
        A1["🎤 Sample 1"] --> E1["Embedding"]
        A2["🎤 Sample 2"] --> E2["Embedding"]
        A3["🎤 Sample 3"] --> E3["Embedding"]
    end

    subgraph Recognition["Runtime Recognition"]
        AX["🎤 Live Audio"] --> EX["Live Embedding"]
        EX --> COS["Cosine Similarity"]
        E1 & E2 & E3 --> COS
        COS -->|"> 0.75"| MATCH["✅ Speaker Identified"]
        COS -->|"≤ 0.75"| UNKNOWN["❓ Unknown Speaker"]
    end
```
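
The matching step amounts to cosine similarity against each enrolled embedding. A minimal sketch, with toy 3-dimensional vectors standing in for the 192-dimensional ECAPA-TDNN embeddings and hypothetical helper names:

```rust
// Cosine similarity between two embedding vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Index of the best-matching enrolled embedding, if any clears the
/// threshold (default 0.75); otherwise the speaker is unknown.
fn best_match(live: &[f32], enrolled: &[Vec<f32>], threshold: f32) -> Option<usize> {
    enrolled
        .iter()
        .enumerate()
        .map(|(i, e)| (i, cosine_similarity(live, e)))
        .filter(|(_, sim)| *sim > threshold)
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(i, _)| i)
}
```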

## Application Services

### VoiceRoomService

Manages room lifecycle and health monitoring.

| Method | Description |
|---|---|
| `register_room(name)` | Register a new room satellite |
| `heartbeat(room_id)` | Update last-seen timestamp, mark online |
| `check_stale_rooms()` | Mark rooms offline if heartbeat timeout exceeded |
| `set_volume(room_id, volume)` | Adjust playback volume (0–100) |
| `toggle_mute(room_id)` | Toggle mute state |
| `list_rooms()` | List all registered rooms |
| `remove_room(room_id)` | Unregister a room |

A background task calls `check_stale_rooms()` every 30 seconds.

### VoiceSessionService

Manages voice session state transitions.

| Method | Description |
|---|---|
| `start_or_resume_session(room_id)` | Start new session or resume from FollowUp |
| `mark_processing(session_id)` | Transition to Processing state |
| `mark_responding(session_id)` | Transition to Responding state |
| `enter_follow_up(session_id)` | Transition to FollowUp state |
| `end_session(session_id)` | Terminate session |
| `get_active_session(room_id)` | Query current session for a room |

### SpeakerEnrollmentService

Manages speaker profile enrollment and identification data.

| Method | Description |
|---|---|
| `add_enrollment_sample(user_id, name, audio)` | Add voice sample (creates profile if needed) |
| `enrollment_status(user_id)` | Check enrollment progress |
| `delete_profile(speaker_id)` | Remove all speaker data |
| `list_profiles()` | List all enrolled speakers |
| `rename_profile(speaker_id, name)` | Update display name |

### VoicePipelineService

Orchestrates the full voice interaction pipeline (wake word → STT → LLM → TTS).

| Method | Description |
|---|---|
| `process_voice_command(room_id, audio)` | Full pipeline: detect speaker, transcribe, infer, synthesize |
| `is_wake_word_available()` | Health check for wake word service |
| `is_speaker_id_available()` | Health check for speaker-id service |
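
Conceptually, `process_voice_command` chains the four stages in order. The sketch below is a deliberately simplified, synchronous stand-in: every stage function is a hypothetical stub, while the real service calls out asynchronously through the port traits.

```rust
// Hypothetical, simplified view of the pipeline stages; not the crate's API.
struct VoiceResponse {
    speaker: Option<String>,
    transcript: String,
    audio: Vec<u8>,
}

// Stand-in stage functions; the real ones are remote service calls.
fn identify_speaker(_audio: &[u8]) -> Option<String> { Some("alice".to_string()) }
fn transcribe(_audio: &[u8]) -> String { "what time is it".to_string() }
fn infer(transcript: &str) -> String { format!("You asked: {transcript}") }
fn synthesize(text: &str) -> Vec<u8> { text.as_bytes().to_vec() }

fn process_voice_command(audio: &[u8]) -> VoiceResponse {
    let speaker = identify_speaker(audio); // Speaker-ID (ECAPA-TDNN)
    let transcript = transcribe(audio);    // Whisper STT
    let reply = infer(&transcript);        // Ollama LLM
    let audio = synthesize(&reply);        // Piper TTS
    VoiceResponse { speaker, transcript, audio }
}
```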

## Port Traits

The voice subsystem defines 7 port traits in `crates/application/src/ports/`:

| Port | Purpose |
|---|---|
| `WakeWordPort` | Detect wake words in audio chunks |
| `SpeakerIdentificationPort` | Identify/enroll speakers from audio |
| `MqttPort` | Publish/subscribe to MQTT topics |
| `VoiceRoomPort` | Persist room entities |
| `VoiceSessionPort` | Persist session state |
| `VoiceProfileStore` | Persist speaker profiles and embeddings |
| `GpioPort` | Control privacy LED (RPi only) |

All ports use `#[async_trait]` and support mockall via `#[cfg_attr(test, automock)]`.
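
A port trait might look roughly like the following. This is an illustrative shape only: the real traits are async via `#[async_trait]`, and everything except the `WakeWordPort` name is an assumption. It is shown synchronously so the sketch needs no external crates.

```rust
// Illustrative port trait plus a trivial test double, the kind of
// stand-in mockall would generate in tests. All names beyond
// WakeWordPort are hypothetical.
#[derive(Debug, PartialEq)]
struct WakeWordDetection {
    word: String,
    confidence: f32,
}

trait WakeWordPort {
    /// Scan an audio chunk (PCM 16-bit LE, 16 kHz, mono) for a wake word.
    fn detect(&self, audio: &[u8]) -> Option<WakeWordDetection>;
}

struct AlwaysDetects;

impl WakeWordPort for AlwaysDetects {
    fn detect(&self, _audio: &[u8]) -> Option<WakeWordDetection> {
        Some(WakeWordDetection { word: "sovereign".to_string(), confidence: 0.9 })
    }
}
```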


## MQTT Topic Structure

All topics use a configurable prefix (default: `pisovereign`).

| Topic Pattern | QoS | Direction | Description |
|---|---|---|---|
| `{prefix}/audio/{room_id}/input` | 0 | Satellite → Server | Raw PCM audio stream |
| `{prefix}/audio/{room_id}/output` | 1 | Server → Satellite | TTS response audio |
| `{prefix}/wake/{room_id}` | 1 | Server → Satellite | Wake word detection event |
| `{prefix}/control/{room_id}` | 1 | Bidirectional | Volume, mute, and room control commands |
| `{prefix}/status/{room_id}` | 0 | Satellite → Server | Heartbeat and status updates |

Audio format: PCM 16-bit little-endian, 16 kHz, mono.
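
Topic strings follow the patterns above. A small sketch of building and parsing them (hypothetical helper names, not the crate's actual API):

```rust
// Hypothetical helpers for the topic patterns documented above.
fn audio_input_topic(prefix: &str, room_id: &str) -> String {
    format!("{prefix}/audio/{room_id}/input")
}

fn status_topic(prefix: &str, room_id: &str) -> String {
    format!("{prefix}/status/{room_id}")
}

/// Parse a room id back out of a status topic, e.g. when dispatching
/// heartbeats. Returns None if the topic doesn't match `{prefix}/status/+`.
fn room_from_status_topic<'a>(prefix: &str, topic: &'a str) -> Option<&'a str> {
    let rest = topic.strip_prefix(prefix)?.strip_prefix("/status/")?;
    // Room ids contain no further slashes.
    if rest.is_empty() || rest.contains('/') { None } else { Some(rest) }
}
```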


## Docker Deployment

### Services

The voice stack consists of three services, activated via the `voice` Docker Compose profile:

| Service | Image | Port | Memory | Purpose |
|---|---|---|---|---|
| mosquitto | `eclipse-mosquitto:2` | 1883 (internal) | 128 MB | MQTT message broker |
| openwakeword | Custom (Python) | 8083 (internal) | 512 MB | Wake word detection via openWakeWord |
| speaker-id | Custom (Python) | 8084 (internal) | 512 MB | Speaker identification via SpeechBrain ECAPA-TDNN |

All services run on the internal `pisovereign-network` and are never exposed externally.

### Starting the Voice Stack

```bash
# Start core + voice services
just docker-up  # if voice profile is in COMPOSE_PROFILES

# Or explicitly with the voice profile
docker compose --profile voice up -d

# Verify services are healthy
docker compose --profile voice ps
```

### Resource Requirements

| Platform | RAM (voice stack) | Notes |
|---|---|---|
| Raspberry Pi 5 (8 GB) | ~1.2 GB | Recommended minimum for voice + core |
| x86_64 Desktop | ~1.0 GB | Faster model loading |

The speaker-id service downloads ECAPA-TDNN models (~80 MB) on first start. Models are persisted in the `speaker-id-models` Docker volume.


## Configuration Reference

Add to `config.toml` to enable the voice interface:

```toml
[voice]
enabled = true

[voice.mqtt]
broker_url = "mqtt://mosquitto:1883"
client_id = "pisovereign-voice"
keep_alive_secs = 30
max_inflight = 100
topic_prefix = "pisovereign"

[voice.wake_word]
service_url = "http://openwakeword:8083"
words = ["sovereign"]
sensitivity = 0.5          # 0.0–1.0, higher = fewer false positives
timeout_ms = 2000

[voice.speaker_id]
service_url = "http://speaker-id:8084"
min_enrollment_samples = 3
match_threshold = 0.75     # Cosine similarity threshold
timeout_ms = 5000

[voice.conversation]
follow_up_window_ms = 10000    # 10s follow-up after response
max_session_duration_ms = 300000  # 5 minutes max

[voice.whisper_mode]
enabled = true
quiet_start = "22:00"
quiet_end = "07:00"
quiet_volume = 30          # 0–100

[voice.gpio]
enabled = false            # Only on Raspberry Pi (aarch64 Linux)
privacy_led_pin = 17       # BCM GPIO pin number

[voice.rooms]
default_volume = 80        # 0–100
heartbeat_timeout_ms = 30000  # Mark offline after 30s silence
```

All values shown are defaults. The `[voice]` section is optional — when omitted, the voice subsystem is disabled.


## REST API Endpoints

All voice endpoints require API key authentication.

### Room Management

| Method | Path | Description |
|---|---|---|
| GET | `/v1/voice/rooms` | List all registered rooms |
| POST | `/v1/voice/rooms` | Register a new room |
| DELETE | `/v1/voice/rooms/{room_id}` | Remove a room |
| GET | `/v1/voice/rooms/{room_id}/session` | Get active session for a room |

### Speaker Management

| Method | Path | Description |
|---|---|---|
| GET | `/v1/voice/speakers` | List enrolled speaker profiles |
| DELETE | `/v1/voice/speakers/{speaker_id}` | Delete a speaker profile |

### Status

| Method | Path | Description |
|---|---|---|
| GET | `/v1/voice/status` | Voice subsystem health check |

### Example: Register a Room

```bash
curl -X POST http://localhost:3000/v1/voice/rooms \
  -H "Authorization: Bearer sk-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"name": "Kitchen"}'
```

Response:

```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "name": "Kitchen",
  "is_online": false,
  "volume": 80,
  "muted": false,
  "last_seen": "2025-01-15T10:30:00Z",
  "created_at": "2025-01-15T10:30:00Z"
}
```

### Example: Voice Status

```bash
curl http://localhost:3000/v1/voice/status \
  -H "Authorization: Bearer sk-your-api-key"
```

Response (voice disabled):

```json
{
  "enabled": false,
  "wake_word_available": false,
  "speaker_id_available": false,
  "mqtt_connected": false,
  "active_rooms": 0,
  "active_sessions": 0
}
```

## Privacy & Whisper Mode

### Privacy LED

On Raspberry Pi, a GPIO-connected LED indicates when the microphone is active:

- **LED on**: Audio is being captured and processed
- **LED off**: No active voice session

Configure the BCM GPIO pin in `[voice.gpio]`. Requires aarch64 Linux — the feature compiles to a no-op on other platforms.

### Whisper Mode

During quiet hours (default 22:00–07:00), whisper mode automatically:

1. Reduces TTS playback volume to the configured level (default 30%)
2. Uses softer TTS voice parameters when available
3. Restores normal volume outside quiet hours
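
Because the default quiet window spans midnight, the in-window test cannot be a single range comparison. A minimal sketch using minutes since midnight (hypothetical helper names; the real implementation may differ):

```rust
// Convert a clock time to minutes since midnight: 22:00 -> 1320, 07:00 -> 420.
fn minutes(hh: u32, mm: u32) -> u32 {
    hh * 60 + mm
}

/// True if `now` falls inside [start, end), treating a start later than the
/// end as a window that wraps past midnight (the 22:00–07:00 default).
fn in_quiet_hours(now: u32, start: u32, end: u32) -> bool {
    if start <= end {
        now >= start && now < end
    } else {
        now >= start || now < end
    }
}

/// Pick the playback volume: quiet_volume inside quiet hours, else normal.
fn effective_volume(now: u32, start: u32, end: u32, normal: u8, quiet: u8) -> u8 {
    if in_quiet_hours(now, start, end) { quiet } else { normal }
}
```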

## Troubleshooting

### Voice services not starting

```bash
# Check if voice profile is enabled
docker compose --profile voice ps

# Check individual service logs
docker compose --profile voice logs mosquitto
docker compose --profile voice logs openwakeword
docker compose --profile voice logs speaker-id
```

### Wake word not detected

- Verify openWakeWord is healthy: `curl http://localhost:8083/health` (run from inside the Docker network; the port is not published externally)
- Lower `sensitivity` (closer to 0.0) in `[voice.wake_word]`; higher values suppress detections along with false positives
- Check audio format: must be PCM 16-bit LE, 16 kHz, mono
- Check MQTT connectivity: `mosquitto_sub -t 'pisovereign/audio/#' -v`

### Speaker not recognized

- Ensure at least 3 enrollment samples are recorded
- Lower `match_threshold` (default 0.75) if false negatives are high
- Re-enroll in a quiet environment for better embeddings
- Check speaker-id service health: `curl http://localhost:8084/health` (run from inside the Docker network; the port is not published externally)

### Room shows offline

- Check the satellite heartbeat interval (must be shorter than `heartbeat_timeout_ms`)
- Verify the MQTT topic: `pisovereign/status/{room_id}`
- Check Mosquitto logs for connection issues

### GPIO LED not working

- Only supported on Raspberry Pi (aarch64 Linux)
- Verify the BCM pin number matches the physical wiring
- Check GPIO permissions (the user must be in the `gpio` group or run as root)