# Voice-First Interface

πŸŽ™οΈ Multi-room voice interaction with wake word detection, speaker identification, and privacy controls

This document covers the architecture, data flow, configuration, and deployment of PiSovereign’s voice-first interface.

## Overview

The voice-first interface turns PiSovereign into a hands-free AI assistant. Audio streams from satellite devices (Raspberry Pi Zero 2W, ESP32-S3, etc.) over MQTT. The server handles wake word detection, speech-to-text, LLM inference, text-to-speech, and speaker identification β€” all locally, with no cloud dependency.

Key capabilities:

- **Multi-room audio** — Independent rooms with individual volume, mute, and online status via MQTT
- **Wake word detection** — “Sovereign” (customizable) via openWakeWord with configurable sensitivity
- **Continuous conversation** — Follow-up window after a response, no re-triggering needed
- **Speaker identification** — ECAPA-TDNN embeddings match voice to enrolled user profiles
- **Whisper mode** — Automatic volume reduction during quiet hours
- **Privacy LED** — GPIO-controlled indicator when microphone is active (Raspberry Pi only)

## Architecture

### Component Diagram

```mermaid
graph TB
    subgraph Satellites["Room Satellites"]
        SAT1["🎙️ Kitchen<br/>RPi Zero 2W"]
        SAT2["🎙️ Office<br/>ESP32-S3"]
        SAT3["🎙️ Bedroom<br/>RPi Zero 2W"]
    end

    subgraph Docker["Docker Stack (voice profile)"]
        MQTT["Mosquitto<br/>MQTT Broker"]
        OWW["openWakeWord<br/>Wake Word Detection"]
        SID["Speaker-ID<br/>ECAPA-TDNN"]
        APP["PiSovereign<br/>Voice Pipeline"]
        STT["Whisper<br/>Speech-to-Text"]
        TTS["Piper<br/>Text-to-Speech"]
        LLM["Ollama<br/>LLM Inference"]
    end

    SAT1 & SAT2 & SAT3 -->|"Audio PCM<br/>QoS 0"| MQTT
    MQTT -->|"Subscribe"| APP
    APP -->|"Audio chunks"| OWW
    APP -->|"Voice sample"| SID
    APP -->|"Speech audio"| STT
    APP -->|"Response text"| TTS
    APP -->|"User query"| LLM
    APP -->|"TTS audio<br/>QoS 1"| MQTT
    MQTT -->|"Playback"| SAT1 & SAT2 & SAT3
```

### Audio Pipeline Flow

```mermaid
sequenceDiagram
    participant Sat as Room Satellite
    participant MQTT as Mosquitto
    participant VP as Voice Pipeline
    participant WW as openWakeWord
    participant STT as Whisper STT
    participant LLM as Ollama
    participant TTS as Piper TTS
    participant SID as Speaker-ID

    Sat->>MQTT: Publish audio (PCM 16-bit, 16kHz)
    MQTT->>VP: Deliver audio chunk
    VP->>WW: Check for wake word
    WW-->>VP: Detection (word, confidence)

    Note over VP: Wake word detected → start session

    VP->>SID: Identify speaker (audio sample)
    SID-->>VP: SpeakerMatch {id, confidence}
    VP->>VP: Activate privacy LED (GPIO)
    VP->>STT: Transcribe speech
    STT-->>VP: Transcription text
    VP->>LLM: Generate response
    LLM-->>VP: Response text
    VP->>TTS: Synthesize speech
    TTS-->>VP: Audio PCM
    VP->>MQTT: Publish response audio
    MQTT->>Sat: Play response

    Note over VP: Enter follow-up window (10s default)
```

### Voice Session State Machine

```mermaid
stateDiagram-v2
    [*] --> Listening: Wake word detected
    Listening --> Processing: Speech captured
    Processing --> Responding: LLM response ready
    Responding --> FollowUp: TTS playback complete
    FollowUp --> Listening: Follow-up speech detected
    FollowUp --> Ended: Timeout (10s default)
    Listening --> Ended: Max duration (5 min)
    Processing --> Ended: Error / timeout
    Responding --> Ended: Error
```

| State | Description | Accepts Input |
|---|---|---|
| Listening | Microphone active, capturing speech | ✅ |
| Processing | STT → LLM pipeline running | ❌ |
| Responding | TTS playing response audio | ❌ |
| FollowUp | Waiting for follow-up within time window | ✅ |
| Ended | Session terminated | ❌ |
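
The transition table above can be sketched as a small state machine. This is an illustrative model, not the crate's actual types: the real `VoiceSessionState` lives in the domain layer, and the event names here are assumptions.

```rust
// Illustrative FSM for a voice session; event names are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum VoiceSessionState {
    Listening,
    Processing,
    Responding,
    FollowUp,
    Ended,
}

#[derive(Debug, Clone, Copy)]
enum SessionEvent {
    SpeechCaptured,
    ResponseReady,
    PlaybackComplete,
    FollowUpSpeech,
    Timeout,
    Error,
}

fn transition(state: VoiceSessionState, event: SessionEvent) -> VoiceSessionState {
    use SessionEvent::*;
    use VoiceSessionState::*;
    match (state, event) {
        (Listening, SpeechCaptured) => Processing,
        (Processing, ResponseReady) => Responding,
        (Responding, PlaybackComplete) => FollowUp,
        (FollowUp, FollowUpSpeech) => Listening,
        // Any timeout or error terminates the session.
        (_, Timeout) | (_, Error) => Ended,
        // All other event/state combinations leave the state unchanged.
        (s, _) => s,
    }
}
```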

## Domain Model

### Voice Room

A `VoiceRoom` represents a physical location with a satellite audio device.

| Field | Type | Description |
|---|---|---|
| `id` | `RoomId` (UUID) | Auto-generated unique identifier |
| `name` | `String` | Human-readable name (e.g., “Kitchen”) |
| `is_online` | `bool` | Whether the satellite is sending heartbeats |
| `volume` | `u8` | Playback volume (0–100) |
| `muted` | `bool` | Whether audio output is suppressed |
| `last_seen` | `DateTime<Utc>` | Last heartbeat timestamp |
| `created_at` | `DateTime<Utc>` | Registration timestamp |

Rooms go offline automatically when heartbeats stop arriving (configurable timeout, default 30s).
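
The offline rule can be sketched as a pure function over heartbeat timestamps. Field and function names below are assumptions; the real `VoiceRoomService` is async and persists rooms.

```rust
// Illustrative staleness check; timestamps are epoch milliseconds to mirror
// the heartbeat_timeout_ms config value.
struct Room {
    name: String,
    is_online: bool,
    last_seen_ms: u64,
}

/// Mark rooms offline when their last heartbeat is older than the timeout
/// (default 30 s). Returns how many rooms were marked offline.
fn check_stale_rooms(rooms: &mut [Room], timeout_ms: u64, now_ms: u64) -> usize {
    let mut marked = 0;
    for room in rooms.iter_mut() {
        if room.is_online && now_ms.saturating_sub(room.last_seen_ms) > timeout_ms {
            room.is_online = false;
            marked += 1;
        }
    }
    marked
}
```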

### Voice Session

A `VoiceSession` tracks a single voice interaction within a room.

| Field | Type | Description |
|---|---|---|
| `id` | `VoiceSessionId` (UUID) | Auto-generated session identifier |
| `room_id` | `RoomId` | Room where the session is active |
| `speaker_id` | `Option<SpeakerId>` | Identified speaker (if enrolled) |
| `conversation_id` | `ConversationId` | Links to LLM conversation context |
| `state` | `VoiceSessionState` | Current FSM state (see diagram above) |
| `follow_up_window_ms` | `u64` | How long to wait for follow-up (default 10s) |
| `exchange_count` | `u32` | Number of user↔assistant exchanges |

### Voice Profile & Speaker Identification

A `VoiceProfile` stores enrolled speaker embeddings for recognition.

- Embedding model: ECAPA-TDNN (SpeechBrain), 192-dimensional vectors
- Minimum enrollment: 3 audio samples for reliable identification
- Matching: Cosine similarity between embeddings, configurable threshold (default 0.75)
- Audio format: PCM 16-bit little-endian, 16 kHz, mono

```mermaid
graph LR
    subgraph Enrollment["Speaker Enrollment (3+ samples)"]
        A1["🎤 Sample 1"] --> E1["Embedding"]
        A2["🎤 Sample 2"] --> E2["Embedding"]
        A3["🎤 Sample 3"] --> E3["Embedding"]
    end

    subgraph Recognition["Runtime Recognition"]
        AX["🎤 Live Audio"] --> EX["Live Embedding"]
        EX --> COS["Cosine Similarity"]
        E1 & E2 & E3 --> COS
        COS -->|"> 0.75"| MATCH["✅ Speaker Identified"]
        COS -->|"≤ 0.75"| UNKNOWN["❓ Unknown Speaker"]
    end
```
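
The matching step amounts to cosine similarity against each enrolled embedding. A minimal sketch, with toy 3-dimensional vectors standing in for the 192-dimensional ECAPA-TDNN embeddings and hypothetical helper names:

```rust
// Cosine similarity between two embedding vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Index of the best-matching enrolled embedding, if any clears the
/// threshold (default 0.75); otherwise the speaker is unknown.
fn best_match(live: &[f32], enrolled: &[Vec<f32>], threshold: f32) -> Option<usize> {
    enrolled
        .iter()
        .enumerate()
        .map(|(i, e)| (i, cosine_similarity(live, e)))
        .filter(|(_, sim)| *sim > threshold)
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(i, _)| i)
}
```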

## Application Services

### VoiceRoomService

Manages room lifecycle and health monitoring.

| Method | Description |
|---|---|
| `register_room(name)` | Register a new room satellite |
| `heartbeat(room_id)` | Update last-seen timestamp, mark online |
| `check_stale_rooms()` | Mark rooms offline if heartbeat timeout exceeded |
| `set_volume(room_id, volume)` | Adjust playback volume (0–100) |
| `toggle_mute(room_id)` | Toggle mute state |
| `list_rooms()` | List all registered rooms |
| `remove_room(room_id)` | Unregister a room |

A background task calls `check_stale_rooms()` every 30 seconds.

### VoiceSessionService

Manages voice session state transitions.

| Method | Description |
|---|---|
| `start_or_resume_session(room_id)` | Start new session or resume from FollowUp |
| `mark_processing(session_id)` | Transition to Processing state |
| `mark_responding(session_id)` | Transition to Responding state |
| `enter_follow_up(session_id)` | Transition to FollowUp state |
| `end_session(session_id)` | Terminate session |
| `get_active_session(room_id)` | Query current session for a room |

### SpeakerEnrollmentService

Manages speaker profile enrollment and identification data.

| Method | Description |
|---|---|
| `add_enrollment_sample(user_id, name, audio)` | Add voice sample (creates profile if needed) |
| `enrollment_status(user_id)` | Check enrollment progress |
| `delete_profile(speaker_id)` | Remove all speaker data |
| `list_profiles()` | List all enrolled speakers |
| `rename_profile(speaker_id, name)` | Update display name |

### VoicePipelineService

Orchestrates the full voice interaction pipeline (wake word → STT → LLM → TTS).

| Method | Description |
|---|---|
| `process_voice_command(room_id, audio)` | Full pipeline: detect speaker, transcribe, infer, synthesize |
| `is_wake_word_available()` | Health check for wake word service |
| `is_speaker_id_available()` | Health check for speaker-id service |
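
Conceptually, `process_voice_command` chains the four stages in order. The sketch below is a deliberately simplified, synchronous stand-in: every stage function is a hypothetical stub, while the real service calls out asynchronously through the port traits.

```rust
// Hypothetical, simplified view of the pipeline stages; not the crate's API.
struct VoiceResponse {
    speaker: Option<String>,
    transcript: String,
    audio: Vec<u8>,
}

// Stand-in stage functions; the real ones are remote service calls.
fn identify_speaker(_audio: &[u8]) -> Option<String> { Some("alice".to_string()) }
fn transcribe(_audio: &[u8]) -> String { "what time is it".to_string() }
fn infer(transcript: &str) -> String { format!("You asked: {transcript}") }
fn synthesize(text: &str) -> Vec<u8> { text.as_bytes().to_vec() }

fn process_voice_command(audio: &[u8]) -> VoiceResponse {
    let speaker = identify_speaker(audio); // Speaker-ID (ECAPA-TDNN)
    let transcript = transcribe(audio);    // Whisper STT
    let reply = infer(&transcript);        // Ollama LLM
    let audio = synthesize(&reply);        // Piper TTS
    VoiceResponse { speaker, transcript, audio }
}
```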

## Port Traits

The voice subsystem defines 7 port traits in `crates/application/src/ports/`:

| Port | Purpose |
|---|---|
| `WakeWordPort` | Detect wake words in audio chunks |
| `SpeakerIdentificationPort` | Identify/enroll speakers from audio |
| `MqttPort` | Publish/subscribe to MQTT topics |
| `VoiceRoomPort` | Persist room entities |
| `VoiceSessionPort` | Persist session state |
| `VoiceProfileStore` | Persist speaker profiles and embeddings |
| `GpioPort` | Control privacy LED (RPi only) |

All ports use `#[async_trait]` and support mockall via `#[cfg_attr(test, automock)]`.
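
A port trait might look roughly like the following. This is an illustrative shape only: the real traits are async via `#[async_trait]`, and everything except the `WakeWordPort` name is an assumption. It is shown synchronously so the sketch needs no external crates.

```rust
// Illustrative port trait plus a trivial test double, the kind of
// stand-in mockall would generate in tests. All names beyond
// WakeWordPort are hypothetical.
#[derive(Debug, PartialEq)]
struct WakeWordDetection {
    word: String,
    confidence: f32,
}

trait WakeWordPort {
    /// Scan an audio chunk (PCM 16-bit LE, 16 kHz, mono) for a wake word.
    fn detect(&self, audio: &[u8]) -> Option<WakeWordDetection>;
}

struct AlwaysDetects;

impl WakeWordPort for AlwaysDetects {
    fn detect(&self, _audio: &[u8]) -> Option<WakeWordDetection> {
        Some(WakeWordDetection { word: "sovereign".to_string(), confidence: 0.9 })
    }
}
```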


## MQTT Topic Structure

All topics use a configurable prefix (default: `pisovereign`).

| Topic Pattern | QoS | Direction | Description |
|---|---|---|---|
| `{prefix}/audio/{room_id}/input` | 0 | Satellite → Server | Raw PCM audio stream |
| `{prefix}/audio/{room_id}/output` | 1 | Server → Satellite | TTS response audio |
| `{prefix}/wake/{room_id}` | 1 | Server → Satellite | Wake word detection event |
| `{prefix}/control/{room_id}` | 1 | Bidirectional | Volume, mute, and room control commands |
| `{prefix}/status/{room_id}` | 0 | Satellite → Server | Heartbeat and status updates |

Audio format: PCM 16-bit little-endian, 16 kHz, mono.
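
Topic strings follow the patterns above. A small sketch of building and parsing them (hypothetical helper names, not the crate's actual API):

```rust
// Hypothetical helpers for the topic patterns documented above.
fn audio_input_topic(prefix: &str, room_id: &str) -> String {
    format!("{prefix}/audio/{room_id}/input")
}

fn status_topic(prefix: &str, room_id: &str) -> String {
    format!("{prefix}/status/{room_id}")
}

/// Parse a room id back out of a status topic, e.g. when dispatching
/// heartbeats. Returns None if the topic doesn't match `{prefix}/status/+`.
fn room_from_status_topic<'a>(prefix: &str, topic: &'a str) -> Option<&'a str> {
    let rest = topic.strip_prefix(prefix)?.strip_prefix("/status/")?;
    // Room ids contain no further slashes.
    if rest.is_empty() || rest.contains('/') { None } else { Some(rest) }
}
```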


## Docker Deployment

### Services

The voice stack consists of three services, activated via the `voice` Docker Compose profile:

| Service | Image | Port | Memory | Purpose |
|---|---|---|---|---|
| mosquitto | `eclipse-mosquitto:2` | 1883 (internal) | 128 MB | MQTT message broker |
| openwakeword | Custom (Python) | 8083 (internal) | 512 MB | Wake word detection via openWakeWord |
| speaker-id | Custom (Python) | 8084 (internal) | 512 MB | Speaker identification via SpeechBrain ECAPA-TDNN |

All services run on the internal `pisovereign-network` and are never exposed externally.

### Starting the Voice Stack

```bash
# Start core + voice services
just docker-up  # if voice profile is in COMPOSE_PROFILES

# Or explicitly with the voice profile
docker compose --profile voice up -d

# Verify services are healthy
docker compose --profile voice ps
```

### Resource Requirements

| Platform | RAM (voice stack) | Notes |
|---|---|---|
| Raspberry Pi 5 (8 GB) | ~1.2 GB | Recommended minimum for voice + core |
| x86_64 Desktop | ~1.0 GB | Faster model loading |

The speaker-id service downloads ECAPA-TDNN models (~80 MB) on first start. Models are persisted in the `speaker-id-models` Docker volume.


## Configuration Reference

Add to `config.toml` to enable the voice interface:

```toml
[voice]
enabled = true

[voice.mqtt]
broker_url = "mqtt://mosquitto:1883"
client_id = "pisovereign-voice"
keep_alive_secs = 30
max_inflight = 100
topic_prefix = "pisovereign"

[voice.wake_word]
service_url = "http://openwakeword:8083"
words = ["sovereign"]
sensitivity = 0.5          # 0.0–1.0, higher = fewer false positives
timeout_ms = 2000

[voice.speaker_id]
service_url = "http://speaker-id:8084"
min_enrollment_samples = 3
match_threshold = 0.75     # Cosine similarity threshold
timeout_ms = 5000

[voice.conversation]
follow_up_window_ms = 10000    # 10s follow-up after response
max_session_duration_ms = 300000  # 5 minutes max

[voice.whisper_mode]
enabled = true
quiet_start = "22:00"
quiet_end = "07:00"
quiet_volume = 30          # 0–100

[voice.gpio]
enabled = false            # Only on Raspberry Pi (aarch64 Linux)
privacy_led_pin = 17       # BCM GPIO pin number

[voice.rooms]
default_volume = 80        # 0–100
heartbeat_timeout_ms = 30000  # Mark offline after 30s silence
```

All values shown are defaults. The `[voice]` section is optional — when omitted, the voice subsystem is disabled.


## REST API Endpoints

All voice endpoints require API key authentication.

### Room Management

| Method | Path | Description |
|---|---|---|
| GET | `/v1/voice/rooms` | List all registered rooms |
| POST | `/v1/voice/rooms` | Register a new room |
| DELETE | `/v1/voice/rooms/{room_id}` | Remove a room |
| GET | `/v1/voice/rooms/{room_id}/session` | Get active session for a room |

### Speaker Management

| Method | Path | Description |
|---|---|---|
| GET | `/v1/voice/speakers` | List enrolled speaker profiles |
| DELETE | `/v1/voice/speakers/{speaker_id}` | Delete a speaker profile |

### Status

| Method | Path | Description |
|---|---|---|
| GET | `/v1/voice/status` | Voice subsystem health check |

### Example: Register a Room

```bash
curl -X POST http://localhost:3000/v1/voice/rooms \
  -H "Authorization: Bearer sk-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{"name": "Kitchen"}'
```

Response:

```json
{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "name": "Kitchen",
  "is_online": false,
  "volume": 80,
  "muted": false,
  "last_seen": "2025-01-15T10:30:00Z",
  "created_at": "2025-01-15T10:30:00Z"
}
```

### Example: Voice Status

```bash
curl http://localhost:3000/v1/voice/status \
  -H "Authorization: Bearer sk-your-api-key"
```

Response (voice disabled):

```json
{
  "enabled": false,
  "wake_word_available": false,
  "speaker_id_available": false,
  "mqtt_connected": false,
  "active_rooms": 0,
  "active_sessions": 0
}
```

## Privacy & Whisper Mode

### Privacy LED

On Raspberry Pi, a GPIO-connected LED indicates when the microphone is active:

- **LED on**: Audio is being captured and processed
- **LED off**: No active voice session

Configure the BCM GPIO pin in `[voice.gpio]`. Requires aarch64 Linux — the feature compiles to a no-op on other platforms.

### Whisper Mode

During quiet hours (default 22:00–07:00), whisper mode automatically:

1. Reduces TTS playback volume to the configured level (default 30%)
2. Uses softer TTS voice parameters when available
3. Restores normal volume outside quiet hours
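
Because the default quiet window spans midnight, the in-window test cannot be a single range comparison. A minimal sketch using minutes since midnight (hypothetical helper names; the real implementation may differ):

```rust
// Convert a clock time to minutes since midnight: 22:00 -> 1320, 07:00 -> 420.
fn minutes(hh: u32, mm: u32) -> u32 {
    hh * 60 + mm
}

/// True if `now` falls inside [start, end), treating a start later than the
/// end as a window that wraps past midnight (the 22:00–07:00 default).
fn in_quiet_hours(now: u32, start: u32, end: u32) -> bool {
    if start <= end {
        now >= start && now < end
    } else {
        now >= start || now < end
    }
}

/// Pick the playback volume: quiet_volume inside quiet hours, else normal.
fn effective_volume(now: u32, start: u32, end: u32, normal: u8, quiet: u8) -> u8 {
    if in_quiet_hours(now, start, end) { quiet } else { normal }
}
```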

## Troubleshooting

### Voice services not starting

```bash
# Check if voice profile is enabled
docker compose --profile voice ps

# Check individual service logs
docker compose --profile voice logs mosquitto
docker compose --profile voice logs openwakeword
docker compose --profile voice logs speaker-id
```

### Wake word not detected

- Verify openWakeWord is healthy: `curl http://localhost:8083/health` (run from inside the Docker network; the port is not published externally)
- Lower `sensitivity` (closer to 0.0) in `[voice.wake_word]`; higher values suppress detections along with false positives
- Check audio format: must be PCM 16-bit LE, 16 kHz, mono
- Check MQTT connectivity: `mosquitto_sub -t 'pisovereign/audio/#' -v`

### Speaker not recognized

- Ensure at least 3 enrollment samples are recorded
- Lower `match_threshold` (default 0.75) if false negatives are high
- Re-enroll in a quiet environment for better embeddings
- Check speaker-id service health: `curl http://localhost:8084/health` (run from inside the Docker network; the port is not published externally)

### Room shows offline

- Check the satellite heartbeat interval (must be shorter than `heartbeat_timeout_ms`)
- Verify the MQTT topic: `pisovereign/status/{room_id}`
- Check Mosquitto logs for connection issues

### GPIO LED not working

- Only supported on Raspberry Pi (aarch64 Linux)
- Verify the BCM pin number matches the physical wiring
- Check GPIO permissions (the user must be in the `gpio` group or run as root)