Adaptive Model Routing

Complexity-based request routing to reduce latency and resource usage

Overview

Model routing classifies every incoming message into one of four complexity tiers and routes it to an appropriately sized model, or answers trivial messages from a template without calling any model at all.

Tier      Default Model       Typical Latency   Use Case
Trivial   template (no LLM)   <10 ms            Greetings, thanks, farewells
Simple    gemma3:1b           ~0.5 s            Short factual questions
Moderate  gemma3:4b           ~2 s              Multi-turn conversations, explanations
Complex   gemma3:12b          ~6 s              Code generation, analysis, creative writing

Goal: Route 60–70% of queries to the Trivial or Simple tier, reducing average response time from ~8 s to ~3 s.

Configuration

Enable in config.toml:

[model_routing]
enabled = true

[model_routing.models]
trivial = "template"       # No LLM call
simple = "gemma3:1b"
moderate = "gemma3:4b"
complex = "gemma3:12b"

[model_routing.classification]
confidence_threshold = 0.6
max_simple_words = 15
max_simple_chars = 100
max_moderate_sentences = 5
complex_min_words = 50
complex_keywords = [
    "code", "implement", "explain", "analyze",
    "compare", "debug", "refactor", "translate"
]
trivial_patterns = [
    "^hi$", "^hello$", "^hey$", "^hallo$",
    "^moin$", "^danke$", "^thanks$"
]

[model_routing.templates]
greeting = ["Hello! How can I help?", "Hallo! Wie kann ich helfen?"]
farewell = ["Goodbye!", "Tschüss!"]
thanks = ["You're welcome!", "Gerne!"]
help = ["I can help with questions, tasks, weather, transit, and more."]
system_info = ["PiSovereign — your private AI assistant."]
unknown = ["How can I help you?", "Wie kann ich Ihnen helfen?"]
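The trivial tier never calls a model: a reply is picked from the configured templates by matched intent, falling back to `unknown`. A minimal sketch of that lookup, assuming hypothetical names (`template_reply` and the `HashMap` shape are illustrative, not the project's actual API):

```rust
use std::collections::HashMap;

// Hypothetical sketch: pick a canned reply for a trivial-tier intent.
// Intent keys mirror the [model_routing.templates] section above.
fn template_reply<'a>(templates: &HashMap<&str, Vec<&'a str>>, intent: &str) -> &'a str {
    templates
        .get(intent)
        .or_else(|| templates.get("unknown")) // fall back to the generic reply
        .and_then(|variants| variants.first())
        .copied()
        .unwrap_or("How can I help you?")
}

fn main() {
    let mut t = HashMap::new();
    t.insert("greeting", vec!["Hello! How can I help?", "Hallo! Wie kann ich helfen?"]);
    t.insert("unknown", vec!["How can I help you?"]);
    assert_eq!(template_reply(&t, "greeting"), "Hello! How can I help?");
    // Unconfigured intents fall back to the "unknown" template.
    assert_eq!(template_reply(&t, "farewell"), "How can I help you?");
    println!("ok");
}
```

A real implementation might also rotate or randomize among the variants; the first-entry choice here is purely for illustration.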

Docker Compose

When routing is enabled, Ollama needs to keep multiple models loaded. Set in compose.yml:

OLLAMA_MAX_LOADED_MODELS: 2

This allows the small and large models to stay warm in memory simultaneously.
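For example, assuming the Ollama service is named `ollama` in your compose.yml (the service name may differ in your deployment):

```yaml
services:
  ollama:
    environment:
      # Keep two models resident so switching tiers does not trigger a reload
      OLLAMA_MAX_LOADED_MODELS: 2
```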

How Classification Works

The rule-based classifier runs synchronously (no LLM call) and takes <1 ms:

  1. Trivial detection: Regex patterns, emoji-only, empty input → instant template
  2. Complex detection: Code patterns (backticks, keywords), high word count (≥50), configured keywords → large model
  3. Simple detection: Short messages (≤15 words, ≤100 chars), single sentence, no conversation history → small model
  4. Moderate fallback: Everything else, or follow-up messages in an ongoing conversation
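The four rules above can be sketched as a single pure function. This is an illustrative sketch, not the project's actual code: the names (`Tier`, `classify`) are hypothetical, and the thresholds hard-code the defaults from `[model_routing.classification]` rather than reading them from config:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Tier { Trivial, Simple, Moderate, Complex }

// Subset of the configured trivial patterns and complex keywords, for illustration.
const TRIVIAL: &[&str] = &["hi", "hello", "hey", "hallo", "moin", "danke", "thanks"];
const COMPLEX_KEYWORDS: &[&str] = &["code", "implement", "explain", "analyze",
                                    "compare", "debug", "refactor", "translate"];

fn classify(message: &str, has_history: bool) -> Tier {
    let text = message.trim().to_lowercase();
    let words: Vec<&str> = text.split_whitespace().collect();

    // 1. Trivial: empty input or an exact greeting/thanks match.
    if text.is_empty() || TRIVIAL.contains(&text.as_str()) {
        return Tier::Trivial;
    }
    // 2. Complex: code fences, high word count (>=50), or configured keywords.
    if text.contains("```")
        || words.len() >= 50
        || COMPLEX_KEYWORDS.iter().any(|k| words.contains(k))
    {
        return Tier::Complex;
    }
    // 3. Simple: short single-sentence message with no conversation history.
    let sentences = text.matches(|c: char| ".?!".contains(c)).count().max(1);
    if words.len() <= 15 && text.len() <= 100 && sentences == 1 && !has_history {
        return Tier::Simple;
    }
    // 4. Everything else, including follow-ups in a conversation, is Moderate.
    Tier::Moderate
}

fn main() {
    assert_eq!(classify("hello", false), Tier::Trivial);
    assert_eq!(classify("What time is it?", false), Tier::Simple);
    assert_eq!(classify("Please debug this function", false), Tier::Complex);
    // The same short question becomes Moderate once there is history.
    assert_eq!(classify("What time is it?", true), Tier::Moderate);
    println!("ok");
}
```

Because the rules are ordered most-specific first, the cheap checks (regex/exact match) short-circuit before any word counting, which is how the classifier stays under 1 ms.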

Confidence & Tier Upgrades

Each classification includes a confidence score (0.0–1.0). When confidence falls below the confidence_threshold (default: 0.6), the classifier upgrades to the next higher tier:

  • Simple → Moderate
  • Moderate → Complex

This ensures borderline cases use a more capable model rather than risk a poor response.
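A sketch of the upgrade step, assuming hypothetical names (`Tier`, `maybe_upgrade`); note that Trivial and Complex have no upgrade path:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Tier { Trivial, Simple, Moderate, Complex }

// Illustrative sketch: promote a low-confidence classification one tier up.
fn maybe_upgrade(tier: Tier, confidence: f64, threshold: f64) -> Tier {
    if confidence >= threshold {
        return tier; // confident enough, keep the classified tier
    }
    match tier {
        Tier::Simple => Tier::Moderate,
        Tier::Moderate => Tier::Complex,
        other => other, // Trivial and Complex stay as-is
    }
}

fn main() {
    assert_eq!(maybe_upgrade(Tier::Simple, 0.4, 0.6), Tier::Moderate);
    assert_eq!(maybe_upgrade(Tier::Simple, 0.9, 0.6), Tier::Simple);
    assert_eq!(maybe_upgrade(Tier::Complex, 0.1, 0.6), Tier::Complex);
    println!("ok");
}
```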

Metrics

Model routing exposes Prometheus metrics at /metrics/prometheus:

model_routing_requests_total{tier="trivial"} 142
model_routing_requests_total{tier="simple"} 89
model_routing_requests_total{tier="moderate"} 45
model_routing_requests_total{tier="complex"} 24
model_routing_template_hits_total 142
model_routing_upgrades_total 12

The JSON /metrics endpoint also includes a model_routing object when routing is enabled.
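With these counters, the 60–70% routing goal can be checked directly in Prometheus. For example, a query of roughly this shape (assuming a standard Prometheus scrape of the endpoint above) gives the share of requests served by the Trivial and Simple tiers over the last hour:

```
sum(rate(model_routing_requests_total{tier=~"trivial|simple"}[1h]))
  / sum(rate(model_routing_requests_total[1h]))
```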

Decorator Chain

When model routing is enabled, the inference decorator chain becomes:

Per tier:
  OllamaInferenceAdapter(tier_model)
    → DegradedInferenceAdapter (per-tier circuit breaker)

ModelRoutingAdapter
  → classifies message → selects tier adapter
  → delegates to appropriate tier

CachedInferenceAdapter (shared across all tiers)
  → SanitizedInferencePort (shared output filter)
    → ChatService

When disabled, the chain is the standard single-model path:

OllamaInferenceAdapter → Degraded → Cached → Sanitized → ChatService

Backward Compatibility

  • The old [model_selector] configuration has been deprecated since v0.6.0
  • Setting model_routing.enabled = false (or omitting the section) preserves the original single-model behavior
  • No breaking changes to the InferencePort trait or HTTP API