# Adaptive Model Routing

Complexity-based request routing to reduce latency and resource usage.
## Overview
Model routing classifies every incoming message into one of four complexity tiers and routes it to an appropriately sized model, or answers trivial messages from a template without calling any model at all.
| Tier | Default Model | Typical Latency | Use Case |
|---|---|---|---|
| Trivial | template (no LLM) | <10 ms | Greetings, thanks, farewells |
| Simple | gemma3:1b | ~0.5 s | Short factual questions |
| Moderate | gemma3:4b | ~2 s | Multi-turn conversations, explanations |
| Complex | gemma3:12b | ~6 s | Code generation, analysis, creative writing |
**Goal:** Route 60–70% of queries to the Trivial or Simple tier, reducing average response time from ~8 s to ~3 s.
## Configuration
Enable in `config.toml`:

```toml
[model_routing]
enabled = true

[model_routing.models]
trivial = "template"   # No LLM call
simple = "gemma3:1b"
moderate = "gemma3:4b"
complex = "gemma3:12b"

[model_routing.classification]
confidence_threshold = 0.6
max_simple_words = 15
max_simple_chars = 100
max_moderate_sentences = 5
complex_min_words = 50
complex_keywords = [
  "code", "implement", "explain", "analyze",
  "compare", "debug", "refactor", "translate",
]
trivial_patterns = [
  "^hi$", "^hello$", "^hey$", "^hallo$",
  "^moin$", "^danke$", "^thanks$",
]

[model_routing.templates]
greeting = ["Hello! How can I help?", "Hallo! Wie kann ich helfen?"]
farewell = ["Goodbye!", "Tschüss!"]
thanks = ["You're welcome!", "Gerne!"]
help = ["I can help with questions, tasks, weather, transit, and more."]
system_info = ["PiSovereign — your private AI assistant."]
unknown = ["How can I help you?", "Wie kann ich Ihnen helfen?"]
```
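Trivial-tier answering can be pictured as a plain lookup into these template lists. The sketch below is illustrative, not the actual PiSovereign code; it assumes the first configured template per category is used, with `unknown` as the fallback (the real implementation may pick by language or at random from each list):

```rust
// Hedged sketch of trivial-tier template answering: pick a canned response
// by category with no LLM call. Strings mirror the config above; the
// "first entry wins" rule is an assumption for illustration.
fn template_response(category: &str) -> &'static str {
    match category {
        "greeting" => "Hello! How can I help?",
        "farewell" => "Goodbye!",
        "thanks" => "You're welcome!",
        "help" => "I can help with questions, tasks, weather, transit, and more.",
        "system_info" => "PiSovereign — your private AI assistant.",
        // Any unrecognized category falls through to the `unknown` templates.
        _ => "How can I help you?",
    }
}

fn main() {
    assert_eq!(template_response("greeting"), "Hello! How can I help?");
    assert_eq!(template_response("weather"), "How can I help you?");
}
```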
## Docker Compose

When routing is enabled, Ollama needs to keep multiple models loaded. Set in `compose.yml`:

```yaml
OLLAMA_MAX_LOADED_MODELS: 2
```

This allows the small and large models to stay warm in memory simultaneously.
## How Classification Works
The rule-based classifier runs synchronously (no LLM call) and takes <1 ms:
- Trivial detection: Regex patterns, emoji-only, empty input → instant template
- Complex detection: Code patterns (backticks, keywords), high word count (≥50), configured keywords → large model
- Simple detection: Short messages (≤15 words, ≤100 chars), single sentence, no conversation history → small model
- Moderate fallback: Everything else, or follow-up messages in an ongoing conversation
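The four rules above can be sketched as a single pure function. Types and names here are illustrative, not the actual PiSovereign implementation; the thresholds mirror the config defaults, and the `^…$`-anchored trivial regexes reduce to exact string matches in this simplified form:

```rust
// Sketch of the rule-based classifier: trivial -> complex -> simple -> moderate.
#[derive(Debug, PartialEq)]
enum Tier { Trivial, Simple, Moderate, Complex }

const TRIVIAL_PATTERNS: &[&str] = &["hi", "hello", "hey", "hallo", "moin", "danke", "thanks"];
const COMPLEX_KEYWORDS: &[&str] = &["code", "implement", "explain", "analyze",
                                    "compare", "debug", "refactor", "translate"];

fn classify(message: &str, has_history: bool) -> Tier {
    let msg = message.trim().to_lowercase();
    let words: Vec<&str> = msg.split_whitespace().collect();

    // Trivial: empty input or an exact greeting/thanks match.
    if msg.is_empty() || TRIVIAL_PATTERNS.contains(&msg.as_str()) {
        return Tier::Trivial;
    }
    // Complex: code fences, high word count (>= 50), or configured keywords.
    if msg.contains("```")
        || words.len() >= 50
        || COMPLEX_KEYWORDS.iter().any(|k| words.contains(k))
    {
        return Tier::Complex;
    }
    // Simple: short single-sentence message with no conversation history.
    let sentences = msg.chars().filter(|&c| matches!(c, '.' | '!' | '?')).count().max(1);
    if words.len() <= 15 && msg.len() <= 100 && sentences == 1 && !has_history {
        return Tier::Simple;
    }
    // Everything else, including follow-ups in an ongoing conversation.
    Tier::Moderate
}

fn main() {
    assert_eq!(classify("hallo", false), Tier::Trivial);
    assert_eq!(classify("What is the capital of France?", false), Tier::Simple);
    assert_eq!(classify("Please debug this function for me", false), Tier::Complex);
    assert_eq!(classify("What about tomorrow?", true), Tier::Moderate);
}
```

Because every rule is a string or length check, the function stays synchronous and comfortably under the stated 1 ms budget.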
## Confidence & Tier Upgrades

Each classification includes a confidence score (0.0–1.0). When confidence falls below `confidence_threshold` (default: 0.6), the classifier upgrades the request to the next higher tier:
- Simple → Moderate
- Moderate → Complex
This ensures borderline cases use a more capable model rather than risk a poor response.
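The upgrade rule is a one-step promotion, sketched below under the same illustrative names as before (not the actual codebase):

```rust
// Hedged sketch of the tier-upgrade rule: below the threshold, bump one tier.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tier { Trivial, Simple, Moderate, Complex }

fn apply_upgrade(tier: Tier, confidence: f32, threshold: f32) -> Tier {
    if confidence >= threshold {
        return tier; // Confident enough: keep the classified tier.
    }
    match tier {
        Tier::Simple => Tier::Moderate,  // borderline Simple -> Moderate
        Tier::Moderate => Tier::Complex, // borderline Moderate -> Complex
        other => other,                  // Trivial and Complex stay put
    }
}

fn main() {
    assert_eq!(apply_upgrade(Tier::Simple, 0.45, 0.6), Tier::Moderate);
    assert_eq!(apply_upgrade(Tier::Simple, 0.9, 0.6), Tier::Simple);
    assert_eq!(apply_upgrade(Tier::Complex, 0.2, 0.6), Tier::Complex);
}
```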
## Metrics

Model routing exposes Prometheus metrics at `/metrics/prometheus`:

```text
model_routing_requests_total{tier="trivial"} 142
model_routing_requests_total{tier="simple"} 89
model_routing_requests_total{tier="moderate"} 45
model_routing_requests_total{tier="complex"} 24
model_routing_template_hits_total 142
model_routing_upgrades_total 12
```

The JSON `/metrics` endpoint also includes a `model_routing` object when routing is enabled.
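As a sanity check against the stated goal, the sample counters above put 231 of 300 requests (77%) in the two cheapest tiers. A small helper (illustrative, not part of the codebase) for computing that share from the counters:

```rust
// Illustrative helper: fraction of requests answered by the Trivial or
// Simple tier, for comparison against the 60-70% routing goal.
fn cheap_tier_share(trivial: u64, simple: u64, moderate: u64, complex: u64) -> f64 {
    let total = (trivial + simple + moderate + complex) as f64;
    (trivial + simple) as f64 / total
}

fn main() {
    // Sample counter values above: (142 + 89) / 300 = 0.77.
    let share = cheap_tier_share(142, 89, 45, 24);
    assert!((share - 0.77).abs() < 1e-12);
}
```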
## Decorator Chain

When model routing is enabled, the inference decorator chain becomes:

```text
Per tier:
  OllamaInferenceAdapter(tier_model)
    → DegradedInferenceAdapter (per-tier circuit breaker)

ModelRoutingAdapter
  → classifies message → selects tier adapter
  → delegates to appropriate tier

CachedInferenceAdapter (shared across all tiers)
  → SanitizedInferencePort (shared output filter)
    → ChatService
```

When disabled, the chain is the standard single-model path:

```text
OllamaInferenceAdapter → Degraded → Cached → Sanitized → ChatService
```
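The routing step itself is plain decorator delegation. The sketch below uses a simplified stand-in for the `InferencePort` trait (the real trait's signature differs, and its classification is richer than this one-line word count check):

```rust
// Minimal sketch of ModelRoutingAdapter delegation behind a shared trait.
trait InferencePort {
    fn complete(&self, message: &str) -> String;
}

/// Stand-in for an OllamaInferenceAdapter pinned to one tier's model.
struct TierAdapter {
    model: &'static str,
}

impl InferencePort for TierAdapter {
    fn complete(&self, message: &str) -> String {
        // A real adapter would call Ollama; here we just tag the reply.
        format!("[{}] {}", self.model, message)
    }
}

/// Classifies the message, then delegates to the matching tier adapter.
struct ModelRoutingAdapter {
    simple: Box<dyn InferencePort>,
    complex: Box<dyn InferencePort>,
}

impl InferencePort for ModelRoutingAdapter {
    fn complete(&self, message: &str) -> String {
        // Stand-in classification: short inputs go to the small model.
        if message.split_whitespace().count() <= 15 {
            self.simple.complete(message)
        } else {
            self.complex.complete(message)
        }
    }
}

fn main() {
    let router = ModelRoutingAdapter {
        simple: Box::new(TierAdapter { model: "gemma3:1b" }),
        complex: Box::new(TierAdapter { model: "gemma3:12b" }),
    };
    let reply = router.complete("What time is it?");
    assert!(reply.starts_with("[gemma3:1b]"));
}
```

Because the router implements the same trait it wraps, the shared cache and sanitizer decorators sit in front of it unchanged.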
## Backward Compatibility

- The old `[model_selector]` configuration is deprecated since v0.6.0
- Setting `model_routing.enabled = false` (or omitting the section) preserves the original single-model behavior
- No breaking changes to the `InferencePort` trait or HTTP API