# Token Optimization

The `ai_tokenopt` crate provides a full-spectrum, adaptive token optimization engine that
reduces token usage by 40–60% on typical conversations while preserving semantic fidelity.
It is context-window-aware, model-agnostic, and integrates seamlessly with the PiSovereign
inference pipeline.
## Architecture Overview

`ai_tokenopt` follows the same Hexagonal / Ports & Adapters pattern as the rest of the
codebase. It has no mandatory runtime dependencies — all external integrations (LLM
summarisation, HuggingFace tokenizer) are gated behind Cargo features or injected via
port traits.
```
┌──────────────────────────────────────────────────────────┐
│                    ai_tokenopt crate                     │
│                                                          │
│  TokenOptimizer ───────────────────────────────────────  │
│   │ 1. cross-turn RAG dedup                              │
│   │ 2. conciseness pressure injection                    │
│   │ 3. progressive tool compression                      │
│   │ 4. historical tool result truncation                 │
│   │ 5. system prompt trim                                │
│   │ 6. extractive history compaction                     │
│   │ 7. LLM summarisation (optional, async)               │
│   │                                                      │
│   ├── TokenEstimator (heuristic or HF tokenizer)         │
│   ├── TokenBudget (adaptive window allocation)           │
│   ├── HistoryCompactor                                   │
│   ├── TemplateLoader (runtime prompt overrides)          │
│   └── OptimizationMetrics (Prometheus)                   │
│                                                          │
│  integration surface:                                    │
│  ┌────────────────────────────────────────────────────┐  │
│  │ PiSovereign feature                                │  │
│  │ TokenOptimizedInferencePort (decorator)            │  │
│  │ domain types (ChatMessage, ToolDefinition …)       │  │
│  └────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────┘
```
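Because external integrations arrive through injected port traits, a summarisation backend can be swapped without touching the optimizer. The sketch below illustrates the idea; the trait name, signature, and the truncating test double are assumptions for illustration, not the crate's actual API:

```rust
// Hypothetical port trait; the crate's real summarisation port will differ.
trait SummarizationPort {
    /// Condense a block of conversation history into a short summary.
    fn summarize(&self, history: &str, max_tokens: usize) -> Result<String, String>;
}

/// A no-op adapter useful for tests: truncates instead of summarising.
struct TruncatingSummarizer;

impl SummarizationPort for TruncatingSummarizer {
    fn summarize(&self, history: &str, max_tokens: usize) -> Result<String, String> {
        // Rough heuristic: roughly 4 characters per token.
        let max_chars = max_tokens * 4;
        Ok(history.chars().take(max_chars).collect())
    }
}
```

A production adapter (for example, one calling an LLM over HTTP) implements the same trait, so the optimizer core stays dependency-free.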
## Impact-Ordered Pipeline

The optimizer runs strategies in descending impact order so that the cheapest, highest-gain operations always fire first.

```mermaid
flowchart TD
    A[Incoming conversation] --> B{Within budget?}
    B -- yes --> Z[Return unchanged]
    B -- no --> C[1. Cross-turn RAG dedup]
    C --> D[2. Conciseness pressure injection]
    D --> E[3. Progressive tool schema strip]
    E --> F[4. Historical tool result truncation]
    F --> G{Still over budget?}
    G -- no --> Z2[Return optimised]
    G -- yes --> H[5. System prompt trim]
    H --> I[6. Extractive history compaction]
    I --> J{LLM port provided?}
    J -- no --> Z2
    J -- yes --> K[7. LLM summarisation fallback]
    K --> Z2
```

Each fired step is recorded in `OptimizationResult.plan` with an estimated savings figure.
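The early-exit loop in the flowchart can be modeled in a few lines. This is a simplified stand-in, not the crate's real strategy or result types; the `Step` struct and `run_pipeline` function are illustrative:

```rust
// Simplified model of the impact-ordered pipeline; names are illustrative.
struct Step {
    name: &'static str,
    savings: u32, // estimated tokens this strategy recovers
}

/// Apply strategies in fixed impact order, stopping as soon as the
/// conversation fits the budget. Returns (final token count, fired steps).
fn run_pipeline(mut tokens: u32, budget: u32, steps: &[Step]) -> (u32, Vec<&'static str>) {
    let mut fired = Vec::new();
    for step in steps {
        if tokens <= budget {
            break; // within budget: later, more aggressive steps never fire
        }
        tokens = tokens.saturating_sub(step.savings);
        fired.push(step.name);
    }
    (tokens, fired)
}
```

The key property is that aggressive, lossy steps (compaction, summarisation) only run when the cheap ones were not enough.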
## Integration Guide

### Standalone (no PiSovereign types)

```rust
use ai_tokenopt::{TokenOptimizer, TokenOptimizationConfig};
use ai_tokenopt::types::{Conversation, ChatMessage};

let config = TokenOptimizationConfig {
    context_window_tokens: 8192,
    ..Default::default()
};
let optimizer = TokenOptimizer::new(config);

let conversation = Conversation::new(messages);
let result = optimizer.optimize_conversation(&conversation, None).await?;

// Inspect what fired
for step in &result.plan.steps {
    println!("{}: ~{} tokens saved", step.name, step.estimated_savings);
}

// Use the optimised conversation
send_to_llm(&result.conversation).await?;
```
### With PiSovereign (decorator pattern)

```rust
use ai_tokenopt::TokenOptimizedInferencePort;

// Wraps any InferencePort — optimization is transparent to callers
let optimized_port = TokenOptimizedInferencePort::new(
    inner_port,
    Arc::new(optimizer),
);
```
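Conceptually, the decorator optimises the input and then delegates to the wrapped port. A stripped-down, synchronous sketch of the pattern (the real `InferencePort` trait is async and richer; the trivial whitespace "optimization" here is a placeholder):

```rust
// Stripped-down sketch of the decorator pattern; the real InferencePort
// trait is async with a richer signature.
trait InferencePort {
    fn infer(&self, prompt: &str) -> String;
}

struct OptimizingPort<P: InferencePort> {
    inner: P,
}

impl<P: InferencePort> InferencePort for OptimizingPort<P> {
    fn infer(&self, prompt: &str) -> String {
        // "Optimise" first (here: trivially collapse runs of whitespace),
        // then delegate; callers never see the difference.
        let optimized = prompt.split_whitespace().collect::<Vec<_>>().join(" ");
        self.inner.infer(&optimized)
    }
}
```

Because the decorator implements the same trait it wraps, it can be layered into an existing pipeline without changing any call sites.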
### With Tools

```rust
let result = optimizer
    .optimize_conversation_with_tools(&conversation, &tools, None)
    .await?;

// result.tools contains the selected + compressed tool subset
send_with_tools(&result.conversation, &result.tools).await?;
```
## Configuration Reference

All fields are optional. Deserialise from the workspace `config.toml`:

```toml
[token_optimization]
enabled = true
context_window_tokens = 8192
response_headroom_ratio = 0.25
compaction_trigger_ratio = 0.70
max_summary_tokens = 256
system_prompt_budget_ratio = 0.15
rag_budget_ratio = 0.15
repetition_detection_enabled = true
repetition_ngram_size = 3
repetition_threshold = 0.3
max_tools_per_request = 8

# v2 enhancements
output_max_tokens = 512
frequency_penalty = 1.1
presence_penalty = 0.6
progressive_tool_compression = true
conciseness_pressure_threshold = 0.7
tool_result_max_tokens = 100
max_history_tokens = 4096
max_profile_prompt_tokens = 300
prompt_template_dir = "/etc/pisovereign/prompts"  # optional runtime overrides
tokenizer_model = "meta-llama/Llama-3.2-3B"       # optional; requires hf-tokenizer feature
```
### Key Parameters

| Parameter | Default | Effect |
|---|---|---|
| `context_window_tokens` | 8192 | Should match the model's `num_ctx`; auto-detected at startup |
| `compaction_trigger_ratio` | 0.70 | Compact when history exceeds this fraction of the available history budget |
| `conciseness_pressure_threshold` | 0.70 | Injects a brevity directive into the system prompt above this usage ratio |
| `tool_result_max_tokens` | 100 | Historical tool messages are truncated to this many tokens |
| `max_profile_prompt_tokens` | 300 | Token cap for agent profile sections in the system prompt |
| `prompt_template_dir` | none | Directory checked first for `<name>.prompt.txt` runtime overrides |
| `progressive_tool_compression` | true | Compresses tool schemas that have appeared in recent turns |
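To see how the ratio parameters interact, here is the budget arithmetic under one plausible interpretation: headroom, system prompt, and RAG each take a fixed fraction of the window, and history gets the remainder. The `Budget` struct and `allocate` function are illustrative; the crate's actual `TokenBudget` allocation may differ in detail:

```rust
// Illustrative budget split; the crate's real TokenBudget logic may differ.
struct Budget {
    response_headroom: u32,
    system_prompt: u32,
    rag: u32,
    history: u32,
}

fn allocate(context_window: u32, headroom_ratio: f64, sys_ratio: f64, rag_ratio: f64) -> Budget {
    let response_headroom = (context_window as f64 * headroom_ratio) as u32;
    let system_prompt = (context_window as f64 * sys_ratio) as u32;
    let rag = (context_window as f64 * rag_ratio) as u32;
    // Whatever remains is available for conversation history.
    let history = context_window - response_headroom - system_prompt - rag;
    Budget { response_headroom, system_prompt, rag, history }
}
```

With the defaults above (8192-token window, 0.25 headroom, 0.15 system prompt, 0.15 RAG), roughly 3.7k tokens remain for history, and `compaction_trigger_ratio` applies to that remainder.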
## Runtime Prompt Template Overrides

Built-in prompt templates (summarisation, conciseness directive, etc.) are compiled into
the binary from YAML. At runtime the `TemplateLoader` checks a configurable directory
first, enabling zero-downtime prompt tuning without a redeploy:

```bash
# Drop a custom summarisation prompt into the overrides directory
echo "Summarise the conversation in ≤3 bullet points." \
  > /etc/pisovereign/prompts/summarize.prompt.txt
```

Set `prompt_template_dir` in `config.toml` or via the `TOKEN_OPTIMIZATION__PROMPT_TEMPLATE_DIR`
environment variable to activate the override directory.
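The resolution order is a straightforward check-directory-first fallback. A minimal sketch, assuming the `<name>.prompt.txt` naming convention described above (the real `TemplateLoader` may also cache and validate templates):

```rust
use std::fs;
use std::path::Path;

// Sketch of check-overrides-first template resolution; the real
// TemplateLoader may also cache and validate loaded templates.
fn load_template(override_dir: Option<&Path>, name: &str, builtin: &str) -> String {
    if let Some(dir) = override_dir {
        let candidate = dir.join(format!("{name}.prompt.txt"));
        if let Ok(text) = fs::read_to_string(&candidate) {
            return text; // runtime override wins
        }
    }
    builtin.to_string() // fall back to the compiled-in template
}
```

A missing or unreadable override file silently falls back to the built-in template, so a bad deploy of the overrides directory cannot break inference.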
## Observability

### Tracing

All `optimize_conversation` and `optimize_conversation_with_tools` calls emit tracing
spans at the `info` level. Fields present on each span:

| Field | Description |
|---|---|
| `msgs` | Number of messages in the incoming conversation |
| `enabled` | Whether optimization is active for this call |
| `tools` | Number of candidate tools (`optimize_conversation_with_tools` only) |
### Prometheus Metrics

The `OptimizationMetrics` struct registers the following counters and gauges under the
`token_opt_` prefix (exposed via the `/metrics` endpoint):

| Metric | Type | Description |
|---|---|---|
| `token_opt_optimizations_total` | Counter | Total optimization calls |
| `token_opt_compactions_total` | Counter | Calls that triggered compaction |
| `token_opt_tokens_saved_total` | Counter | Cumulative estimated tokens saved |
| `token_opt_compression_ratio` | Gauge | Rolling compression ratio (saved ÷ original) |
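The saved ÷ original arithmetic behind the gauge can be shown with plain atomics. This is a sketch of the computation only: the counter names and the use of `AtomicU64` instead of the prometheus crate's types are assumptions for illustration:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Sketch of the saved ÷ original ratio, using plain atomics instead of
// the prometheus crate's counter and gauge types.
struct Metrics {
    tokens_original_total: AtomicU64,
    tokens_saved_total: AtomicU64,
}

impl Metrics {
    fn record(&self, original: u64, saved: u64) {
        self.tokens_original_total.fetch_add(original, Ordering::Relaxed);
        self.tokens_saved_total.fetch_add(saved, Ordering::Relaxed);
    }

    /// Rolling compression ratio: cumulative saved ÷ cumulative original.
    fn compression_ratio(&self) -> f64 {
        let original = self.tokens_original_total.load(Ordering::Relaxed);
        if original == 0 {
            return 0.0; // avoid division by zero before the first call
        }
        self.tokens_saved_total.load(Ordering::Relaxed) as f64 / original as f64
    }
}
```

A ratio of 0.5, for example, means the optimizer is estimated to have removed half of the tokens it was handed.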
## Feature Flags

| Feature | Default | Description |
|---|---|---|
| `pisovereign` | off | Domain-type integration + `TokenOptimizedInferencePort` |
| `hf-tokenizer` | on | HuggingFace tokenizers for precise per-token counts |
| `ollama` | off | `OllamaSummarizationAdapter` (HTTP-based LLM compaction) |

Disable `hf-tokenizer` in resource-constrained environments to reduce compile times and
binary size while still benefiting from the heuristic estimator.
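The heuristic estimator approximates token counts from text length; a common rule of thumb for English text is roughly four characters per token. A minimal sketch under that assumption (the crate's actual heuristic may be more refined):

```rust
// Minimal sketch of a character-based token estimate; the crate's
// heuristic estimator may use a more refined formula.
fn estimate_tokens(text: &str) -> usize {
    // ~4 characters per token is a common rule of thumb for English text.
    text.chars().count().div_ceil(4)
}
```

This is inexact per message, but cheap and dependency-free, which is why it makes a reasonable fallback when the HF tokenizer feature is disabled.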
## Benchmarks

```bash
cargo bench -p ai_tokenopt
# HTML reports: target/criterion/report/index.html
```

| Benchmark Group | What it measures |
|---|---|
| `token_estimation` | Heuristic vs HF estimator throughput |
| `budget_allocation` | Window split across 5–200 messages |
| `tool_compression` | Schema compression + selection (5–50 tools) |
| `history_compaction` | Full pipeline with forced compaction |
| `full_pipeline` | End-to-end `optimize_conversation[_with_tools]` |
## Testing

Unit tests live inline in each module (`#[cfg(test)] mod tests`). Integration tests are in
`crates/ai_tokenopt/tests/`. Property-based tests use `proptest`.

```bash
cargo test -p ai_tokenopt                          # unit + integration tests
cargo test -p ai_tokenopt --features pisovereign   # with domain types
```
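A representative invariant of the kind a property-based suite checks is that truncation never grows its input and is idempotent. Sketched below with a plain loop over fixed inputs and trivial stand-in logic, rather than proptest's generated data or the crate's real compactor:

```rust
// Trivial stand-in for tool-result truncation; not the crate's real compactor.
fn truncate_result(text: &str, max_chars: usize) -> String {
    text.chars().take(max_chars).collect()
}

/// Check two invariants over a set of inputs: output never exceeds the cap,
/// and truncating twice gives the same result as truncating once.
fn check_truncation_invariants(inputs: &[&str], max_chars: usize) -> bool {
    inputs.iter().all(|s| {
        let once = truncate_result(s, max_chars);
        let twice = truncate_result(&once, max_chars);
        once.chars().count() <= max_chars && once == twice
    })
}
```

With proptest, the same property would run against thousands of generated strings instead of a hand-picked list.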
> **Tip:** When constructing `ChatMessage::tool(...)` in tests, use the `make_tool_msg()` / `tool_msg()` cfg-conditional helper defined in the test modules — the `pisovereign` feature changes the constructor signature.