Token Optimization

The ai_tokenopt crate provides a full-spectrum, adaptive token optimization engine that reduces token usage by 40–60% on typical conversations while preserving semantic fidelity. It is context-window-aware, model-agnostic, and integrates seamlessly with the PiSovereign inference pipeline.


Architecture Overview

ai_tokenopt follows the same Hexagonal / Ports & Adapters pattern as the rest of the codebase. It has no mandatory runtime dependencies — all external integrations (LLM summarisation, HuggingFace tokenizer) are gated behind Cargo features or injected via port traits.

┌──────────────────────────────────────────────────────────┐
│                     ai_tokenopt crate                    │
│                                                          │
│  TokenOptimizer  ─────────────────────────────────────   │
│  │  1. cross-turn RAG dedup                              │
│  │  2. conciseness pressure injection                    │
│  │  3. progressive tool compression                      │
│  │  4. historical tool result truncation                 │
│  │  5. system prompt trim                                │
│  │  6. extractive history compaction                     │
│  │  7. LLM summarisation (optional, async)               │
│  │                                                       │
│  ├── TokenEstimator  (heuristic or HF tokenizer)         │
│  ├── TokenBudget     (adaptive window allocation)        │
│  ├── HistoryCompactor                                    │
│  ├── TemplateLoader  (runtime prompt overrides)          │
│  └── OptimizationMetrics (Prometheus)                    │
│                                                          │
│  integration surface:                                    │
│  ┌─────────────────────────────────────────────────┐     │
│  │ PiSovereign feature                             │     │
│  │  TokenOptimizedInferencePort (decorator)        │     │
│  │  domain types (ChatMessage, ToolDefinition …)   │     │
│  └─────────────────────────────────────────────────┘     │
└──────────────────────────────────────────────────────────┘
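The "injected via port traits" seam can be sketched as follows. This is a minimal illustration, not the crate's actual API: the trait, type, and method names below are hypothetical, and the real summarisation port is async.

```rust
// Hypothetical sketch of the ports-and-adapters seam: the optimizer depends
// only on a trait, and adapters (e.g. an Ollama-backed summariser) implement
// it. All names here are illustrative; the crate's real port is async.

/// Port: how the optimizer asks an external LLM to summarise history.
pub trait SummarizationPort {
    fn summarize(&self, text: &str, max_tokens: usize) -> Result<String, String>;
}

/// Toy adapter: extractive fallback that keeps only the first sentence.
pub struct FirstSentenceSummarizer;

impl SummarizationPort for FirstSentenceSummarizer {
    fn summarize(&self, text: &str, _max_tokens: usize) -> Result<String, String> {
        Ok(text.split('.').next().unwrap_or("").trim().to_string())
    }
}

/// The optimizer holds the port behind a trait object, so any adapter plugs in.
pub struct Optimizer {
    summarizer: Option<Box<dyn SummarizationPort>>,
}

impl Optimizer {
    pub fn compact(&self, history: &str) -> String {
        match &self.summarizer {
            // A port was injected: delegate summarisation to it.
            Some(port) => port
                .summarize(history, 256)
                .unwrap_or_else(|_| history.to_string()),
            // No port injected: the LLM summarisation step simply never fires.
            None => history.to_string(),
        }
    }
}
```

Because the dependency is a trait object, the crate compiles and runs with no LLM configured at all, which is what "no mandatory runtime dependencies" means in practice.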

Impact-Ordered Pipeline

The optimizer runs strategies in descending impact order so that the cheapest, highest-gain operations always fire first.

flowchart TD
    A[Incoming conversation] --> B{Within budget?}
    B -- yes --> Z[Return unchanged]
    B -- no --> C[1. Cross-turn RAG dedup]
    C --> D[2. Conciseness pressure injection]
    D --> E[3. Progressive tool schema strip]
    E --> F[4. Historical tool result truncation]
    F --> G{Still over budget?}
    G -- no --> Z2[Return optimised]
    G -- yes --> H[5. System prompt trim]
    H --> I[6. Extractive history compaction]
    I --> J{LLM port provided?}
    J -- no --> Z2
    J -- yes --> K[7. LLM summarisation fallback]
    K --> Z2

Each fired step is recorded in OptimizationResult.plan with an estimated savings figure.
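The control flow above can be sketched as a simple loop: apply strategies in order and stop as soon as the conversation fits the budget. This is a schematic reduction, not the crate's implementation; the strategy names and savings figures are illustrative.

```rust
// Sketch of the impact-ordered loop: run strategies cheapest/highest-gain
// first and stop once the conversation fits the budget, so later (more
// destructive) steps never fire unnecessarily. Names are illustrative.

pub struct Step {
    pub name: &'static str,
    pub estimated_savings: usize,
}

/// Apply strategies in order until `tokens <= budget`; record fired steps.
pub fn run_pipeline(
    mut tokens: usize,
    budget: usize,
    strategies: &[(&'static str, usize)],
) -> (usize, Vec<Step>) {
    let mut plan = Vec::new();
    for &(name, savings) in strategies {
        if tokens <= budget {
            break; // within budget: stop, leaving the rest of the pipeline idle
        }
        tokens = tokens.saturating_sub(savings);
        plan.push(Step { name, estimated_savings: savings });
    }
    (tokens, plan)
}
```

A conversation already within budget produces an empty plan, matching the "Return unchanged" branch in the flowchart.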


Integration Guide

Standalone (no PiSovereign types)

use ai_tokenopt::{TokenOptimizer, TokenOptimizationConfig};
use ai_tokenopt::types::{Conversation, ChatMessage};

let config = TokenOptimizationConfig {
    context_window_tokens: 8192,
    ..Default::default()
};

let optimizer = TokenOptimizer::new(config);

let conversation = Conversation::new(messages);
let result = optimizer.optimize_conversation(&conversation, None).await?;

// Inspect what fired
for step in &result.plan.steps {
    println!("{}: ~{} tokens saved", step.name, step.estimated_savings);
}

// Use the optimised conversation
send_to_llm(&result.conversation).await?;

With PiSovereign (decorator pattern)

use ai_tokenopt::TokenOptimizedInferencePort;

// Wraps any InferencePort — optimization is transparent to callers
let optimized_port = TokenOptimizedInferencePort::new(
    inner_port,
    Arc::new(optimizer),
);
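The decorator shape can be illustrated with a reduced sketch. The trait and method below are stand-ins, not the real `InferencePort` (which is async); the point is only that the wrapper implements the same trait as the inner port, so callers cannot tell optimization is happening.

```rust
// Reduced sketch of the decorator pattern: same trait in and out, with the
// optimisation step applied before delegating. Names are illustrative; the
// crate's real InferencePort is async.

pub trait Inference {
    fn infer(&self, prompt: &str) -> String;
}

/// Stand-in for a real backend port.
pub struct EchoPort;

impl Inference for EchoPort {
    fn infer(&self, prompt: &str) -> String {
        format!("reply to: {prompt}")
    }
}

/// Decorator: callers see the same trait; optimisation is transparent.
pub struct OptimizedPort<P: Inference> {
    pub inner: P,
}

impl<P: Inference> Inference for OptimizedPort<P> {
    fn infer(&self, prompt: &str) -> String {
        // Placeholder for the real optimisation pipeline.
        let optimised = prompt.trim();
        self.inner.infer(optimised)
    }
}
```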

With Tools

let result = optimizer
    .optimize_conversation_with_tools(&conversation, &tools, None)
    .await?;

// result.tools contains the selected + compressed tool subset
send_with_tools(&result.conversation, &result.tools).await?;

Configuration Reference

All fields are optional. Deserialise from the workspace config.toml:

[token_optimization]
enabled = true
context_window_tokens = 8192
response_headroom_ratio = 0.25
compaction_trigger_ratio = 0.70
max_summary_tokens = 256
system_prompt_budget_ratio = 0.15
rag_budget_ratio = 0.15
repetition_detection_enabled = true
repetition_ngram_size = 3
repetition_threshold = 0.3
max_tools_per_request = 8

# v2 enhancements
output_max_tokens = 512
frequency_penalty = 1.1
presence_penalty = 0.6
progressive_tool_compression = true
conciseness_pressure_threshold = 0.7
tool_result_max_tokens = 100
max_history_tokens = 4096
max_profile_prompt_tokens = 300
prompt_template_dir = "/etc/pisovereign/prompts"   # optional runtime overrides
tokenizer_model = "meta-llama/Llama-3.2-3B"        # optional; requires hf-tokenizer feature
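To make the `repetition_*` fields concrete, here is one plausible reading of them as "fraction of duplicate word n-grams", using the defaults above (`repetition_ngram_size = 3`, `repetition_threshold = 0.3`). The crate's actual repetition metric is not documented here, so treat this as an assumed sketch.

```rust
use std::collections::HashSet;

// Sketch of n-gram repetition detection with the config defaults
// (ngram_size = 3, threshold = 0.3). ASSUMPTION: the metric is the fraction
// of word trigrams that repeat an earlier trigram; the crate may differ.

pub fn repetition_ratio(text: &str, n: usize) -> f64 {
    let words: Vec<&str> = text.split_whitespace().collect();
    if words.len() < n {
        return 0.0; // too short to form a single n-gram
    }
    let total = words.len() - n + 1;
    let mut seen = HashSet::new();
    let mut dupes = 0usize;
    for gram in words.windows(n) {
        if !seen.insert(gram.join(" ")) {
            dupes += 1; // this n-gram appeared earlier in the text
        }
    }
    dupes as f64 / total as f64
}

pub fn is_repetitive(text: &str, n: usize, threshold: f64) -> bool {
    repetition_ratio(text, n) > threshold
}
```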

Key Parameters

| Parameter | Default | Effect |
|---|---|---|
| context_window_tokens | 8192 | Should match the model’s num_ctx; auto-detected at startup |
| compaction_trigger_ratio | 0.70 | Compact when history exceeds this fraction of the available history budget |
| conciseness_pressure_threshold | 0.70 | Injects a brevity directive into the system prompt above this usage ratio |
| tool_result_max_tokens | 100 | Historical tool messages are truncated to this many tokens |
| max_profile_prompt_tokens | 300 | Token cap for agent profile sections in the system prompt |
| prompt_template_dir | none | Directory checked first for <name>.prompt.txt runtime overrides |
| progressive_tool_compression | true | Compresses tool schemas that have appeared in recent turns |
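The ratio parameters imply a simple carve-up of the context window. The exact allocation order inside the crate is not documented here; the sketch below assumes response headroom is reserved first and the remainder is split by the system-prompt and RAG ratios, with history taking what is left.

```rust
// Sketch of budget allocation from the ratio parameters. ASSUMPTION:
// headroom comes off the top, then system prompt and RAG budgets are
// fractions of the remainder, and history gets the rest.

pub struct Budget {
    pub response_headroom: usize,
    pub system_prompt: usize,
    pub rag: usize,
    pub history: usize,
}

pub fn allocate(window: usize, headroom_ratio: f64,
                system_ratio: f64, rag_ratio: f64) -> Budget {
    let response_headroom = (window as f64 * headroom_ratio) as usize;
    let remaining = window - response_headroom;
    let system_prompt = (remaining as f64 * system_ratio) as usize;
    let rag = (remaining as f64 * rag_ratio) as usize;
    Budget {
        response_headroom,
        system_prompt,
        rag,
        history: remaining - system_prompt - rag,
    }
}

/// Compaction fires once history exceeds this many tokens
/// (compaction_trigger_ratio × history budget).
pub fn compaction_trigger(history_budget: usize, ratio: f64) -> usize {
    (history_budget as f64 * ratio) as usize
}
```

With the defaults (8192-token window, 0.25 headroom, 0.15 + 0.15 prompt/RAG), this split leaves roughly 4.3k tokens for history, and compaction triggers near 3k.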

Runtime Prompt Template Overrides

Built-in prompt templates (summarisation, conciseness directive, etc.) are compiled into the binary from YAML. At runtime the TemplateLoader checks a configurable directory first, enabling zero-downtime prompt tuning without a redeploy:

# Drop a custom summarisation prompt into the overrides directory
echo "Summarise the conversation in ≤3 bullet points." \
  > /etc/pisovereign/prompts/summarize.prompt.txt

Set prompt_template_dir in config.toml or via the TOKEN_OPTIMIZATION__PROMPT_TEMPLATE_DIR environment variable to activate the override directory.
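The lookup order can be sketched with a small function: try `<dir>/<name>.prompt.txt`, and fall back to the compiled-in template otherwise. The function name is illustrative, not the TemplateLoader's actual API.

```rust
use std::fs;
use std::path::Path;

// Sketch of the override lookup: a runtime file at <dir>/<name>.prompt.txt
// wins; otherwise the compiled-in template is used. Name is illustrative.

pub fn load_template(override_dir: Option<&Path>, name: &str, builtin: &str) -> String {
    if let Some(dir) = override_dir {
        let candidate = dir.join(format!("{name}.prompt.txt"));
        if let Ok(contents) = fs::read_to_string(&candidate) {
            return contents; // runtime override wins
        }
    }
    builtin.to_string() // compiled-in default
}
```

Because the file is read at call time rather than cached at startup, dropping a new file into the directory takes effect without a redeploy, which is the zero-downtime property described above (assuming the real loader behaves similarly).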


Observability

Tracing

All optimize_conversation and optimize_conversation_with_tools calls emit tracing spans at the info level. Fields present on each span:

| Field | Description |
|---|---|
| msgs | Number of messages in the incoming conversation |
| enabled | Whether optimization is active for this call |
| tools | Number of candidate tools (optimize_conversation_with_tools only) |

Prometheus Metrics

The OptimizationMetrics struct registers the following counters and gauges under the token_opt_ prefix (exposed via the /metrics endpoint):

| Metric | Type | Description |
|---|---|---|
| token_opt_optimizations_total | Counter | Total optimization calls |
| token_opt_compactions_total | Counter | Calls that triggered compaction |
| token_opt_tokens_saved_total | Counter | Cumulative estimated tokens saved |
| token_opt_compression_ratio | Gauge | Rolling compression ratio (saved ÷ original) |
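The relationship between the counters and the gauge can be sketched with plain atomics, standing in for the Prometheus-backed struct. This is an illustration of the arithmetic (saved ÷ original), not the real OptimizationMetrics implementation.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Stand-in for the Prometheus-backed OptimizationMetrics, showing how the
// compression-ratio gauge relates to the counters. ASSUMPTION: the ratio is
// cumulative saved tokens over cumulative original tokens.

#[derive(Default)]
pub struct Metrics {
    pub optimizations_total: AtomicU64,
    pub tokens_saved_total: AtomicU64,
    pub tokens_original_total: AtomicU64,
}

impl Metrics {
    pub fn record(&self, original: u64, optimised: u64) {
        self.optimizations_total.fetch_add(1, Ordering::Relaxed);
        self.tokens_original_total.fetch_add(original, Ordering::Relaxed);
        self.tokens_saved_total
            .fetch_add(original.saturating_sub(optimised), Ordering::Relaxed);
    }

    /// Saved ÷ original across all recorded calls.
    pub fn compression_ratio(&self) -> f64 {
        let orig = self.tokens_original_total.load(Ordering::Relaxed);
        if orig == 0 {
            return 0.0; // nothing recorded yet
        }
        self.tokens_saved_total.load(Ordering::Relaxed) as f64 / orig as f64
    }
}
```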

Feature Flags

| Feature | Default | Description |
|---|---|---|
| pisovereign | off | Domain-type integration + TokenOptimizedInferencePort |
| hf-tokenizer | on | HuggingFace tokenizers for precise per-token counts |
| ollama | off | OllamaSummarizationAdapter HTTP-based LLM compaction |

Disable hf-tokenizer in resource-constrained environments to reduce compile times and binary size while still benefiting from the heuristic estimator.
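For intuition, a heuristic estimator in this spirit is the widely used "about four characters per token" rule of thumb for English text. The crate's actual heuristic may differ; this is only a sketch of the kind of estimate the fallback provides before hf-tokenizer replaces it with exact counts.

```rust
// Sketch of a character-count heuristic. ASSUMPTION: ~4 chars/token, a
// common rule of thumb for English; the crate's exact formula may differ.

pub fn estimate_tokens(text: &str) -> usize {
    // Round up so short non-empty strings never estimate to zero tokens.
    (text.chars().count() + 3) / 4
}
```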


Benchmarks

cargo bench -p ai_tokenopt
# HTML reports: target/criterion/report/index.html

| Benchmark Group | What it measures |
|---|---|
| token_estimation | Heuristic vs HF estimator throughput |
| budget_allocation | Window split across 5–200 messages |
| tool_compression | Schema compression + selection (5–50 tools) |
| history_compaction | Full pipeline with forced compaction |
| full_pipeline | End-to-end optimize_conversation[_with_tools] |

Testing

Unit tests live inline in each module (#[cfg(test)] mod tests). Integration tests are in crates/ai_tokenopt/tests/. Property-based tests use proptest.

cargo test -p ai_tokenopt             # unit + integration tests
cargo test -p ai_tokenopt --features pisovereign  # with domain types

Tip: When constructing ChatMessage::tool(...) in tests, use the make_tool_msg() / tool_msg() cfg-conditional helper defined in the test modules — the pisovereign feature changes the constructor signature.