# Token Optimization

The `ai_tokenopt` crate provides a full-spectrum, adaptive token optimization engine that
reduces token usage by 40–60% on typical conversations while preserving semantic fidelity.
It is context-window-aware, model-agnostic, and integrates seamlessly with the PiSovereign
inference pipeline.
## Architecture Overview

`ai_tokenopt` follows the same Hexagonal / Ports & Adapters pattern as the rest of the
codebase. It has no mandatory runtime dependencies — all external integrations (LLM
summarisation, HuggingFace tokenizer) are gated behind Cargo features or injected via
port traits.
```
┌──────────────────────────────────────────────────────────┐
│                    ai_tokenopt crate                     │
│                                                          │
│  TokenOptimizer ───────────────────────────────────────  │
│   │ 1. cross-turn RAG dedup                              │
│   │ 2. conciseness pressure injection                    │
│   │ 3. progressive tool compression                      │
│   │ 4. historical tool result truncation                 │
│   │ 5. system prompt trim                                │
│   │ 6. extractive history compaction                     │
│   │ 7. LLM summarisation (optional, async)               │
│   │                                                      │
│   ├── TokenEstimator (heuristic or HF tokenizer)         │
│   ├── TokenBudget (adaptive window allocation)           │
│   ├── HistoryCompactor                                   │
│   ├── TemplateLoader (runtime prompt overrides)          │
│   └── OptimizationMetrics (Prometheus)                   │
│                                                          │
│  integration surface:                                    │
│  ┌────────────────────────────────────────────────────┐  │
│  │ PiSovereign feature                                │  │
│  │ TokenOptimizedInferencePort (decorator)            │  │
│  │ domain types (ChatMessage, ToolDefinition …)       │  │
│  └────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────┘
```
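Because external integrations arrive through injected port traits, a summarisation backend can be swapped without touching the optimizer. The sketch below illustrates the idea; the trait name, signature, and the truncating test double are assumptions for illustration, not the crate's actual API:

```rust
// Hypothetical port trait; the crate's real summarisation port will differ.
trait SummarizationPort {
    /// Condense a block of conversation history into a short summary.
    fn summarize(&self, history: &str, max_tokens: usize) -> Result<String, String>;
}

/// A no-op adapter useful for tests: truncates instead of summarising.
struct TruncatingSummarizer;

impl SummarizationPort for TruncatingSummarizer {
    fn summarize(&self, history: &str, max_tokens: usize) -> Result<String, String> {
        // Rough heuristic: roughly 4 characters per token.
        let max_chars = max_tokens * 4;
        Ok(history.chars().take(max_chars).collect())
    }
}
```

A production adapter (for example, one calling an LLM over HTTP) implements the same trait, so the optimizer core stays dependency-free.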
## Impact-Ordered Pipeline

The optimizer runs strategies in descending impact order so that the cheapest, highest-gain operations always fire first.

```mermaid
flowchart TD
    A[Incoming conversation] --> B{Within budget?}
    B -- yes --> Z[Return unchanged]
    B -- no --> C[1. Cross-turn RAG dedup]
    C --> D[2. Conciseness pressure injection]
    D --> E[3. Progressive tool schema strip]
    E --> F[4. Historical tool result truncation]
    F --> G{Still over budget?}
    G -- no --> Z2[Return optimised]
    G -- yes --> H[5. System prompt trim]
    H --> I[6. Extractive history compaction]
    I --> J{LLM port provided?}
    J -- no --> Z2
    J -- yes --> K[7. LLM summarisation fallback]
    K --> Z2
```

Each fired step is recorded in `OptimizationResult.plan` with an estimated savings figure.
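The early-exit loop in the flowchart can be modeled in a few lines. This is a simplified stand-in, not the crate's real strategy or result types; the `Step` struct and `run_pipeline` function are illustrative:

```rust
// Simplified model of the impact-ordered pipeline; names are illustrative.
struct Step {
    name: &'static str,
    savings: u32, // estimated tokens this strategy recovers
}

/// Apply strategies in fixed impact order, stopping as soon as the
/// conversation fits the budget. Returns (final token count, fired steps).
fn run_pipeline(mut tokens: u32, budget: u32, steps: &[Step]) -> (u32, Vec<&'static str>) {
    let mut fired = Vec::new();
    for step in steps {
        if tokens <= budget {
            break; // within budget: later, more aggressive steps never fire
        }
        tokens = tokens.saturating_sub(step.savings);
        fired.push(step.name);
    }
    (tokens, fired)
}
```

The key property is that aggressive, lossy steps (compaction, summarisation) only run when the cheap ones were not enough.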
## Integration Guide

### Standalone (no PiSovereign types)

```rust
use ai_tokenopt::{TokenOptimizer, TokenOptimizationConfig};
use ai_tokenopt::types::{Conversation, ChatMessage};

let config = TokenOptimizationConfig {
    context_window_tokens: 8192,
    ..Default::default()
};
let optimizer = TokenOptimizer::new(config);

let conversation = Conversation::new(messages);
let result = optimizer.optimize_conversation(&conversation, None).await?;

// Inspect what fired
for step in &result.plan.steps {
    println!("{}: ~{} tokens saved", step.name, step.estimated_savings);
}

// Use the optimised conversation
send_to_llm(&result.conversation).await?;
```
### With PiSovereign (decorator pattern)

```rust
use ai_tokenopt::TokenOptimizedInferencePort;

// Wraps any InferencePort — optimization is transparent to callers
let optimized_port = TokenOptimizedInferencePort::new(
    inner_port,
    Arc::new(optimizer),
);
```
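Conceptually, the decorator optimises the input and then delegates to the wrapped port. A stripped-down, synchronous sketch of the pattern (the real `InferencePort` trait is async and richer; the trivial whitespace "optimization" here is a placeholder):

```rust
// Stripped-down sketch of the decorator pattern; the real InferencePort
// trait is async with a richer signature.
trait InferencePort {
    fn infer(&self, prompt: &str) -> String;
}

struct OptimizingPort<P: InferencePort> {
    inner: P,
}

impl<P: InferencePort> InferencePort for OptimizingPort<P> {
    fn infer(&self, prompt: &str) -> String {
        // "Optimise" first (here: trivially collapse runs of whitespace),
        // then delegate; callers never see the difference.
        let optimized = prompt.split_whitespace().collect::<Vec<_>>().join(" ");
        self.inner.infer(&optimized)
    }
}
```

Because the decorator implements the same trait it wraps, it can be layered into an existing pipeline without changing any call sites.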
### With Tools

```rust
let result = optimizer
    .optimize_conversation_with_tools(&conversation, &tools, None)
    .await?;

// result.tools contains the selected + compressed tool subset
send_with_tools(&result.conversation, &result.tools).await?;
```
## Configuration Reference

All fields are optional. Deserialise from the workspace `config.toml`:

```toml
[token_optimization]
enabled = true
context_window_tokens = 8192
response_headroom_ratio = 0.25
compaction_trigger_ratio = 0.70
max_summary_tokens = 256
system_prompt_budget_ratio = 0.15
rag_budget_ratio = 0.15
repetition_detection_enabled = true
repetition_ngram_size = 3
repetition_threshold = 0.3
max_tools_per_request = 8

# v2 enhancements
output_max_tokens = 512
frequency_penalty = 1.1
presence_penalty = 0.6
progressive_tool_compression = true
conciseness_pressure_threshold = 0.7
tool_result_max_tokens = 100
max_history_tokens = 4096
max_profile_prompt_tokens = 300
prompt_template_dir = "/etc/pisovereign/prompts"  # optional runtime overrides
tokenizer_model = "meta-llama/Llama-3.2-3B"       # optional; requires hf-tokenizer feature
```
### Key Parameters

| Parameter | Default | Effect |
|---|---|---|
| `context_window_tokens` | 8192 | Should match the model's `num_ctx`; auto-detected at startup |
| `compaction_trigger_ratio` | 0.70 | Compact when history exceeds this fraction of the available history budget |
| `conciseness_pressure_threshold` | 0.70 | Injects a brevity directive into the system prompt above this usage ratio |
| `tool_result_max_tokens` | 100 | Historical tool messages are truncated to this many tokens |
| `max_profile_prompt_tokens` | 300 | Token cap for agent profile sections in the system prompt |
| `prompt_template_dir` | none | Directory checked first for `<name>.prompt.txt` runtime overrides |
| `progressive_tool_compression` | true | Compresses tool schemas that have appeared in recent turns |
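To see how the ratio parameters interact, here is the budget arithmetic under one plausible interpretation: headroom, system prompt, and RAG each take a fixed fraction of the window, and history gets the remainder. The `Budget` struct and `allocate` function are illustrative; the crate's actual `TokenBudget` allocation may differ in detail:

```rust
// Illustrative budget split; the crate's real TokenBudget logic may differ.
struct Budget {
    response_headroom: u32,
    system_prompt: u32,
    rag: u32,
    history: u32,
}

fn allocate(context_window: u32, headroom_ratio: f64, sys_ratio: f64, rag_ratio: f64) -> Budget {
    let response_headroom = (context_window as f64 * headroom_ratio) as u32;
    let system_prompt = (context_window as f64 * sys_ratio) as u32;
    let rag = (context_window as f64 * rag_ratio) as u32;
    // Whatever remains is available for conversation history.
    let history = context_window - response_headroom - system_prompt - rag;
    Budget { response_headroom, system_prompt, rag, history }
}
```

With the defaults above (8192-token window, 0.25 headroom, 0.15 system prompt, 0.15 RAG), roughly 3.7k tokens remain for history, and `compaction_trigger_ratio` applies to that remainder.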
## Runtime Prompt Template Overrides

Built-in prompt templates (summarisation, conciseness directive, etc.) are compiled into
the binary from YAML. At runtime the `TemplateLoader` checks a configurable directory
first, enabling zero-downtime prompt tuning without a redeploy:

```bash
# Drop a custom summarisation prompt into the overrides directory
echo "Summarise the conversation in ≤3 bullet points." \
  > /etc/pisovereign/prompts/summarize.prompt.txt
```

Set `prompt_template_dir` in `config.toml` or via the `TOKEN_OPTIMIZATION__PROMPT_TEMPLATE_DIR`
environment variable to activate the override directory.
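The resolution order is a straightforward check-directory-first fallback. A minimal sketch, assuming the `<name>.prompt.txt` naming convention described above (the real `TemplateLoader` may also cache and validate templates):

```rust
use std::fs;
use std::path::Path;

// Sketch of check-overrides-first template resolution; the real
// TemplateLoader may also cache and validate loaded templates.
fn load_template(override_dir: Option<&Path>, name: &str, builtin: &str) -> String {
    if let Some(dir) = override_dir {
        let candidate = dir.join(format!("{name}.prompt.txt"));
        if let Ok(text) = fs::read_to_string(&candidate) {
            return text; // runtime override wins
        }
    }
    builtin.to_string() // fall back to the compiled-in template
}
```

A missing or unreadable override file silently falls back to the built-in template, so a bad deploy of the overrides directory cannot break inference.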
## Observability

### Tracing

All `optimize_conversation` and `optimize_conversation_with_tools` calls emit tracing
spans at the `info` level. Fields present on each span:

| Field | Description |
|---|---|
| `msgs` | Number of messages in the incoming conversation |
| `enabled` | Whether optimization is active for this call |
| `tools` | Number of candidate tools (`optimize_conversation_with_tools` only) |
### Prometheus Metrics

The `OptimizationMetrics` struct registers the following counters and gauges under the
`token_opt_` prefix (exposed via the `/metrics` endpoint):

| Metric | Type | Description |
|---|---|---|
| `token_opt_optimizations_total` | Counter | Total optimization calls |
| `token_opt_compactions_total` | Counter | Calls that triggered compaction |
| `token_opt_tokens_saved_total` | Counter | Cumulative estimated tokens saved |
| `token_opt_compression_ratio` | Gauge | Rolling compression ratio (saved ÷ original) |
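The saved ÷ original arithmetic behind the gauge can be shown with plain atomics. This is a sketch of the computation only: the counter names and the use of `AtomicU64` instead of the prometheus crate's types are assumptions for illustration:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Sketch of the saved ÷ original ratio, using plain atomics instead of
// the prometheus crate's counter and gauge types.
struct Metrics {
    tokens_original_total: AtomicU64,
    tokens_saved_total: AtomicU64,
}

impl Metrics {
    fn record(&self, original: u64, saved: u64) {
        self.tokens_original_total.fetch_add(original, Ordering::Relaxed);
        self.tokens_saved_total.fetch_add(saved, Ordering::Relaxed);
    }

    /// Rolling compression ratio: cumulative saved ÷ cumulative original.
    fn compression_ratio(&self) -> f64 {
        let original = self.tokens_original_total.load(Ordering::Relaxed);
        if original == 0 {
            return 0.0; // avoid division by zero before the first call
        }
        self.tokens_saved_total.load(Ordering::Relaxed) as f64 / original as f64
    }
}
```

A ratio of 0.5, for example, means the optimizer is estimated to have removed half of the tokens it was handed.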
## Feature Flags

| Feature | Default | Description |
|---|---|---|
| `pisovereign` | off | Domain-type integration + `TokenOptimizedInferencePort` |
| `hf-tokenizer` | on | HuggingFace tokenizers for precise per-token counts |
| `ollama` | off | `OllamaSummarizationAdapter` (HTTP-based LLM compaction) |

Disable `hf-tokenizer` in resource-constrained environments to reduce compile times and
binary size while still benefiting from the heuristic estimator.
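The heuristic estimator approximates token counts from text length; a common rule of thumb for English text is roughly four characters per token. A minimal sketch under that assumption (the crate's actual heuristic may be more refined):

```rust
// Minimal sketch of a character-based token estimate; the crate's
// heuristic estimator may use a more refined formula.
fn estimate_tokens(text: &str) -> usize {
    // ~4 characters per token is a common rule of thumb for English text.
    text.chars().count().div_ceil(4)
}
```

This is inexact per message, but cheap and dependency-free, which is why it makes a reasonable fallback when the HF tokenizer feature is disabled.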
## Benchmarks

```bash
cargo bench -p ai_tokenopt
# HTML reports: target/criterion/report/index.html
```

| Benchmark Group | What it measures |
|---|---|
| `token_estimation` | Heuristic vs HF estimator throughput |
| `budget_allocation` | Window split across 5–200 messages |
| `tool_compression` | Schema compression + selection (5–50 tools) |
| `history_compaction` | Full pipeline with forced compaction |
| `full_pipeline` | End-to-end `optimize_conversation[_with_tools]` |
## Testing

Unit tests live inline in each module (`#[cfg(test)] mod tests`). Integration tests are in
`crates/ai_tokenopt/tests/`. Property-based tests use `proptest`.

```bash
cargo test -p ai_tokenopt                          # unit + integration tests
cargo test -p ai_tokenopt --features pisovereign   # with domain types
```
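A representative invariant of the kind a property-based suite checks is that truncation never grows its input and is idempotent. Sketched below with a plain loop over fixed inputs and trivial stand-in logic, rather than proptest's generated data or the crate's real compactor:

```rust
// Trivial stand-in for tool-result truncation; not the crate's real compactor.
fn truncate_result(text: &str, max_chars: usize) -> String {
    text.chars().take(max_chars).collect()
}

/// Check two invariants over a set of inputs: output never exceeds the cap,
/// and truncating twice gives the same result as truncating once.
fn check_truncation_invariants(inputs: &[&str], max_chars: usize) -> bool {
    inputs.iter().all(|s| {
        let once = truncate_result(s, max_chars);
        let twice = truncate_result(&once, max_chars);
        once.chars().count() <= max_chars && once == twice
    })
}
```

With proptest, the same property would run against thousands of generated strings instead of a hand-picked list.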
> **Tip:** When constructing `ChatMessage::tool(...)` in tests, use the `make_tool_msg()` / `tool_msg()` cfg-conditional helper defined in the test modules — the `pisovereign` feature changes the constructor signature.