>_ DOCS / EXPLANATION

THE ROUTING
ENGINE.

Every request passes through three stages in order: semantic cache lookup, provider scoring, and execution with automatic failover. This page explains exactly how each stage works.

The Three Stages

Semantic Cache

The request is embedded using Google text-embedding-004 and compared against recent responses stored in Redis. If cosine similarity exceeds 0.92, the cached response is returned immediately — no LLM call, no cost.

Provider Scoring

All healthy providers are scored against your chosen routing mode (cost, speed, quality, or balanced). The top scorer is selected as the primary candidate. Two backup candidates are retained for failover.

Execution + Failover

The primary provider is called. On any failure — rate limit, timeout, 5xx — the router automatically retries with the next-ranked provider. This happens transparently; your client sees one clean response.

Scoring Algorithm

Each provider candidate receives a composite score between 0 and 1. The weights shift depending on the routing mode you specify. A provider that is down or degraded is penalized before any mode-specific scoring.

Scoring Factors

Factor	Source	Description
success_rate	DB: facilitator_health	7-day rolling success ratio. A provider at 99% beats one at 95%.
p95_settle_ms	DB: facilitator_health	95th-percentile latency in ms. Weighted heavily in speed mode.
cost_per_1k_tokens	lib/ai-providers/registry.ts	Input + output token price. Primary factor in cost mode.
reputation_score	ERC-8004 on-chain	On-chain reputation from ERC-8004 registry. Normalized 0–1.
health_status	Live health probe	healthy=1.0, degraded=0.5, down=0. Applied as a multiplier.

Weight Distribution by Mode

Mode	Cost	Speed	Quality	Reliability
cost	70%	10%	10%	10%
speed	10%	70%	10%	10%
quality	10%	10%	70%	10%
balanced	25%	25%	25%	25%

Semantic Cache Detail

The cache is tenant-scoped — one tenant's responses never leak to another. The similarity threshold of 0.92 is the empirically calibrated default: high enough to block hallucination-risk false positives, low enough to catch paraphrased duplicates.

Embedding model

text-embedding-004 (Google)

Similarity metric

Cosine similarity

Default threshold

0.92 (configurable per request)

Storage backend

Redis with vector index

Scope

Per-tenant (never shared across accounts)

Cache hit latency

< 50ms (vs 1–10s for LLM call)

Cache hit cost

$0.00

TTL

24 hours (default)

Opt out of caching

To bypass the cache for a specific request (e.g., real-time data queries), set "cache": false in the p402 object. The request will still be routed normally but the response will not be stored.

Automatic Failover

The router holds a ranked list of three provider candidates for every request. If the top-ranked provider fails for any reason, the router immediately retries with the next candidate — no delay, no error surfaced to your client.

HTTP 429 (rate limit)

Immediate retry with candidate #2

HTTP 5xx (server error)

Immediate retry with candidate #2

Connection timeout (> 30s)

Abort + retry with candidate #2

All 3 candidates fail

Return HTTP 503 with PROVIDER_UNAVAILABLE

Budget exceeded mid-stream

Return HTTP 403 with BUDGET_EXCEEDED; no retry

Failover is transparent

The p402_metadata.provider field in the response tells you which provider actually served the request. If failover occurred, this will differ from what you might expect based on your mode. You can use this to diagnose provider degradation.

OpenRouter Meta-Provider

P402 treats OpenRouter as a single provider that proxies 300+ models. When your routing mode selects OpenRouter, the model within OpenRouter is further optimized based on your mode. This means you get access to every new frontier model the moment OpenRouter adds it — no adapter changes required on your end.

Model freshness

GPT-4.1, Claude 4.5, Gemini 2.0 Pro available the day they land on OpenRouter.

Unified billing

One OPENROUTER_API_KEY covers all 300+ models. P402 adds a transparent 1% routing fee on top.

Fallback depth

If your primary direct provider (e.g. Anthropic) fails, OpenRouter serves as a deep fallback pool.

Model pinning

Force a specific model with "model": "anthropic/claude-opus-4" in configuration.

Per-Request Configuration

The p402 object in your request body controls routing behavior for that request. All fields are optional; omitting them uses account defaults.

json — p402 configuration object

{
  "messages": [...],
  "p402": {
    "mode": "cost",          // "cost" | "speed" | "quality" | "balanced"
    "cache": true,           // true = use semantic cache (default: true)
    "session_id": "ses_...", // budget-capped session (optional)
    "max_cost_usd": 0.01,    // hard ceiling per request (optional)
    "provider": "anthropic", // force a specific provider (optional)
    "model": "claude-opus-4" // force a specific model (optional)
  }
}

mode

Selects the weight profile for provider scoring. Defaults to "balanced" if not set.

cache

Set false to skip semantic cache lookup and storage for this request. Useful for real-time or personalized responses.

session_id

Attaches this request to a budget-capped session. Requests that would exceed the session budget return HTTP 403 before the LLM is called.

max_cost_usd

A per-request cost ceiling. If the cheapest available provider exceeds this, you get COST_LIMIT_EXCEEDED rather than a surprise charge.

provider

Bypasses scoring entirely and routes to this specific provider. Use for compliance requirements or testing. Falls back to normal routing if the provider is down.

model

Forces a specific model. The provider that serves this model is selected automatically unless you also specify provider.

Routing Decision in the Response

Every response includes a p402_metadata field that tells you exactly what happened: which provider was selected, what it cost, and whether the response came from cache.

json — p402_metadata

{
  "p402_metadata": {
    "provider": "deepseek",        // Provider that served the request
    "model": "deepseek-v3",        // Model used
    "cost_usd": 0.0003,            // What you were charged
    "direct_cost": 0.0031,         // What GPT-4o would have cost
    "savings": 0.0028,             // Savings from intelligent routing
    "input_tokens": 24,
    "output_tokens": 187,
    "cached": false,               // true = served from semantic cache
    "latency_ms": 1240,            // Time from request to first token
    "mode": "cost",                // Mode used for this request
    "failover": false              // true = primary provider failed, used fallback
  }
}

Related docs

THE ROUTINGENGINE.