One endpoint.
300+ models.
P402 is an OpenAI-compatible API that routes each request to the optimal model for your cost and quality target — automatically, continuously, across 300+ providers. Every response includes full cost metadata. No vendor lock-in.
One line to switch.
All other code unchanged. Same interface. 300+ models.
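As a sketch of the swap (the P402 base URL below is a placeholder, not a documented value, and the "auto" model sentinel is an assumption), only the client's base URL changes; the request body stays standard OpenAI-compatible chat-completions JSON:

```python
import json

# Before: BASE_URL = "https://api.openai.com/v1"
# After (hypothetical P402 endpoint; substitute the real one):
BASE_URL = "https://api.p402.example/v1"

# The payload is unchanged OpenAI-compatible JSON.
payload = {
    "model": "auto",  # assumed sentinel letting P402 pick the model
    "messages": [{"role": "user", "content": "Classify this ticket."}],
}

body = json.dumps(payload)
print(BASE_URL, body)
```

Everything downstream of the URL swap, including streaming and tool calls, keeps the same interface.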
Four modes. One field.
Set p402.mode on any request. Provider selection, failover, and re-ranking happen automatically.
Minimize spend per token.
Scores providers by effective cost per 1k tokens. Prefers semantic cache hits. Falls back through DeepSeek → Haiku → Flash.
Models: deepseek-chat, claude-haiku-4-6, gemini-3.1-flash. Best for: batch summarization, classification, structured extraction.
Maximize capability regardless of cost.
Routes to frontier models with the highest benchmark scores for the task type. Reasoning and multi-step tasks use the top-ranked model available.
Models: claude-opus-4-6, gpt-5.4, gemini-3.1-ultra. Best for: code audits, legal analysis, complex reasoning chains.
Minimize time-to-first-token.
Prefers models on dedicated inference hardware. Targets sub-200ms TTFB on short prompts. Flash and Groq LPU endpoints are prioritised.
Models: gemini-3.1-flash, claude-haiku-4-6, llama-3.3-70b-versatile. Best for: user-facing chat, real-time agents, low-latency pipelines.
50/50 cost and capability score (default).
Weighs cost efficiency against benchmark quality. The weighting is recalibrated hourly by the intelligence layer based on live routing outcomes.
Models: claude-sonnet-4-6, gpt-5.4, gemini-3.1-pro. Best for: general-purpose agents, production workloads without a hard cost target.
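The four modes above map onto a single request field. A minimal sketch, assuming the mode strings are "cost", "quality", "latency", and "balanced" (the section names the behaviours but not the exact values):

```python
import json

# Hypothetical mode values; the exact strings are assumptions.
MODES = {"cost", "quality", "latency", "balanced"}

def with_mode(payload: dict, mode: str) -> dict:
    """Return a copy of an OpenAI-style payload with p402.mode set."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    out = dict(payload)
    out["p402"] = {"mode": mode}
    return out

req = with_mode({"model": "auto", "messages": []}, "latency")
print(json.dumps(req))
```

Omitting the field falls back to the default 50/50 balanced mode, so existing requests need no change.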
Full cost metadata on every response.
Every completion returns a p402_metadata block with the actual cost, provider selected, savings vs. direct access, tokens, and latency.
cost_usd vs. direct_cost. Shows exactly what you saved on this call compared with going direct to the provider. Compound this across millions of requests.
cached: true. A semantic cache hit costs nothing and returns in <50ms. The cache serves responses when a new prompt is >92% similar to a previous one.
savings. Equals direct_cost minus cost_usd. Use this field to build live savings dashboards or per-session cost attribution.
provider_latency_ms. Separates P402 overhead from provider latency. P402 typically adds <10ms to the critical path.
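Reading the block is plain JSON work. A sketch with made-up values (field names come from the section above; the numbers are illustrative only):

```python
# Illustrative response; values are fabricated for the example.
resp = {
    "p402_metadata": {
        "cost_usd": 0.0012,
        "direct_cost": 0.0060,
        "savings": 0.0048,
        "cached": False,
        "provider_latency_ms": 310,
    }
}

meta = resp["p402_metadata"]
# savings is defined as direct_cost minus cost_usd.
assert abs(meta["savings"] - (meta["direct_cost"] - meta["cost_usd"])) < 1e-12
pct_saved = meta["savings"] / meta["direct_cost"] * 100
print(f"saved {pct_saved:.0f}% on this call")  # saved 80% on this call
```

Summing savings per session or per tenant is how a live savings dashboard would be built on top of this field.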
Repeated queries cost nothing.
Every prompt is embedded and compared against the cache. If an incoming request is more than 92% similar to a cached response, P402 returns it instantly — no provider call, no cost, sub-50ms latency.
A vector cosine similarity of at least 92% is required before a cache hit is returned.
Response served from Redis. No model inference, no on-chain settlement.
No provider call means no cost. Platform fee is also waived on cache hits.
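The lookup logic can be sketched in a few lines. This is a minimal illustration of the threshold check only; real embeddings, the embedding model, and the Redis storage layer are omitted:

```python
import math

SIMILARITY_THRESHOLD = 0.92  # >92% similar, per the section above

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def cache_lookup(query_vec, cache):
    """Return the cached response that clears the threshold, else None."""
    best, best_sim = None, SIMILARITY_THRESHOLD
    for vec, response in cache:
        sim = cosine(query_vec, vec)
        if sim > best_sim:  # strict: must exceed the 92% bar
            best, best_sim = response, sim
    return best

cache = [([1.0, 0.0], "answer A"), ([0.0, 1.0], "answer B")]
print(cache_lookup([0.99, 0.05], cache))  # near-duplicate prompt: hit
print(cache_lookup([0.70, 0.70], cache))  # dissimilar prompt: miss
```

A miss falls through to normal routing; a hit short-circuits before any provider is called, which is why both the provider cost and the platform fee are zero.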
Set p402: { cache: true } on any request. Check p402_metadata.cached in the response to confirm a hit.

Six layers of spending protection.
Every request passes six enforcement layers before a model is called. Layers 1–4 are Redis-backed and fail-open if Redis is unavailable. Layer 5 is always enforced. Layer 6 uses atomic budget reservation.
Layer 1 (rate limit): 1,000 req/hr per tenant. Redis token bucket.
Layer 2 (daily cap): $1,000/day hard cap. Resets at UTC midnight.
Layer 3 (concurrency): max 10 simultaneous in-flight requests per tenant.
Layer 4 (cost anomaly): flags requests 3.0 or more standard deviations above the tenant's cost history. Soft block: logs and alerts without hard-stopping.
Layer 5 (per-request cap): $50 max per single request. Always enforced, fail-closed regardless of Redis availability.
Layer 6 (budget reservation): atomic Redis lock before dispatch, 5-minute TTL. Finalises actual spend on completion.
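The first layer is a classic token bucket. A sketch of the idea, with an in-memory dict standing in for the Redis backend and the fail-open behaviour omitted:

```python
import time

class TokenBucket:
    """Sketch of the per-tenant rate limit: capacity tokens, steady refill.
    The production version is Redis-backed and fails open if Redis is down."""

    def __init__(self, capacity=1000, refill_per_sec=1000 / 3600):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Refill based on elapsed time, then try to spend one token."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=3, refill_per_sec=0.0)  # tiny bucket for demo
print([bucket.allow() for _ in range(4)])  # [True, True, True, False]
```

With the default parameters, a tenant that stays under 1,000 requests per hour never sees a rejection; bursts above capacity are refused until tokens refill.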
Blocked requests return an error code and a retry_after_ms hint on rate-limit responses. See Controls → for mandate and policy-level spend governance.

Start routing in minutes.
Free tier. No credit card. One URL swap.