
One endpoint.
300+ models.

P402 is an OpenAI-compatible API that routes each request to the optimal model for your cost and quality target — automatically and continuously, across 300+ models. Every response includes full cost metadata. No vendor lock-in.

Integration

One line to switch.

Before
const openai = new OpenAI({
  baseURL: 'https://api.openai.com/v1'
});

After
const openai = new OpenAI({
  baseURL: 'https://p402.io/api/v2',
  apiKey: process.env.P402_API_KEY
});

All other code unchanged. Same interface. 300+ models.

Routing Modes

Four modes. One field.

Set p402.mode on any request. Provider selection, failover, and re-ranking happen automatically.

Cost

Minimize spend per token.

Scores providers by effective cost per 1k tokens. Prefers semantic cache hits. Falls back through DeepSeek → Haiku → Flash.

deepseek-chat · claude-haiku-4-6 · gemini-3.1-flash

Batch summarization, classification, structured extraction.
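As a rough sketch of what cost-mode selection implies, the chain can be modelled as cheapest-first with health-based fallthrough. The types, prices, and health flags below are illustrative assumptions, not the router's internals:

```typescript
// Hypothetical sketch of cost-mode failover: sort providers by effective
// cost, then take the first healthy one. Prices here are invented.
type Provider = { model: string; costPer1k: number; healthy: boolean };

const chain: Provider[] = [
  { model: 'deepseek-chat',    costPer1k: 0.14, healthy: false }, // assumed price; marked down
  { model: 'claude-haiku-4-6', costPer1k: 0.25, healthy: true },  // assumed price
  { model: 'gemini-3.1-flash', costPer1k: 0.30, healthy: true },  // assumed price
];

function pickCostProvider(providers: Provider[]): Provider | undefined {
  return [...providers]
    .sort((a, b) => a.costPer1k - b.costPer1k)
    .find((p) => p.healthy);
}
```

Here deepseek-chat is marked unhealthy, so the request falls through to claude-haiku-4-6, mirroring the DeepSeek → Haiku → Flash chain above.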

Quality

Maximize capability regardless of cost.

Routes to frontier models with the highest benchmark scores for the task type. Reasoning and multi-step tasks use the top-ranked model available.

claude-opus-4-6 · gpt-5.4 · gemini-3.1-ultra

Code audits, legal analysis, complex reasoning chains.

Speed

Minimize time-to-first-token.

Prefers models on dedicated inference hardware. Targets sub-200ms TTFB on short prompts. Flash and Groq LPU endpoints are prioritised.

gemini-3.1-flash · claude-haiku-4-6 · llama-3.3-70b-versatile

User-facing chat, real-time agents, low-latency pipelines.

Balanced

50/50 cost and capability score (default).

Weighs cost efficiency against benchmark quality. The intelligence layer re-weights the scoring hourly based on live routing outcomes.

claude-sonnet-4-6 · gpt-5.4 · gemini-3.1-pro

General-purpose agents, production workloads without a hard cost target.

// Set mode per-request via the p402 extension field
await openai.chat.completions.create({
  model: 'gpt-5.4', // overridden by router
  messages,
  // @ts-ignore — p402 extension field
  p402: {
    mode: 'cost', // cost | quality | speed | balanced
    cache: true,
    session_id: 'sess_xyz',
  }
});
Response

Full cost metadata on every response.

Every completion returns a p402_metadata block with the actual cost, provider selected, savings vs. direct access, tokens, and latency.

p402_metadata shape
request_id: string
tenant_id: string
provider: string
model: string // actual model used
cost_usd: number // what P402 charged
direct_cost: number // list price at provider
savings: number // direct_cost - cost_usd
input_tokens: number
output_tokens: number
latency_ms: number
provider_latency_ms: number
cached: boolean
routing_mode: string
Why it matters
cost_usd vs direct_cost

Shows exactly what you saved on this call vs. going direct to the provider. Compound this across millions of requests.

cached: true

A semantic cache hit costs nothing and returns in <50ms. The cache serves responses when a new prompt is >92% similar to a previous one.

savings

direct_cost minus cost_usd. Use this field to build live savings dashboards or per-session cost attribution.

provider_latency_ms

Separates P402 overhead from provider latency. P402 typically adds <10ms to the critical path.
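Both numbers discussed above can be derived straight from the metadata fields. A minimal sketch, with invented sample values but field names matching the shape above:

```typescript
// Illustrative only: derive savings totals and P402 overhead from
// p402_metadata-shaped objects. Sample values are invented.
type Meta = { savings: number; latency_ms: number; provider_latency_ms: number };

// P402 overhead on the critical path: total latency minus provider latency.
const p402OverheadMs = (m: Meta) => m.latency_ms - m.provider_latency_ms;

// Aggregate the savings field across calls for a live savings dashboard.
const totalSavings = (rows: Meta[]) =>
  rows.reduce((sum, r) => sum + r.savings, 0);

const calls: Meta[] = [
  { savings: 0.0003, latency_ms: 640, provider_latency_ms: 631 },
  { savings: 0.0002, latency_ms: 812, provider_latency_ms: 804 },
];
```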

Semantic Cache

Repeated queries cost nothing.

Every prompt is embedded and compared against the cache. If an incoming request is more than 92% similar to a cached response, P402 returns it instantly — no provider call, no cost, sub-50ms latency.

>92%
Similarity threshold

Vector cosine similarity required before a cache hit is returned.
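For intuition, the gate can be sketched as plain cosine similarity over embedding vectors with a 0.92 cutoff. The actual embedding model and index are P402 internals; this only illustrates the threshold:

```typescript
// Cosine similarity between two embedding vectors, gated at 0.92,
// as a sketch of the cache-hit test described above.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

const isCacheHit = (a: number[], b: number[]) => cosine(a, b) > 0.92;
```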

<50ms
Cache hit latency

Response served from Redis. No model inference, no on-chain settlement.

$0.00
Cache hit cost

No provider call means no cost. Platform fee is also waived on cache hits.

Enable with p402: { cache: true } on any request. Check p402_metadata.cached in the response to confirm a hit.
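A minimal sketch of client-side cost attribution under these rules; the helper name is ours, not SDK API:

```typescript
// Illustrative: a cache hit carries zero provider cost and the platform
// fee is waived, so the effective cost of a cached response is $0.00.
function effectiveCost(meta: { cached: boolean; cost_usd: number }): number {
  return meta.cached ? 0 : meta.cost_usd;
}
```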
Billing Guard

Six layers of spending protection.

Every request passes six enforcement layers before a model is called. Layers 1–4 are Redis-backed and fail-open if Redis is unavailable. Layer 5 is always enforced. Layer 6 uses atomic budget reservation.

01
Rate limit

1,000 req / hr per tenant. Redis token bucket.
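The semantics can be sketched as an in-memory token bucket. The production limiter is Redis-backed; this sketch only mirrors the documented 1,000 req/hr behaviour:

```typescript
// Minimal token bucket: starts full, refills at a constant rate,
// rejects once tokens are exhausted.
class TokenBucket {
  private tokens: number;
  constructor(
    private capacity: number,
    private refillPerMs: number,
    private now = 0,
  ) {
    this.tokens = capacity;
  }
  tryTake(at: number): boolean {
    this.tokens = Math.min(this.capacity, this.tokens + (at - this.now) * this.refillPerMs);
    this.now = at;
    if (this.tokens >= 1) { this.tokens -= 1; return true; }
    return false;
  }
}

// 1,000 requests per hour => refill rate of 1000 / 3,600,000 tokens per ms.
const bucket = new TokenBucket(1000, 1000 / 3_600_000);
```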

02
Daily circuit breaker

$1,000 / day hard cap. Resets at UTC midnight.

03
Concurrency cap

Max 10 simultaneous in-flight requests per tenant.

04
Anomaly detection

Z-score ≥ 3.0σ above cost history. Soft block — logs and alerts without hard-stopping.
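The documented rule can be sketched as a plain z-score check over a tenant's cost history. The real detector, its history window, and its alerting are internals; this only shows the 3.0σ threshold:

```typescript
// Flag a request whose cost sits 3+ standard deviations above the
// tenant's cost history. The real layer soft-blocks (logs and alerts),
// so this sketch only returns a boolean.
function isAnomalous(history: number[], cost: number, sigma = 3.0): boolean {
  const mean = history.reduce((s, x) => s + x, 0) / history.length;
  const variance = history.reduce((s, x) => s + (x - mean) ** 2, 0) / history.length;
  const std = Math.sqrt(variance);
  if (std === 0) return cost > mean; // degenerate history: any increase flags
  return (cost - mean) / std >= sigma;
}
```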

05
Per-request cap

$50 max per single request. Always enforced, fail-closed regardless of Redis availability.

06
Budget reservation

Atomic Redis lock before dispatch. 5-minute TTL. Finalises actual spend on completion.

Billing Guard errors return structured JSON with a code and a retry_after_ms hint on rate limit responses. See Controls → for mandate and policy-level spend governance.
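A sketch of client-side retry handling for those errors. The shape ({ code, retry_after_ms }) follows the description above; the exact code values and the backoff policy are our assumptions:

```typescript
// Prefer the server's retry_after_ms hint when present; otherwise fall
// back to exponential backoff capped at 30 seconds. Illustrative only.
type GuardError = { code: string; retry_after_ms?: number };

function retryDelayMs(err: GuardError, attempt: number): number {
  if (err.retry_after_ms != null) return err.retry_after_ms;
  return Math.min(30_000, 1_000 * 2 ** attempt);
}
```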

Start routing in minutes.

Free tier. No credit card. One URL swap.