Docs / How-To Guides

# Choose a Routing Mode
P402 routes each request to the provider that best matches your priority: cost, quality, speed, or a balanced blend of all three.

## How to Set a Mode

Pass `mode` in the `p402` extension block. It applies to that request only; omitting it defaults to `cost`.

```bash
curl -s -X POST https://p402.io/api/v2/chat/completions \
  -H "Authorization: Bearer $P402_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Review this contract clause..."}],
    "p402": {
      "mode": "quality",
      "cache": true,
      "session_id": "sess_01jx..."
    }
  }' | jq .
```
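The same request can be sketched in Python. The endpoint URL and the `p402` extension block come from this guide; the helper name, the placeholder key, and the choice to return headers and body (rather than send the request) are illustrative, not part of the API:

```python
import json

# Sketch: build the same request body as the curl example above.
P402_URL = "https://p402.io/api/v2/chat/completions"

def build_chat_request(api_key, messages, mode="cost", cache=False, session_id=None):
    """Return (headers, json_body) for a P402 chat completion call."""
    p402 = {"mode": mode, "cache": cache}
    if session_id is not None:
        p402["session_id"] = session_id
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    return headers, json.dumps({"messages": messages, "p402": p402})

headers, body = build_chat_request(
    "p402-example-key",  # placeholder, not a real key
    [{"role": "user", "content": "Review this contract clause..."}],
    mode="quality",
    cache=True,
)
print(json.loads(body)["p402"]["mode"])  # → quality
```

Any HTTP client can then POST `body` with `headers` to `P402_URL`.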

## The Four Modes

### `cost` (default)

Selects the provider with the lowest per-token price for the request's complexity. Combined with semantic caching, repeated queries cost $0.

**Best for:**

- High-volume batch processing
- Classification and extraction
- Summarisation pipelines
- Budget agents with hard caps

**Avoid for:**

- Legal or medical analysis requiring highest accuracy
- Nuanced creative writing

**Typical savings:** 70–95% vs direct GPT-4o
**Example provider:** DeepSeek V3 (≈ $0.27/M tokens)
### `quality`

Routes to the provider with the highest ELO benchmark score for the task type. Prefers frontier models: GPT-4o, Claude Opus, Gemini 1.5 Pro.

**Best for:**

- Complex reasoning and math
- Code generation and review
- Legal/medical/financial analysis
- Customer-facing chat that must be right

**Avoid for:**

- High-volume low-stakes tasks (use `cost` mode)

**Typical savings:** 10–30% vs paying OpenAI directly
**Example provider:** GPT-4o or Claude Opus (auto-selected)
### `speed`

Selects the provider with the lowest current p95 latency. Prefers inference-optimised providers: Groq, Together, Fireworks.

**Best for:**

- Real-time chat interfaces
- Streaming user-facing responses
- Time-sensitive agentic loops
- Interactive coding assistants

**Avoid for:**

- Batch jobs where latency doesn't matter

**Typical speedup:** up to 10× faster than standard providers
**Example provider:** Groq (Llama 3 70B, ≈ 300 tok/s)
### `balanced`

Equal (1/3) weight on cost, quality, and latency scores. A good default for mixed workloads where no single factor dominates.

**Best for:**

- General-purpose agents
- Mixed workloads
- Prototyping before tuning

**Avoid for:**

- Workloads with a clear primary constraint (use the specific mode instead)

**Typical savings:** 40–60% vs direct GPT-4o
**Example provider:** Varies by request
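As a rough illustration of the equal-weight blend, the sketch below averages three normalised scores per provider and picks the highest. The provider names, scores, and normalisation are invented for illustration; P402's actual scoring inputs are not documented in this guide:

```python
def balanced_score(cost_score, quality_score, speed_score):
    """Equal 1/3 weight on each normalised score (higher is better)."""
    return (cost_score + quality_score + speed_score) / 3

# Invented example scores, each in [0, 1].
providers = {
    "frontier-model": (0.2, 1.0, 0.5),  # pricey, best quality, middling speed
    "budget-model":   (1.0, 0.6, 0.7),  # cheapest, decent quality
    "fast-model":     (0.7, 0.7, 1.0),  # good all-rounder, fastest
}

best = max(providers, key=lambda name: balanced_score(*providers[name]))
print(best)  # → fast-model
```

A workload with one dominant constraint would instead put all the weight on that axis, which is exactly what the dedicated modes do.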

## Quick Decision Guide

| If your workload is… | Use mode |
| --- | --- |
| Processing 10,000 documents overnight | `cost` |
| Answering complex legal or medical questions | `quality` |
| Streaming real-time chat to a user | `speed` |
| General-purpose assistant with no specific constraint | `balanced` |
| Classifying sentiment at scale | `cost` |
| Generating production code that will be deployed | `quality` |
| Autocomplete with < 300 ms target latency | `speed` |
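The table above can be encoded as a simple lookup in client code. The workload labels and the helper are hypothetical conveniences, not part of P402:

```python
# Hypothetical mapping of workload categories (from the table above)
# to routing modes; the category keys are invented for illustration.
WORKLOAD_MODES = {
    "overnight_batch": "cost",
    "legal_medical_qa": "quality",
    "realtime_chat": "speed",
    "general_assistant": "balanced",
    "sentiment_at_scale": "cost",
    "production_codegen": "quality",
    "autocomplete": "speed",
}

def mode_for(workload: str, default: str = "balanced") -> str:
    """Fall back to balanced when no single constraint dominates."""
    return WORKLOAD_MODES.get(workload, default)

print(mode_for("autocomplete"))      # → speed
print(mode_for("unlisted_workload")) # → balanced
```

The chosen string then goes straight into the `p402.mode` field of the request.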