Docs / How-To Guides

# Choose a Routing Mode
P402 routes each request to the provider that best matches your priority: cost, quality, speed, or a balanced blend of all three.

## How to Set a Mode

Pass `mode` in the `p402` extension block. It applies to that request only; omitting it defaults to `cost`.

```bash
curl -s -X POST https://p402.io/api/v2/chat/completions \
  -H "Authorization: Bearer $P402_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Review this contract clause..."}],
    "p402": {
      "mode": "quality",
      "cache": true,
      "session_id": "sess_01jx..."
    }
  }' | jq .
```
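The same request can be sketched in Python. The endpoint URL and the `p402` extension block come from this guide; the helper name, the placeholder key, and the choice to return headers and body (rather than send the request) are illustrative, not part of the API:

```python
import json

# Sketch: build the same request body as the curl example above.
P402_URL = "https://p402.io/api/v2/chat/completions"

def build_chat_request(api_key, messages, mode="cost", cache=False, session_id=None):
    """Return (headers, json_body) for a P402 chat completion call."""
    p402 = {"mode": mode, "cache": cache}
    if session_id is not None:
        p402["session_id"] = session_id
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    return headers, json.dumps({"messages": messages, "p402": p402})

headers, body = build_chat_request(
    "p402-example-key",  # placeholder, not a real key
    [{"role": "user", "content": "Review this contract clause..."}],
    mode="quality",
    cache=True,
)
print(json.loads(body)["p402"]["mode"])  # → quality
```

Any HTTP client can then POST `body` with `headers` to `P402_URL`.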

## The Four Modes

### `cost` (default)

Selects the provider with the lowest per-token price for the request's complexity. Combined with semantic caching, repeated queries cost $0.

**Best for:**

- High-volume batch processing
- Classification and extraction
- Summarisation pipelines
- Budget agents with hard caps

**Avoid for:**

- Legal or medical analysis requiring highest accuracy
- Nuanced creative writing

**Typical savings:** 70–95% vs direct GPT-4o
**Example provider:** DeepSeek V3 (≈ $0.27/M tokens)
### `quality`

Routes to the provider with the highest ELO benchmark score for the task type. Prefers frontier models: GPT-4o, Claude Opus, Gemini 1.5 Pro.

**Best for:**

- Complex reasoning and math
- Code generation and review
- Legal/medical/financial analysis
- Customer-facing chat that must be right

**Avoid for:**

- High-volume low-stakes tasks (use `cost` mode)

**Typical savings:** 10–30% vs paying OpenAI directly
**Example provider:** GPT-4o or Claude Opus (auto-selected)
### `speed`

Selects the provider with the lowest current p95 latency. Prefers inference-optimised providers: Groq, Together, Fireworks.

**Best for:**

- Real-time chat interfaces
- Streaming user-facing responses
- Time-sensitive agentic loops
- Interactive coding assistants

**Avoid for:**

- Batch jobs where latency doesn't matter

**Typical speedup:** up to 10× faster than standard providers
**Example provider:** Groq (Llama 3 70B, ≈ 300 tok/s)
### `balanced`

Equal (1/3) weight on cost, quality, and latency scores. A good default for mixed workloads where no single factor dominates.

**Best for:**

- General-purpose agents
- Mixed workloads
- Prototyping before tuning

**Avoid for:**

- Workloads with a clear primary constraint (use the specific mode instead)

**Typical savings:** 40–60% vs direct GPT-4o
**Example provider:** Varies by request
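As a rough illustration of the equal-weight blend, the sketch below averages three normalised scores per provider and picks the highest. The provider names, scores, and normalisation are invented for illustration; P402's actual scoring inputs are not documented in this guide:

```python
def balanced_score(cost_score, quality_score, speed_score):
    """Equal 1/3 weight on each normalised score (higher is better)."""
    return (cost_score + quality_score + speed_score) / 3

# Invented example scores, each in [0, 1].
providers = {
    "frontier-model": (0.2, 1.0, 0.5),  # pricey, best quality, middling speed
    "budget-model":   (1.0, 0.6, 0.7),  # cheapest, decent quality
    "fast-model":     (0.7, 0.7, 1.0),  # good all-rounder, fastest
}

best = max(providers, key=lambda name: balanced_score(*providers[name]))
print(best)  # → fast-model
```

A workload with one dominant constraint would instead put all the weight on that axis, which is exactly what the dedicated modes do.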

## Quick Decision Guide

| If your workload is… | Use mode |
| --- | --- |
| Processing 10,000 documents overnight | `cost` |
| Answering complex legal or medical questions | `quality` |
| Streaming real-time chat to a user | `speed` |
| General-purpose assistant with no specific constraint | `balanced` |
| Classifying sentiment at scale | `cost` |
| Generating production code that will be deployed | `quality` |
| Autocomplete with < 300 ms target latency | `speed` |
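The table above can be encoded as a simple lookup in client code. The workload labels and the helper are hypothetical conveniences, not part of P402:

```python
# Hypothetical mapping of workload categories (from the table above)
# to routing modes; the category keys are invented for illustration.
WORKLOAD_MODES = {
    "overnight_batch": "cost",
    "legal_medical_qa": "quality",
    "realtime_chat": "speed",
    "general_assistant": "balanced",
    "sentiment_at_scale": "cost",
    "production_codegen": "quality",
    "autocomplete": "speed",
}

def mode_for(workload: str, default: str = "balanced") -> str:
    """Fall back to balanced when no single constraint dominates."""
    return WORKLOAD_MODES.get(workload, default)

print(mode_for("autocomplete"))      # → speed
print(mode_for("unlisted_workload")) # → balanced
```

The chosen string then goes straight into the `p402.mode` field of the request.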