The AI Runtime Field Lab

Field Briefs

AI Infrastructure | Advanced | Optimization | Middleware | Open Draft, pending editorial review

Right-Size Every Call: An AI Cost and Latency Router

Build middleware that routes each request between small and large models by predicted complexity, measuring the tradeoff across cost, latency, and quality.

Why this matters

Sending every request to the largest model is a common source of avoidable AI spend, but naively routing to a smaller model degrades quality on the requests that needed the large one. The value lies in a router that protects a quality floor while cutting cost and latency, and that proves the tradeoff with numbers.

Persona

Platform engineer running an LLM feature at scale

Current manual workflow

Every request goes to one large model regardless of difficulty. Cost and latency are reviewed monthly, after the bill arrives, and there is no per-request complexity signal.

The AI workflow to build

The middleware scores each request for predicted complexity, routes simple requests to a small model and hard ones to a large model, and falls back to the large model when a small-model output fails a quality check. It logs cost, latency, and a quality score per request and reports the aggregate tradeoff against an all-large baseline.
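The routing loop described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `Model`, `RequestLog`, `route_request`, the flat `cost_per_call` pricing, and the `threshold` and `min_quality` defaults are all hypothetical names and values chosen for the example, and the complexity scorer and quality check are passed in as plain callables.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Model:
    name: str
    cost_per_call: float              # assumed flat price per call, for illustration
    generate: Callable[[str], str]    # stand-in for a real model client


@dataclass
class RequestLog:
    route: str        # model chosen on the first pass
    escalated: bool   # True if the small-model output failed the quality check
    cost: float       # total cost, including the escalation call if any
    latency_s: float  # wall-clock time, including the escalation call if any
    quality: float    # score of the output that is actually returned


def route_request(request, small, large, score_complexity, quality_check,
                  threshold=0.5, min_quality=0.7):
    """Route one request, escalating to the large model on a failed check."""
    start = time.perf_counter()
    # Simple requests go to the small model, hard ones to the large model.
    first = small if score_complexity(request) < threshold else large
    output = first.generate(request)
    cost = first.cost_per_call
    quality = quality_check(request, output)
    # Fall back to the large model rather than shipping a failed small output.
    escalated = first is small and quality < min_quality
    if escalated:
        output = large.generate(request)
        cost += large.cost_per_call
        quality = quality_check(request, output)
    latency_s = time.perf_counter() - start
    return output, RequestLog(first.name, escalated, cost, latency_s, quality)
```

Note that an escalated request pays for both calls and waits for both, which is why escalations need to be counted in the cost and latency logs and not treated as free.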

Inputs

  • a request stream with mixed difficulty
  • two or more candidate models
  • a quality check or rubric
  • a quality floor configuration
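One possible shape for the quality floor configuration is sketched below. The field names, the 0-to-1 score scale, and the default values are assumptions made for illustration; the brief does not prescribe them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouterConfig:
    # All values are assumed to be on a 0.0 to 1.0 scale (hypothetical defaults).
    complexity_threshold: float = 0.5      # below this, try the small model first
    per_request_min_quality: float = 0.7   # small outputs below this are escalated
    aggregate_quality_floor: float = 0.85  # mean quality the whole run must hold

    def __post_init__(self):
        # Reject out-of-range settings early instead of at report time.
        for name in ("complexity_threshold",
                     "per_request_min_quality",
                     "aggregate_quality_floor"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")
```

Keeping the per-request check and the aggregate floor as separate settings makes it possible to tune escalation behavior without changing the pass/fail bar for the run.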

Outputs

  • a per-request routing decision
  • cost, latency, and quality logs
  • an aggregate tradeoff report against the all-large baseline

Definition of done

On a synthetic request stream with labeled easy and hard items, the router cuts cost and latency against the all-large baseline while holding aggregate quality at or above the configured floor, and escalates failed small-model outputs to the large model rather than shipping them.

Example input

A batch of 1000 requests, 70 percent simple lookups and 30 percent multi-step reasoning, with an all-large baseline cost and quality recorded.
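A batch like this can be generated deterministically. The sketch below is one way to do it; `make_batch`, the record fields, and the template sentences are hypothetical, and a real build would want more varied request text than two templates.

```python
import random


def make_batch(n=1000, simple_share=0.7, seed=0):
    """Build a shuffled, labeled synthetic request batch."""
    rng = random.Random(seed)  # seeded so the batch is reproducible
    n_simple = round(n * simple_share)
    items = [
        {"id": i, "label": "easy",
         "text": f"Look up the value of field {i}."}
        for i in range(n_simple)
    ]
    items += [
        {"id": i, "label": "hard",
         "text": f"Compare plan A and plan B for case {i}, "
                 "then explain the tradeoff step by step."}
        for i in range(n_simple, n)
    ]
    rng.shuffle(items)  # mix difficulties so order carries no signal
    return items
```

Running the same batch through an all-large configuration first gives the baseline cost, latency, and quality records that the router is later compared against.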

Example output

Report: cost down 58 percent, p95 latency down 40 percent, aggregate quality within 1 point of baseline, 4 percent of small-model outputs escalated to large on a failed quality check.
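The figures in a report like this come from simple arithmetic over the per-request logs. The sketch below shows one way to compute them, assuming each record is a dict with `cost`, `latency_s`, and `quality` keys (routed records also carry `route` and `escalated`). These field names, the nearest-rank p95, and the choice to express escalations as a share of small-first requests are assumptions for illustration.

```python
from statistics import mean


def p95(values):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def tradeoff_report(routed, baseline):
    """Compare routed per-request logs against the all-large baseline logs."""
    cost_routed = sum(r["cost"] for r in routed)
    cost_base = sum(b["cost"] for b in baseline)
    p95_routed = p95([r["latency_s"] for r in routed])
    p95_base = p95([b["latency_s"] for b in baseline])
    small_first = [r for r in routed if r["route"] == "small"]
    n_escalated = sum(1 for r in small_first if r["escalated"])
    return {
        # Positive numbers mean the router improved on the baseline.
        "cost_reduction_pct": 100 * (1 - cost_routed / cost_base),
        "p95_latency_reduction_pct": 100 * (1 - p95_routed / p95_base),
        # Negative means routed quality is below the baseline.
        "quality_delta": mean(r["quality"] for r in routed)
                         - mean(b["quality"] for b in baseline),
        "escalation_rate_pct": (100 * n_escalated / len(small_first)
                                if small_first else 0.0),
    }
```

Because an escalated request waits for both models, a high escalation rate can push p95 latency above the baseline even while total cost falls, so both numbers are worth reporting side by side.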

Data plan

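Synthetic data only: a generated request stream with labeled easy and hard items, plus a recorded all-large baseline for cost and quality.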

Boundaries and non-goals

  • training a router model from scratch
  • real production traffic
  • provider billing integration

Evaluation ideas

  • cost and latency reduction against baseline
  • quality floor adherence
  • escalation precision
  • routing accuracy on labeled difficulty

Run Level target

R3 Reliable. Plain translation: handles real cases.

Scope envelope

Buildable by one solo builder in 20 to 30 focused hours, on public, synthetic, or sanitized data, with a demo path that requires no production access.

Suggested tools

Suggested options, never requirements; briefs are tool-agnostic.