Inference API performance · TMLS 2026

Squeezing more
juice out of
your inference API.

Performance optimizations, and how your application can unlock them.

Hagay Lupesko Cerebras June 18, 2026

A controlled, recorded run

Speed makes the product
interactive and engaging.

Both agents receive the same prompt. The fast agent completes the HTML snake game in seven seconds; the slow agent completes it in 43 seconds.

OpenAI, “Introducing GPT‑5.3‑Codex‑Spark,” 2026

The thesis

The LLM provider builds the engine.

Your application defines the workload.

Performance is what happens when they meet.

Measure what matters · 1

Four metrics.
One user experience.

Four metrics for measuring LLM application performance

TTFT measures request sent to first output token. Generation speed measures output tokens per second. API end-to-end measures request sent to the last response token. Application end-to-end measures user action to final usable result.

TTFTmilliseconds

request sent
first output token received

When can the app response begin?
Generation speedtokens / second

rate at which
output tokens are produced

How quickly is the response generated?
API E2Eseconds

request sent
last response token received

When is the API call complete?
App E2Eseconds

user action
final usable result rendered

When is the application task complete?

Measure what matters · 2

From user intent to generated tokens

End-to-end LLM application request path

A request moves from the application through connectivity, admission, orchestration, prompt processing, and decode. Generated tokens then return to the application.

  1. Application Initiate the request
    • UX
    • state
    • tools
    • API call
  2. Connectivity Reach the provider
    • DNS
    • TCP
    • TLS
    • edge
  3. Admission Allow or reject
    • validation
    • authentication
    • authorization
    • limits
  4. Orchestration Prepare and route
    • tokenize
    • prioritize
    • route
    • batch
  5. Prompt
    processing
    Build model state
    • embeddings
    • cache lookup
    • prefill
  6. Decode Generate response
    • sample
    • constrain
    • emit tokens

Measure what matters · 3

One request.
Four measurements.

Four performance metrics placed on one LLM application request

The application prepares a request that crosses connectivity, admission, orchestration, prompt processing, and decode. The first returned token reveals an illustrative TTFT of 272 milliseconds. Repeated tokens animate a generation-speed gauge between 800 and 1000 tokens per second before settling at 900. The last response token reveals an illustrative API end-to-end latency of 2.3 seconds. One second later, application validation, processing, and rendering reveal an illustrative application end-to-end latency of 3.5 seconds.

Measure what matters · 4

Monitor the middle.
Watch the tail.

Latency distribution showing typical and tail request experiences

The p50 marker represents a typical request. The p95 marker represents the slow tail, and the p99 marker represents the extreme tail.

  • p50Typical: half of requests finish faster.
  • p95Slow tail: 5% of requests take longer.
  • p99Extreme tail: 1% of requests take longer.

Monitor both typical and tail performance. Both matter!

The playbook

Five levers for
faster LLM requests.

  1. 1Route
    intelligently
  2. 2Shorten
    the wire
  3. 3Send less;
    reuse more
  4. 4Stop
    sooner
  5. 5Separate
    by urgency

01 Route intelligently

Quality and speed
often trade places.

Broader model capability often carries a speed and cost premium.

Model quality and generation speed benchmark scatter plot

Five models appear in order from lower to higher Artificial Analysis Intelligence Index. A fitted curve shows the typical tradeoff in this snapshot: higher-quality models generate more slowly.

Artificial Analysis model comparison · methodology · snapshot: June 13, 2026

01 Route intelligently

Meet the task’s quality bar.
Then choose the fastest route.

Repeat the decision for every task and every model call.

Selecting the fastest model that meets a task quality bar

The same five-model benchmark chart now includes an example task quality bar at index 45. The model and fitted-curve segment below the bar are dimmed. DeepSeek V4 Flash is highlighted as the fastest model in this snapshot that clears the example bar.

Artificial Analysis model comparison · methodology · snapshot: June 13, 2026

01 Route intelligently

A faster LLM API
transforms the application.

When both models qualify, the faster API transforms the user experience.

Total sequential API time for two qualifying models across ten calls

In an illustrative task where both models meet the quality bar, ten sequential calls each use ten thousand input tokens, generate five hundred output tokens, and include time to first token. Gpt-oss-120b on Cerebras, with 1.62 seconds time to first token and 1,823 tokens per second, takes about nineteen seconds total. GPT-5.5, with 75.1 seconds time to first token and 60 tokens per second, takes about thirteen minutes and fifty-four seconds total. The same agentic task completes about forty-four times faster on the Cerebras route.

≤ 33 task quality bar 10 sequential LLM calls 10,000 input tokens per call 500 output tokens per call
gpt-oss-120b · CerebrasTTFT 1.62 s · 1,823 TPS · 1.89 s/call
19s
GPT-5.5TTFT 75.1 s · 60 TPS · 83.43 s/call
13m 54s
44x faster completionSame agentic task

gpt-oss-120b provider data · GPT-5.5 data · Artificial Analysis snapshot: June 13, 2026 · scenario assumptions are illustrative

02 Shorten the wire

Before inference, the request has to get there.

Connection setup is costly.

Cold and warm HTTPS request waterfalls for an LLM API call

In an illustrative request with a 40 millisecond round-trip time, a DNS cache miss takes 20 milliseconds, the TCP handshake takes 40 milliseconds, and full TLS 1.3 negotiation takes 40 milliseconds before the variable request upload.

Illustrative request Assuming for every API roundtrip
  1. DNS lookup on a cache miss
  2. TCP handshake · typical range
  3. TLS negotiation · typical range
  4. Request uploadPayload size + network bandwidth
    Variable
Every fresh connection
DNS + TCP + TLS:

DNS, RFC 1034 · TCP, RFC 9293 · TLS 1.3, RFC 8446

02 Shorten the wire

Pay connection setup once.

Create one client. Reuse its connection pool.

Practical connection reuse guidance for LLM API clients

The first request pays for DNS, TCP, TLS, and request upload. It then shifts into the blue transport palette but remains visible. A later request appears below and reuses the established connection, so it can begin with request upload. Developers should create one long-lived client per process, avoid recreating it for every call, and prefer HTTP/2 when supported.

First request - connection setup paid once
DNS
TCP
TLS
Request upload
Later requests reuse the open connection - upload begins immediately
Request upload
  1. 01Create one long-lived LLM client per application process.
  2. 02Do not close and recreate the client after every API call.
  3. 03
    Use HTTP/2 when supported by API provider.
    MultiplexingConcurrent exchanges share one warm connection.
    Binary framingCompact frames are efficient to parse and transmit.
    Header compressionHPACK reduces repeated header bytes.
Using OpenAI Python API with HTTP/2process scope

Install: pip install "httpx[http2]" openai

from openai import OpenAI, DefaultHttpxClientllm = OpenAI(http_client=DefaultHttpxClient(http2=True))

HTTPX HTTP/2 guide

02 Shorten the wire

Large payloads drag down latency.
Encode compactly, then compress.

Same prompt. Far fewer bytes. Lower latency.

Cerebras request-payload compression benchmark

A Cerebras benchmark uses a code-review request with approximately 30 thousand prompt tokens, up to 1,024 output tokens, and Llama 3.1 8B. MessagePack binary encoding followed by gzip reduces the request from 123.4 kilobytes to 2.0 kilobytes. TTFT p50 and p90 improve from 0.63 and 0.70 seconds to 0.43 and 0.49 seconds. API E2E p50 and p90 improve from 0.76 and 0.82 seconds to 0.58 and 0.63 seconds.

-token code review · max · · Benchmarked on Cerebras Inference

JSON payload
01 msgpack binary encoding
02 gzip compression
smaller payload
Impact
fewer bytes faster P50 TTFT faster P50 API E2E
Cerebras Inference example
import gzip, msgpack

encoded = msgpack.packb(payload)
body = gzip.compress(encoded)

headers = {
  "Authorization": f"Bearer {api_key}",
  "Content-Type": "application/vnd.msgpack",
  "Content-Encoding": "gzip",
}

response = client.post(
  CEREBRAS_CHAT_COMPLETIONS_URL,
  content=body, headers=headers,
)

Cerebras payload-compression benchmark

03 Send less; reuse more

Every input token must earn its place.

Context grows every turn. Prompt processing grows with it. Curate continuously.

Continuous context curation keeps the active model context bounded

The slide is divided evenly between accumulating context on the left and curating context on the right. Three successive API requests add files, tools, tool outputs, and messages to the accumulating context without removing anything. A conceptual chart below adds one higher time-to-first-token point with each request. The right side remains hidden until the presenter advances. It then shows time to first token rising modestly as context grows and falling after each continuous curation pass. The final state labels unbounded accumulation as the approach to avoid and continuous curation as the preferred approach.

Accumulating context

✕ Unbounded accumulation
system prompt task instructions file 1 search tool user message 1 file 2 file 3 search tool output database tool user message 2 file 4 file 5 database tool output task state user message 3

As context grows, so does TTFT

Curating context

✓ Continuous curation
Curate continuously: Summarize, Deduplicate, Trim

Curate continuously to keep TTFT in check.

Anthropic, Effective context engineering for AI agents

03 Send less; reuse more

Maximize your prompt
cache hit rate.

Stable prefix. Less prompt processing. Lower TTFT.

An identical prompt prefix creates a cache hit while an early mismatch creates a cache miss

A first request processes and stores a stable prompt prefix in a provider-managed cache. A later request with the same prefix reuses that work, processes only its variable tail, and has lower time to first token. A request with an early mismatch must process the prefix again and has higher time to first token. On presenter advance, three practices explain how to structure stable prefixes, keep repeated prefixes identical, and monitor cache hit rate from cached token usage.

Original prompt Prompt processed and cached PROVIDER MANAGED prefix match Cache hit · Lower TTFT × Cache miss · Higher TTFT

OpenAI prompt caching · Cerebras prompt caching · Cerebras cache usage field

04 Stop sooner

Decode is paid
sequentially.

total ≈ TTFT +output tokens÷generation rate
Short and long completions at the same generation rate
80 tokensuseful answer
800 tokenssame rate, 10× the work

Design for the shortest useful answer.

04 Stop sooner

Ask for the artifact
the product needs.

Bounded output → predictable termination → fewer retries.

OpenAI structured outputs guide

05 Separate traffic by urgency

Not every request deserves
the same latency budget.

Traffic classes mapped to capacity lanes

Interactive, asynchronous, and offline requests are separated into different queues or service tiers.

interactive
reserved headroomchat · copilots · active users
asynchronous
standard capacityuser-triggered background work
offline / batch
throughput optimizedevals · backfills · enrichment

Use priority, reserved capacity, or batch modes only where the provider exposes them.

OpenAI Batch · Anthropic Message Batches · Cerebras Batch (Private Preview)

Compound gains

Small cuts
compound.

Illustrative before and after distributed traces

An optimized request has shorter application, wire, queue, prefill, decode, and response spans than the baseline.

beforeappwirequeueprefilldecoderesponse3.8 s
afterappwirequeueprefilldecoderesponse1.5 s

Illustrative trace, not a universal benchmark.

Smaller. Closer. Shorter. Better routed. Measured end to end.

Monday morning

Measure.
Shape.
Verify.

  1. 01Instrument user → useful result.
  2. 02Set task quality bars. Pick the fastest qualifiers.
  3. 03Measure and shorten the wire.
  4. 04Prune inputs. Preserve prefixes. Bound outputs.
  5. 05Isolate urgent traffic. Verify p50 / p95 / p99.
You do not control the provider’s engine.
You control whether your application fights it or lets it fly.