Inference API performance · TMLS 2026
Performance optimizations, and how your application can unlock them.
A controlled, recorded run
The thesis
The LLM provider builds the engine.
Your application defines the workload.
Performance is what happens when they meet.
Measure what matters · 1
TTFT measures request sent to first output token. Generation speed measures output tokens per second. API end-to-end measures request sent to the last response token. Application end-to-end measures user action to final usable result.
request sent →
first output token received
rate at which
output tokens are produced
request sent →
last response token received
user action →
final usable result rendered
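The first three measurements can be taken from any token stream. The sketch below is a minimal illustration, not a provider SDK: `measure_stream` and its `clock` parameter are hypothetical names, and generation speed is computed under one common convention (tokens after the first, divided by the time from first to last token).

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Time a token stream. Call this at the moment the request is sent."""
    sent = clock()
    first: Optional[float] = None
    last = sent
    count = 0
    for _ in tokens:
        last = clock()          # arrival time of this token
        if first is None:
            first = last        # arrival time of the first output token
        count += 1
    if first is None:
        raise ValueError("stream produced no tokens")
    decode_time = last - first
    return {
        "ttft_s": first - sent,       # request sent -> first output token
        "api_e2e_s": last - sent,     # request sent -> last response token
        # Assumption: speed is measured over the tokens after the first one.
        "tokens_per_s": (count - 1) / decode_time if decode_time > 0 else None,
    }
```

Application end-to-end is not covered here, because it starts at the user action and ends at rendering, both of which live outside the API call.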
Measure what matters · 2
A request moves from the application through connectivity, admission, orchestration, prompt processing, and decode. Generated tokens then return to the application.
Measure what matters · 3
The application prepares a request that crosses connectivity, admission, orchestration, prompt processing, and decode. The first returned token reveals an illustrative TTFT of 272 milliseconds. Repeated tokens animate a generation-speed gauge between 800 and 1000 tokens per second before settling at 900. The last response token reveals an illustrative API end-to-end latency of 2.3 seconds. One second later, application validation, processing, and rendering reveal an illustrative application end-to-end latency of 3.5 seconds.
Performance numbers are illustrative
Measure what matters · 4
The p50 marker represents a typical request. The p95 marker represents the slow tail, and the p99 marker represents the extreme tail.
Monitor both typical and tail performance. Both matter!
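A minimal sketch of reporting typical and tail latency together, assuming latency samples are already collected in a list. It uses the nearest-rank definition of a percentile; other definitions interpolate and give slightly different values. The function names are illustrative.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples: Sequence[float]) -> dict:
    """Report the typical request (p50) alongside the slow (p95) and extreme (p99) tails."""
    return {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}
```

With few samples, p99 is effectively the maximum, so tail figures only become meaningful once enough requests have been recorded.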
The playbook
01 Route intelligently
Broader model capability often carries a speed and cost premium.
Five models appear in order from lower to higher Artificial Analysis Intelligence Index. A fitted curve shows the typical tradeoff in this snapshot: higher-quality models generate more slowly.
Artificial Analysis model comparison · methodology · snapshot: June 13, 2026
01 Route intelligently
Repeat the decision for every task and every model call.
The same five-model benchmark chart now includes an example task quality bar at index 45. The model and fitted-curve segment below the bar are dimmed. DeepSeek V4 Flash is highlighted as the fastest model in this snapshot that clears the example bar.
Artificial Analysis model comparison · methodology · snapshot: June 13, 2026
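The routing rule on this slide can be sketched as a filter followed by a maximum: keep the models that clear the task's quality bar, then take the fastest. `ModelOption` and `pick_fastest_qualifying` are hypothetical names, and any catalog passed in would come from your own benchmark snapshot, not from this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelOption:
    name: str
    quality_index: float   # benchmark index for this snapshot
    tokens_per_s: float    # measured generation speed


def pick_fastest_qualifying(
    options: Sequence[ModelOption], quality_bar: float
) -> Optional[ModelOption]:
    """Return the fastest model at or above the quality bar, or None if none qualifies."""
    qualifying = [m for m in options if m.quality_index >= quality_bar]
    return max(qualifying, key=lambda m: m.tokens_per_s, default=None)
```

Because the bar differs per task, the function would be called once per model call rather than once per application.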
01 Route intelligently
When both models qualify, the faster API transforms the user experience.
In an illustrative task where both models meet the quality bar, ten sequential calls each send ten thousand input tokens and generate five hundred output tokens, and each call pays its own time to first token. On Cerebras, gpt-oss-120b, with a 1.62-second time to first token and 1,823 tokens per second, takes about nineteen seconds in total. GPT-5.5, with a 75.1-second time to first token and 60 tokens per second, takes about thirteen minutes and fifty-four seconds in total. The same agentic task completes about forty-four times faster on the Cerebras route.
gpt-oss-120b provider data · GPT-5.5 data · Artificial Analysis snapshot: June 13, 2026 · scenario assumptions are illustrative
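The totals on this slide follow from a simple model: each sequential call costs its time to first token plus output tokens divided by generation speed. The sketch below reproduces the arithmetic with the slide's figures; it assumes prompt processing for the ten thousand input tokens is already reflected in the TTFT values.

```python
def sequential_task_seconds(
    calls: int, ttft_s: float, output_tokens: int, tokens_per_s: float
) -> float:
    """Total time for back-to-back calls: each pays TTFT plus decode time."""
    return calls * (ttft_s + output_tokens / tokens_per_s)


fast = sequential_task_seconds(10, 1.62, 500, 1823)   # gpt-oss-120b on Cerebras
slow = sequential_task_seconds(10, 75.1, 500, 60)     # GPT-5.5
speedup = slow / fast

print(f"{fast:.1f} s vs {slow:.1f} s, about {speedup:.0f}x")
```

The slower route's total is dominated by TTFT rather than decode, which is why the ratio is larger than the difference in tokens per second alone would suggest.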
02 Shorten the wire
Connection setup is costly.
In an illustrative request with a 40 millisecond round-trip time, a DNS cache miss takes 20 milliseconds, the TCP handshake takes 40 milliseconds, and full TLS 1.3 negotiation takes 40 milliseconds before the variable request upload.
02 Shorten the wire
Create one client. Reuse its connection pool.
The first request pays for DNS, TCP, TLS, and request upload. It then shifts into the blue transport palette but remains visible. A later request appears below and reuses the established connection, so it can begin with request upload. Developers should create one long-lived client per process, avoid recreating it for every call, and prefer HTTP/2 when supported.
Install: pip install "httpx[http2]" openai
from openai import OpenAI, DefaultHttpxClient
llm = OpenAI(http_client=DefaultHttpxClient(http2=True))
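A rough way to see what reuse saves, using the illustrative figures from the connection-setup slide (20 ms DNS miss, 40 ms TCP handshake, 40 ms TLS 1.3 negotiation). The sketch assumes every new connection pays all three costs in full, which real clients often soften with DNS caching or session resumption; `setup_overhead_ms` is a hypothetical helper.

```python
def setup_overhead_ms(
    requests: int,
    dns_ms: float = 20,
    tcp_ms: float = 40,
    tls_ms: float = 40,
    reuse: bool = True,
) -> float:
    """Total connection-setup time across a run of requests."""
    per_connection = dns_ms + tcp_ms + tls_ms
    # With a long-lived client, only the first request opens a connection.
    connections = 1 if reuse and requests > 0 else requests
    return connections * per_connection


print(setup_overhead_ms(10, reuse=False))  # new client per call
print(setup_overhead_ms(10, reuse=True))   # one shared client
```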
02 Shorten the wire
Same prompt. Far fewer bytes. Lower latency.
A Cerebras benchmark uses a code-review request with approximately 30 thousand prompt tokens, up to 1,024 output tokens, and Llama 3.1 8B. MessagePack binary encoding followed by gzip reduces the request from 123.4 kilobytes to 2.0 kilobytes. TTFT p50 and p90 improve from 0.63 and 0.70 seconds to 0.43 and 0.49 seconds. API E2E p50 and p90 improve from 0.76 and 0.82 seconds to 0.58 and 0.63 seconds.
…-token code review · max … · … · Benchmarked on Cerebras Inference
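import gzip, msgpack
# payload, api_key, and client (for example an httpx.Client) are defined elsewhere.
# Encode the request as MessagePack, then gzip the binary result.
encoded = msgpack.packb(payload)
body = gzip.compress(encoded)
# Tell the server how the body is encoded and compressed.
headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/vnd.msgpack",
    "Content-Encoding": "gzip",
}
response = client.post(
    CEREBRAS_CHAT_COMPLETIONS_URL,
    content=body, headers=headers,
)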
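Before changing the wire format, it can help to measure how compressible your own requests are. The sketch below is a standard-library stand-in: it gzips compact JSON rather than MessagePack, so its numbers will not match the benchmark above, and `compression_report` is a hypothetical helper.

```python
import gzip
import json


def compression_report(payload: dict) -> dict:
    """Compare the size of a request body before and after gzip."""
    raw = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    packed = gzip.compress(raw)
    return {
        "raw_bytes": len(raw),
        "gzip_bytes": len(packed),
        "ratio": len(raw) / len(packed),
    }
```

Prompts with repeated structure, such as source files or tool schemas, tend to compress well; short prompts may see little benefit.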
03 Send less; reuse more
Context grows every turn. Prompt processing grows with it. Curate continuously.
The slide is divided evenly between accumulating context on the left and curating context on the right. Three successive API requests add files, tools, tool outputs, and messages to the accumulating context without removing anything. A conceptual chart below adds one higher time-to-first-token point with each request. The right side remains hidden until the presenter advances. It then shows time to first token rising modestly as context grows and falling after each continuous curation pass. The final state labels unbounded accumulation as the approach to avoid and continuous curation as the preferred approach.
As context grows, so does TTFT
Curate continuously to keep TTFT in check.
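One simple curation pass, sketched under stated assumptions: keep system messages, then keep the most recent turns that fit a token budget. `approx_tokens` is a crude word-count proxy, not a real tokenizer, and both function names are hypothetical. Real curation might also summarize or drop stale tool outputs.

```python
from typing import Dict, List

Message = Dict[str, str]


def approx_tokens(text: str) -> int:
    """Rough proxy for token count; swap in a real tokenizer in practice."""
    return max(1, len(text.split()))


def curate(messages: List[Message], budget: int) -> List[Message]:
    """Keep system messages plus the newest turns that fit within the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(approx_tokens(m["content"]) for m in system)
    kept: List[Message] = []
    for message in reversed(rest):          # walk from newest to oldest
        cost = approx_tokens(message["content"])
        if used + cost > budget:
            break                           # older turns no longer fit
        kept.append(message)
        used += cost
    return system + kept[::-1]              # restore chronological order
```

Running a pass like this before every request keeps prompt size, and therefore prompt processing time, bounded instead of growing with each turn.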
03 Send less; reuse more
Stable prefix. Less prompt processing. Lower TTFT.
A first request processes and stores a stable prompt prefix in a provider-managed cache. A later request with the same prefix reuses that work, processes only its variable tail, and has lower time to first token. A request with an early mismatch must process the prefix again and has higher time to first token. On presenter advance, three practices explain how to structure stable prefixes, keep repeated prefixes identical, and monitor cache hit rate from cached token usage.
OpenAI prompt caching · Cerebras prompt caching · Cerebras cache usage field
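Two small checks support these practices. The first compares two prompts to see how much of the prefix is identical; the second turns cached-token usage into a hit rate. Both are illustrative: providers match on tokens rather than characters, and the usage field that reports cached tokens is provider-specific, so the values are passed in as plain numbers here.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def cache_hit_rate(cached_tokens: int, prompt_tokens: int) -> float:
    """Fraction of prompt tokens served from the provider's prefix cache."""
    return cached_tokens / prompt_tokens if prompt_tokens else 0.0


stable = "SYSTEM RULES AND TOOL DOCS\n"
# Variable content at the end keeps the stable prefix reusable.
good = shared_prefix_len(stable + "first question", stable + "second question")
# A timestamp at the start breaks the match almost immediately.
bad = shared_prefix_len("10:01 " + stable, "10:02 " + stable)
```

A falling hit rate across deployments is often the first sign that something variable has crept into the front of the prompt.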
04 Stop sooner
Design for the shortest useful answer.
04 Stop sooner
“After carefully considering the details, I believe the most likely category would be…”
parse ambiguity · extra tokens · retries
{
  "category": "billing",
  "confidence": 0.94
}
Bounded output → predictable termination → fewer retries.
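A bounded response is also cheap to validate. The sketch below parses the two-field object shown on the slide and rejects anything else; the `ALLOWED_CATEGORIES` set and the function name are hypothetical and would come from your own task definition.

```python
import json
from typing import Tuple

# Assumption: the application defines a closed set of categories.
ALLOWED_CATEGORIES = {"billing", "technical", "account"}


def parse_classification(raw: str) -> Tuple[str, float]:
    """Parse and validate a {"category", "confidence"} response."""
    data = json.loads(raw)                      # raises ValueError on free text
    category = data["category"]
    confidence = float(data["confidence"])
    if category not in ALLOWED_CATEGORIES:
        raise ValueError(f"unexpected category: {category!r}")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return category, confidence
```

The free-text answer from the slide fails at the first line of the function, which is the parse ambiguity that leads to retries.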
05 Separate traffic by urgency
Interactive, asynchronous, and offline requests are separated into different queues or service tiers.
Use priority, reserved capacity, or batch modes only where the provider exposes them.
OpenAI Batch · Anthropic Message Batches · Cerebras Batch (Private Preview)
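On the application side, the separation can start as one queue per urgency class. The sketch below is an in-process illustration with hypothetical names; it uses strict priority, which can starve offline work under sustained interactive load, so a production scheduler would likely add quotas or hand offline work to a provider batch mode where one exists.

```python
from collections import deque
from enum import Enum
from typing import Any, Optional, Tuple


class Urgency(Enum):
    INTERACTIVE = 0   # a user is waiting
    ASYNC = 1         # result needed soon, nobody is blocked
    OFFLINE = 2       # bulk work with no deadline pressure


class UrgencyRouter:
    """Keeps one FIFO queue per urgency class."""

    def __init__(self) -> None:
        self.queues = {u: deque() for u in Urgency}

    def submit(self, urgency: Urgency, request: Any) -> None:
        self.queues[urgency].append(request)

    def next_request(self) -> Optional[Tuple[Urgency, Any]]:
        """Serve the most urgent non-empty queue first."""
        for urgency in Urgency:               # iterates in definition order
            if self.queues[urgency]:
                return urgency, self.queues[urgency].popleft()
        return None
```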
Compound gains
An optimized request has shorter application, wire, queue, prefill, decode, and response spans than the baseline.
Illustrative trace, not a universal benchmark.
Smaller. Closer. Shorter. Better routed. Measured end to end.
Monday morning
You do not control the provider’s engine.
You control whether your application fights it or lets it fly.