BUILT WITH OPENAI CODEX POWERED BY CEREBRAS INFERENCE

Inference API performance · TMLS 2026

Squeezing more
juice out of
your inference API.

Application-level levers for faster inference.

Speed is a product feature

Two coding agents, same task. Go!

OpenAI, “Introducing GPT‑5.3‑Codex‑Spark,” 2026

The performance partnership

The inference provider builds the engine.

Your AI application defines the workload.

Performance is what happens when they meet.

Measure what matters · 1

Four metrics.
One user experience.

TTFT · milliseconds

request sent →
first output token received

When can the app response begin?

Generation speed · tokens / second

rate at which
output tokens are produced

How quickly is the response generated?

API E2E · seconds

request sent →
last response token received

When is the API call complete?

App E2E · seconds

user action →
final usable result rendered

When is the application task complete?
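
The first three metrics can be derived from timestamps captured around a streaming API call. The following is a minimal sketch, not a definitive implementation: `RequestMetrics` and `measure` are hypothetical names, and generation speed is assumed to be the number of tokens after the first one divided by the time spent producing them. App E2E needs application-side timestamps (user action and final render), so it is not covered here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RequestMetrics:
    ttft_s: float         # request sent -> first output token received
    gen_speed_tps: float  # output tokens per second after the first token
    api_e2e_s: float      # request sent -> last response token received


def measure(sent_at: float, token_times: Iterable[float]) -> RequestMetrics:
    """Derive TTFT, generation speed, and API E2E from arrival timestamps.

    sent_at and token_times are seconds on the same monotonic clock,
    for example values captured with time.monotonic().
    """
    times = list(token_times)
    if not times:
        raise ValueError("no output tokens were received")
    ttft = times[0] - sent_at
    api_e2e = times[-1] - sent_at
    decode_window = times[-1] - times[0]
    # Assumption: speed counts the tokens after the first one,
    # divided by the time spent producing them.
    gen_speed = (len(times) - 1) / decode_window if decode_window > 0 else 0.0
    return RequestMetrics(ttft, gen_speed, api_e2e)
```

In a real client, `token_times` would be filled by recording `time.monotonic()` each time a streamed chunk arrives.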

Measure what matters · 2

From user intent to generated tokens

Application: Initiate the request
- UX
- state
- tools
- API call
Connectivity: Reach the provider
- DNS
- TCP
- TLS
- edge
Admission: Allow or reject
- validation
- authentication
- authorization
- limits
Orchestration: Prepare and route
- tokenize
- prioritize
- route
- batch
Prompt processing: Build model state
- embeddings
- cache lookup
- prefill
Decode: Generate response
- sample
- constrain
- emit tokens

Measure what matters · 3

One request.
Four measurements.

Measure what matters · 4

Monitor the middle.
Watch the tail.

p50 · Typical: half of requests finish faster.
p95 · Slow tail: 5% of requests take longer.
p99 · Extreme tail: 1% of requests take longer.

Monitor both typical and tail performance. Both matter!
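
As an illustration of why the tail needs its own number, here is a small sketch that computes p50, p95, and p99 with the nearest-rank method. The function names are hypothetical, and nearest-rank is only one of several common percentile definitions; a monitoring stack may interpolate instead.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(latencies_s: Sequence[float]) -> Dict[str, float]:
    """Report the typical case and the tail side by side."""
    return {f"p{p}": percentile(latencies_s, p) for p in (50, 95, 99)}
```

With 98 requests at 1.0 s and 2 requests at 10.0 s, p50 and p95 both report 1.0 s while p99 reports 10.0 s: the slow requests are invisible unless the tail is tracked.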

The playbook

Four levers for
faster inference requests.

1 Route intelligently
2 Shorten the wire
3 Send less; reuse more
4 Optimize generation

01 Route intelligently

Quality and speed
often trade places.

Broader model capability often carries a speed and cost premium.

Artificial Analysis model comparison · methodology · snapshot: June 13, 2026

01 Route intelligently

Meet the task’s quality bar.
Then choose the fastest route.

Repeat the decision for every task and every model call.

Artificial Analysis model comparison · methodology · snapshot: June 13, 2026

01 Route intelligently

A faster inference API
transforms the application.

When both models qualify, the faster API transforms the user experience.

≤ 33 task quality bar · 10 sequential LLM calls · 10,000 input tokens per call · 500 output tokens per call

gpt-oss-120b · Cerebras: TTFT 1.62 s · 1,823 TPS · 1.89 s/call

19s

GPT-5.5: TTFT 75.1 s · 60 TPS · 83.43 s/call

13m 54s

44x faster completion · Same agentic task

gpt-oss-120b provider data · GPT-5.5 data · Artificial Analysis snapshot: June 13, 2026 · scenario assumptions are illustrative
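
The totals above follow from simple arithmetic on the slide's figures. The sketch below reproduces it, assuming each call costs its TTFT plus the time to generate 500 output tokens at the quoted speed, and that the quoted TTFT already accounts for processing the 10,000 input tokens. The function names are illustrative.

```python
def per_call_seconds(ttft_s: float, tps: float, output_tokens: int = 500) -> float:
    """Latency of one call: wait for the first token, then generate the rest."""
    return ttft_s + output_tokens / tps


def task_seconds(ttft_s: float, tps: float, calls: int = 10,
                 output_tokens: int = 500) -> float:
    """Sequential calls cannot overlap, so their latencies add up."""
    return calls * per_call_seconds(ttft_s, tps, output_tokens)


fast = task_seconds(ttft_s=1.62, tps=1823)  # gpt-oss-120b on Cerebras
slow = task_seconds(ttft_s=75.1, tps=60)    # GPT-5.5
minutes, seconds = divmod(int(slow), 60)

print(f"fast: {fast:.0f}s")
print(f"slow: {minutes}m {seconds}s")
print(f"speedup: {slow / fast:.0f}x")
```

Because the calls are sequential, every second of per-call latency is multiplied by the number of calls in the task.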

02 Shorten the wire

Before inference, the request has to get there.

Connection setup is costly.

Illustrative request Assuming … for every API roundtrip

DNS lookup… on a cache miss

…
TCP handshake… · typical range …

…
TLS negotiation… · typical range …

…
Request uploadPayload size + network bandwidth

Variable

Every fresh connection

DNS + TCP + TLS: …

DNS, RFC 1034 · TCP, RFC 9293 · TLS 1.3, RFC 8446

02 Shorten the wire

Pay connection setup once.

Create one client. Reuse its connection pool.

First request - connection setup paid once

DNS

TCP

TLS

Request upload

Later requests reuse the open connection - upload begins immediately

Request upload

01 Create one long-lived LLM client per application process.
02 Do not close and recreate the client after every API call.
03 Use HTTP/2 when the API provider supports it.

Multiplexing: Concurrent exchanges share one warm connection.

Binary framing: Compact frames are efficient to parse and transmit.

Header compression: HPACK reduces repeated header bytes.

Using the OpenAI Python API with HTTP/2 · process scope

Install: pip install "httpx[http2]" openai

from openai import OpenAI, DefaultHttpxClient
llm = OpenAI(http_client=DefaultHttpxClient(http2=True))

HTTPX HTTP/2 guide
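
One way to enforce "one client per process" is a cached factory, sketched below. `StubClient` and `get_llm_client` are hypothetical names; the stub stands in for the real SDK client so the pattern can be run without network access or third-party packages.

```python
from functools import lru_cache


class StubClient:
    """Stand-in for a real SDK client; counts how many times it is constructed."""

    created = 0

    def __init__(self) -> None:
        StubClient.created += 1


@lru_cache(maxsize=1)
def get_llm_client() -> StubClient:
    # In a real application this would build the provider client once, e.g.
    # OpenAI(http_client=DefaultHttpxClient(http2=True)), so that every caller
    # shares the same connection pool.
    return StubClient()


first = get_llm_client()
second = get_llm_client()
print(first is second, StubClient.created)
```

Every call site asks the factory for the client instead of constructing its own, so connection setup is paid once per process.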

02 Shorten the wire

Large payloads drag down latency.
Encode compactly, then compress.

Same prompt. Far fewer bytes. Lower latency.

…-token code review · max … · … · Benchmarked on Cerebras Inference

JSON payload …

01 msgpack binary encoding

02 gzip compression

smaller payload …

Impact

…fewer bytes …faster P50 TTFT …faster P50 API E2E

Cerebras Inference example

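import gzip, msgpack

# payload: the usual chat-completions request dict (model, messages, ...)
encoded = msgpack.packb(payload)  # 01: compact binary encoding
body = gzip.compress(encoded)     # 02: compress the encoded bytes

headers = {
  "Authorization": f"Bearer {api_key}",
  "Content-Type": "application/vnd.msgpack",
  "Content-Encoding": "gzip",
}

# client: a long-lived HTTP client, for example an httpx.Client
response = client.post(
  CEREBRAS_CHAT_COMPLETIONS_URL,
  content=body, headers=headers,
)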

Cerebras payload-compression benchmark

03 Send less; reuse more

Every input token must earn its place.

Context grows every turn. Prompt processing grows with it. Curate continuously.

Accumulating context

✕ Unbounded accumulation

system prompt · task instructions · file 1 · search tool · user message 1 · file 2 · file 3 · search tool output · database tool · user message 2 · file 4 · file 5 · database tool output · task state · user message 3

As context grows, so does TTFT

Curating context

✓ Continuous curation

Curate continuously: Summarize, Deduplicate, Trim

Curate continuously to keep TTFT in check.
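
A minimal sketch of two of the curation steps, deduplicate and trim, is shown below. The helper names are hypothetical, the token count is a crude word-count approximation rather than a real tokenizer, and summarization is left out because it typically needs a model call of its own.

```python
from typing import Dict, List

Message = Dict[str, str]


def approx_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def total_tokens(messages: List[Message]) -> int:
    return sum(approx_tokens(m["content"]) for m in messages)


def curate(messages: List[Message], budget: int) -> List[Message]:
    """Deduplicate, then trim the oldest non-system messages until under budget."""
    seen = set()
    deduped: List[Message] = []
    # Walk newest-first so the most recent copy of a duplicate survives.
    for msg in reversed(messages):
        key = (msg["role"], msg["content"])
        if key in seen:
            continue
        seen.add(key)
        deduped.append(msg)
    deduped.reverse()

    # System messages are always kept and placed first.
    system = [m for m in deduped if m["role"] == "system"]
    rest = [m for m in deduped if m["role"] != "system"]
    while rest and total_tokens(system + rest) > budget:
        rest.pop(0)  # drop the oldest non-system message
    return system + rest
```

Running `curate` before every model call keeps the input bounded instead of letting it grow with each turn.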

Anthropic, Effective context engineering for AI agents

03 Send less; reuse more

Maximize your prompt
cache hit rate.

Stable prefix. Less prompt processing. Lower TTFT.

OpenAI prompt caching
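
Providers that offer prompt caching typically match on an exact prefix of the request, so the practical rule is: stable content first, variable content last. The sketch below illustrates that ordering; `build_messages` and `shared_prefix_len` are hypothetical helpers, and cache granularity and minimum lengths vary by provider.

```python
from typing import Dict, List, Sequence

Message = Dict[str, str]


def build_messages(system_prompt: str, reference_docs: str,
                   history: Sequence[Message], user_turn: str) -> List[Message]:
    # Stable content first: identical across requests, so it can be cached.
    prefix = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": reference_docs},
    ]
    # Variable content last: earlier turns stay in order, the new turn is appended.
    return prefix + list(history) + [{"role": "user", "content": user_turn}]


def shared_prefix_len(a: Sequence[Message], b: Sequence[Message]) -> int:
    """Number of leading messages that are identical in both requests."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n
```

Putting anything volatile at the front, such as a timestamp in the system prompt, makes consecutive requests diverge at the first message and leaves nothing to reuse.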

04 Optimize generation

generation time ≈ total generated tokens / generation speed

Speculative decoding.

A small, fast model guesses ahead. The target model remains authoritative.

Step 1 · Draft model: small, fast, races ahead

Step 2 · Target model: large, slower, verifies draft in parallel

Matching prefix tokens are accepted. At the first rejection, the target model takes over.

More accepted tokens per step → higher gen speed

Up to 4× speedup (varies by algorithm, workload, and hardware)

The provider handles it. Your API stays the same.
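
To make the accept-and-reject rule concrete, here is a toy sketch of one draft-then-verify step with greedy decoding. Both "models" are hypothetical stand-ins that replay fixed token sequences, and the verification loop is sequential for readability; production systems score all draft positions in a single parallel pass and may use probabilistic acceptance rather than exact matching.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[List[str]], str]


def replay(sequence: Sequence[str]) -> NextToken:
    """Toy model: always continues with the next token of a fixed sequence."""
    return lambda context: sequence[len(context)]


def speculative_step(context: List[str], draft_next: NextToken,
                     target_next: NextToken, k: int = 4) -> List[str]:
    """Run one draft-then-verify step and return the tokens emitted."""
    # Step 1: the draft model proposes k tokens, one after another.
    draft: List[str] = []
    for _ in range(k):
        draft.append(draft_next(context + draft))

    # Step 2: the target model checks each proposed position in order.
    emitted: List[str] = []
    for proposed in draft:
        expected = target_next(context + emitted)
        if proposed != expected:
            emitted.append(expected)  # first rejection: the target's token wins
            return emitted
        emitted.append(proposed)      # matching prefix token is accepted
    # Every draft token was accepted; the target contributes one more token.
    emitted.append(target_next(context + emitted))
    return emitted


target = replay("the quick brown fox jumps".split())
good_draft = replay("the quick brown fox jumps".split())
weak_draft = replay("the quick red fox jumps".split())

print(speculative_step([], good_draft, target))
print(speculative_step([], weak_draft, target))
```

In both cases the emitted tokens are exactly what the target model would have produced alone; the draft only changes how many tokens are emitted per target step.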

04 Optimize generation

Lower Temperature

Make the prediction more predictable. Make the draft model succeed more often.

temperature is a sampling parameter supported by most API providers.

It rescales next-token probabilities:

Lower values make likely tokens more dominant.
Higher values allow more variation.

temperature ↓ → predictability ↑ → acceptance length ↑ → gen speed ↑

Code generation benchmark: gpt-oss-120b, Cerebras Inference · p50

Temperature 0.0 boosted gen speed by …

Unless you need the model to be “creative”, a lower temperature is likely to improve gen speed!

Temperature explained by Vellum
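
The rescaling can be sketched as a softmax over logits divided by the temperature. This is the standard textbook formulation, shown for illustration with made-up logits; providers may combine it with other sampling controls.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn raw next-token scores into probabilities at a given temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Limit case: all probability mass on the highest-scoring token (greedy).
        best = max(range(len(logits)), key=logits.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # illustrative scores for three candidate tokens
for t in (2.0, 1.0, 0.5, 0.0):
    top = softmax_with_temperature(logits, t)[0]
    print(f"temperature={t}: top-token probability {top:.2f}")
```

As the temperature falls, the top token's share rises, which is what makes the target model's choice easier for a draft model to guess.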

04 Optimize generation

Predicted Outputs

Faster responses when most of the output is already known.

prediction reduces latency by specifying parts of the response that are already known.

Use it when the output is mostly known in advance.

Edit a file with a small diff
Revise a document while preserving most text
Regenerate a mostly known artifact

Code example for supported API providers

instructions = "Change the button color to blue."
html_code = ...

response = client.chat.completions.create(
  model="gpt-oss-120b",
  messages=[
    {"role": "user", "content": instructions},
    {"role": "user", "content": html_code},
  ],
  prediction={"type": "content", "content": html_code},
)

Predicted Outputs applied to the right tasks can increase gen speed by 3x

Cerebras Predicted Outputs

04 Optimize generation

generation time ≈ total generated tokens / generation speed

Structured Outputs

Generate only the data your application needs.

response_format: json_schema constrains the response to a predefined shape, removing explanatory prose and bounding the output.

Use it when…

Output has a predetermined shape
Downstream code consumes the output as data
Task is extraction, classification, function dispatch, or similar

compact schema → less prose → fewer output tokens

Example use case: order extraction

“Extract the order details from this customer message.”

order_id · customer · total_usd · items · shipping_address · placed_date

Raw text · ~75 tokens

Looking at the message,
the order details are:
Order ID: 4477
Customer: Jane Smith
Total: $147.50 · Items: 3
Shipping: 742 Evergreen Terrace,
Springfield OR
Placed: March 14, 2026.

JSON schema · ~45 tokens

order_id: "4477",
customer: "Jane Smith",
total_usd: 147.50,
items: 3,
shipping_address: "742 Evergreen...",
placed_date: "2026-03-14"

40% fewer tokens generated

No schema-repair retries

Stable contract across model upgrades

Cerebras Structured Outputs
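
For the order-extraction example, the schema might look like the sketch below. The field types are assumptions inferred from the sample values, the `response_format` wrapper follows the OpenAI-style Chat Completions shape (check the provider's documentation for the exact fields it accepts), and `parse_order` is a hypothetical helper.

```python
import json

# Assumed types, inferred from the sample values in the example above.
order_schema = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "customer": {"type": "string"},
        "total_usd": {"type": "number"},
        "items": {"type": "integer"},
        "shipping_address": {"type": "string"},
        "placed_date": {"type": "string"},
    },
    "required": [
        "order_id", "customer", "total_usd",
        "items", "shipping_address", "placed_date",
    ],
    "additionalProperties": False,
}

# Passed as the response_format argument of a chat-completions request.
response_format = {
    "type": "json_schema",
    "json_schema": {"name": "order", "strict": True, "schema": order_schema},
}


def parse_order(raw: str) -> dict:
    """Parse the model's reply and check that every required field is present."""
    order = json.loads(raw)
    missing = set(order_schema["required"]) - order.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return order
```

Downstream code can then read fields such as `order["total_usd"]` directly instead of scraping them out of prose.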

04 Optimize generation

Right-size Reasoning

Match reasoning effort to task complexity.

`reasoning_effort`

Controls how much reasoning work the model performs before answering.

Low · Execution-heavy

Answer mostly exists in the input.

extract · classify · route · format

Medium · Mixed reasoning

Some inference, comparison, or light planning.

analyze · compare · reconcile constraints

High · Reasoning-heavy

The answer requires difficult multi-step work.

plan · debug · prove · complex tool use

Token generation with reasoning effort: gpt-oss-120b, Cerebras Inference

High reasoning effort can cost … the tokens, impacting TTFT and API E2E

Use the appropriate effort for the task at hand to boost performance and reduce cost

Cerebras reasoning
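
One simple way to apply this per call is a lookup from task kind to effort level, sketched below. The task labels mirror the three tiers above, while `EFFORT_BY_TASK`, `pick_reasoning_effort`, and `request_kwargs` are hypothetical names; the default of "medium" for unknown tasks is an assumption, not a recommendation from the talk.

```python
from typing import Dict

EFFORT_BY_TASK: Dict[str, str] = {
    # Execution-heavy: the answer mostly exists in the input.
    "extract": "low", "classify": "low", "route": "low", "format": "low",
    # Mixed reasoning: some inference, comparison, or light planning.
    "analyze": "medium", "compare": "medium", "reconcile": "medium",
    # Reasoning-heavy: difficult multi-step work.
    "plan": "high", "debug": "high", "prove": "high", "tool_use": "high",
}


def pick_reasoning_effort(task_kind: str, default: str = "medium") -> str:
    """Map a task kind to a reasoning_effort value; unknown kinds get the default."""
    return EFFORT_BY_TASK.get(task_kind, default)


def request_kwargs(task_kind: str, model: str = "gpt-oss-120b") -> Dict[str, str]:
    """Keyword arguments to merge into a chat-completions call for this task."""
    return {"model": model, "reasoning_effort": pick_reasoning_effort(task_kind)}
```

For example, `request_kwargs("classify")` selects low effort, while `request_kwargs("debug")` selects high.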

End of talk

The playbook. Recap.

01 Route intelligently

Choose the fastest model that clears the task’s quality bar.

02 Shorten the wire

Reuse connections and shrink every payload.

03 Send less; reuse more

Curate context and preserve cacheable prefixes.

04 Optimize generation

Make output predictable and bound tokens and reasoning.

You do not control the provider’s inference engine.
But you can shape the workload so that it performs at its best.

Thank you for your time!

fast-inference.ai · www.linkedin.com/in/hagaylupesko/
