Claude and I made 2,000 API calls to nine small closed-weight models across three providers, with prompt sizes ranging from 100 tokens to 1M tokens. We ended up discovering some interesting things about how providers scale inference, or fail to, in 2026.
You can view all the measurements in the interactive viewer. Code and raw dataset on GitHub.
People quote first-token latency numbers as if a model has a latency. It has a curve, and the curves cross. Choose wisely.
gpt-4.1-nano wins on tiny, sub-second queries; gemini-3.1-flash-lite wins on large queries above 600 KB (~150K tokens). Ranks below are by first-token latency, with milliseconds in parentheses.
| model | tiny (<1 KB) | 64 KB | 256 KB | 601 KB | 1 MB |
|---|---|---|---|---|---|
| gpt-4.1-nano | #1 (176 ms) | #2 (359) | #2 (882) | #3 (1,730) | #5 (4,876) |
| gpt-5.4-mini | #2 (233) | #1 (349) | #1 (779) | #2 (1,376) | #4 (3,353) |
| gemini-2.5-flash | #5 (289) | #3 (590) | #6 (2,290) | #5 (2,482) | #6 (4,650) |
| claude-haiku-4-5 | #7 (391) | #4 (608) | #4 (1,241) | #4 (2,229) | rejected |
| gemini-3.1-flash-lite | #8 (461) | #5 (733) | #3 (1,181) | #1 (912) | #1 (1,732) |
| gemini-3-flash-preview | #9 (492) | #6 (911) | #5 (1,268) | #6 (2,127) | #3 (2,861) |
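If you want to act on the crossover, the natural pattern is a size-based router. Here is a minimal sketch in Python; the model names come from the table above, but the byte thresholds are illustrative guesses read off that table, not tuned values from the benchmark.

```python
def pick_model(prompt: str) -> str:
    """Choose a model by prompt size, using rough crossovers from the table above.

    Thresholds are illustrative; re-measure for your own workload.
    """
    size_bytes = len(prompt.encode("utf-8"))
    if size_bytes < 64 * 1024:
        return "gpt-4.1-nano"           # fastest on tiny prompts
    if size_bytes < 512 * 1024:
        return "gpt-5.4-mini"           # leads through the mid-size band
    return "gemini-3.1-flash-lite"      # flattest curve at very large contexts
```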
Plot the minimum first-token latency (excluding network) against input
size, log-log. A textbook dense-attention model has O(n²) prefill
compute, so at large n you'd expect a slope of at least 1: every
doubling of context should cost at least double the prefill wall time,
approaching 4× once the quadratic attention term dominates. What we
actually see is a much flatter curve. Even above 100K tokens of
context, where opaque provider overheads become negligible, prefill
still scales sub-linearly.
Each line is the per-cell minimum of
first_content_delta_ms across prompt shapes.
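For reference, a stripped-down way to measure a first-content delta with a streaming call looks like the sketch below (OpenAI's Python SDK shown; the other providers' streaming clients are analogous). It is not the benchmark harness: among other things, it does not subtract the network round-trip.

```python
import time
from openai import OpenAI

client = OpenAI()

def first_content_delta_ms(model: str, prompt: str) -> float:
    """Milliseconds from sending the request to the first streamed chunk with text."""
    t0 = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return (time.perf_counter() - t0) * 1000.0
    raise RuntimeError("stream ended without any content")
```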
gemini-3.1-flash-lite stays under 5 s at ~870 k input tokens;
gpt-4.1-nano exceeds 23 s at half that context. None of the seven
curves bends like a quadratic.
Fit a power law C · n^α to the floor of each
curve and read off the exponent.
| model | α (all data) | α (≥ 10 k tok) | character |
|---|---|---|---|
| gemini-3.1-flash-lite | 0.15 | 0.29 | remarkably flat |
| claude-haiku-4-5 | 0.19 | 0.58 | smooth, sub-linear |
| gpt-5.4-mini | 0.28 | 0.69 | smooth, sub-linear |
| gemini-2.5-flash | 0.34 | 0.70 | smooth, sub-linear |
| gemini-3-flash-preview | 0.31 | 0.73 | sub-linear, late break |
| gpt-4o-mini | 0.34 | 0.84 | step-laden |
| gpt-4.1-nano | 0.40 | 1.02 | linear-or-worse at top end |
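For concreteness, the fit is just ordinary least squares in log-log space, where the slope is α. A minimal sketch; the sample numbers below are made-up stand-ins for one model's floor, not values from the dataset.

```python
import numpy as np

def fit_power_law(tokens, latency_ms):
    """Fit latency ≈ C * n**alpha; returns (alpha, C)."""
    log_n = np.log(np.asarray(tokens, dtype=float))
    log_t = np.log(np.asarray(latency_ms, dtype=float))
    alpha, log_c = np.polyfit(log_n, log_t, 1)  # slope is the exponent
    return alpha, float(np.exp(log_c))

# Illustrative floor: (input tokens, minimum first-token latency in ms)
tokens   = [200, 2_000, 16_000, 64_000, 150_000, 450_000, 870_000]
floor_ms = [460, 520, 640, 730, 900, 1_300, 1_730]
alpha, c = fit_power_law(tokens, floor_ms)
print(f"alpha = {alpha:.2f}")
```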
Gemini 3.1 Flash Lite walks from 204 input tokens to 866 k input
tokens, a factor of 4,200 in context, for only 0.7 s → 5 s in wall
time. Seven times more latency for four thousand times more context.
GPT-4.1 Nano exceeds 23 s at half that context.
The textbook says decode time per token should be near flat, or rise a touch as the prefix grows. We saw models whose decode costs rise significantly, and some where they actually fall. Milliseconds per output token, by input size:
| model | tiny | 128 KB | 256 KB | 601 KB | 1 MB |
|---|---|---|---|---|---|
| gemini-3.1-flash-lite | 4.6 | 4.7 | 4.8 | 3.3 | 3.3 |
| gemini-3-flash-preview | 7.1 | 8.5 | 8.0 | 12.5 | 11.7 |
| claude-haiku-4-5 | 11.8 | 11.1 | 11.0 | 12.5 | — |
| gpt-5.4-mini | 7.1 | 9.1 | 8.9 | 31.8 | 108.4 |
| gpt-4.1-nano | 14.1 | 12.0 | 18.3 | 17.7 | 67.5 |
| gpt-4o-mini | 18.5 | 31.9 | 34.6 | — | — |
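One way to derive a per-token decode figure from the same streamed responses: time from the first content chunk to the last, divided by the number of tokens generated in between. A sketch with hypothetical inputs; the output-token count would come from the response's usage accounting.

```python
def decode_ms_per_token(first_chunk_ts: float, last_chunk_ts: float,
                        output_tokens: int) -> float:
    """Average inter-token decode time, excluding prefill / time-to-first-token.

    first_chunk_ts / last_chunk_ts: perf_counter() readings taken when the
    first and last content chunks arrive; output_tokens: completion tokens
    reported by the API's usage accounting.
    """
    if output_tokens < 2:
        raise ValueError("need at least two output tokens")
    return (last_chunk_ts - first_chunk_ts) * 1000.0 / (output_tokens - 1)
```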
A query with 144 k input tokens is faster than one with
62 k by a good margin. 2.3x more input tokens will give you
a 1.5x faster response. Reproducibly!
Breaking it down by stage, both inferred prefill and decode times drop
around the same threshold. The simplest story is a routing transition to
different hardware somewhere near the 100 k-token band.
It's almost like Google rewards and OpenAI punishes large context sizes.
Ending on a practical note. Each provider uses the same tokenizer across all the models we tested, but across providers tokens are not apples-to-apples: the same text costs roughly 14% more tokens on Anthropic than on OpenAI, which most people don't account for in their cost math.
As expected, 4 chars per token is a good estimate for English text.
With our random prompt (hex-encoded random bytes), the ratio was
closer to 1, so keep that in mind if you're sending high-entropy
content.
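If you want to sanity-check chars-per-token on your own payloads, a quick local measurement with tiktoken (o200k_base, the encoding behind recent OpenAI models) is enough to see the English-vs-high-entropy gap; other providers' tokenizers will give different ratios, so for cross-provider counts use each provider's own token-counting endpoint.

```python
import os
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

english = "The quick brown fox jumps over the lazy dog. " * 200
high_entropy = os.urandom(4096).hex()  # hex-encoded random bytes, like the benchmark's random prompt

for name, text in [("english", english), ("hex random", high_entropy)]:
    n_tokens = len(enc.encode(text))
    print(f"{name:>11}: {len(text) / n_tokens:.2f} chars/token")
```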