Small Model Forensics
2026-05-13 · 0xmmo

Claude and I made 2,000 API calls to nine small closed-weight models across three providers, at prompt sizes ranging from 100 tokens to 1M tokens. Along the way we discovered some interesting things about how providers scale inference, or fail to, in 2026.
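
The raw measurement is simple to sketch. Here is a minimal version of the kind of timing loop involved, assuming the OpenAI Python SDK for illustration (each provider's streaming client works the same way); the model name and prompt are placeholders, not the exact harness, and the real runs also subtract a network baseline, which this sketch does not:

```python
import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def measure_call(model: str, prompt: str) -> tuple[float, float, int]:
    """Stream one completion; return (TTFT ms, total ms, content chunks seen)."""
    t0 = time.perf_counter()
    first = None
    chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content if chunk.choices else None
        if delta:
            chunks += 1
            if first is None:
                first = time.perf_counter()  # first content arrives: this is the TTFT clock
    t1 = time.perf_counter()
    return (first - t0) * 1000.0, (t1 - t0) * 1000.0, chunks
```

Each (model, prompt-size) cell was repeated, and most charts report the per-cell minimum, i.e. the latency floor with queueing and network noise mostly filtered out.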

01 Fastest model at 1 KB context is the slowest at 1 MB

People quote first-token latency numbers as if a model has a latency. It has a curve, and the curves cross. Choose wisely.

fig. 1 TTFT floor vs prompt size (log-log · min per byte bucket)

gpt-4.1-nano wins on tiny, sub-second queries; gemini-3.1-flash-lite wins on large queries above ~600 KB (~150K tokens).


tab. 1 TTFT floor rankings at representative byte sizes (rank, with floor TTFT in ms; lower is faster).
model tiny (<1 KB) 64 KB 256 KB 601 KB 1 MB
gpt-4.1-nano #1 (176 ms) #2 (359) #2 (882) #3 (1,730) #5 (4,876)
gpt-5.4-mini #2 (233) #1 (349) #1 (779) #2 (1,376) #4 (3,353)
gemini-2.5-flash #5 (289) #3 (590) #6 (2,290) #5 (2,482) #6 (4,650)
claude-haiku-4-5 #7 (391) #4 (608) #4 (1,241) #4 (2,229) rejected
gemini-3.1-flash-lite #8 (461) #5 (733) #3 (1,181) #1 (912) #1 (1,732)
gemini-3-flash-preview #9 (492) #6 (911) #5 (1,268) #6 (2,127) #3 (2,861)
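
If you take the crossover seriously, the practical move is to route by prompt size. A toy sketch of what that looks like, with cutoffs pulled from this run (tab. 1); they will drift as providers change their serving stacks, so treat the exact thresholds as placeholders:

```python
def pick_model(prompt_bytes: int) -> str:
    """Route to the model with the lowest TTFT floor for this prompt size.

    Cutoffs are illustrative, taken from our May 2026 measurements (tab. 1);
    re-benchmark before relying on them.
    """
    if prompt_bytes < 64 * 1024:
        return "gpt-4.1-nano"        # fastest floor on tiny, sub-second queries
    if prompt_bytes < 600 * 1024:
        return "gpt-5.4-mini"        # fastest floor through the mid range
    return "gemini-3.1-flash-lite"   # fastest floor above ~600 KB (~150K tokens)
```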

02 Every model's prefill is nowhere near O(n²)

Plot the minimum first-token latency (excluding network) against input size, log-log. Textbook dense attention makes prefill compute O(n²), which on a log-log plot means a slope drifting toward 2 at large n; even if the per-token linear FLOPs dominate, the slope should sit near 1, with every doubling of context costing roughly double the prefill wall time. What we actually see is a much flatter curve. Even above 100K-token contexts, where opaque provider overheads become negligible, prefill still scales sub-linearly.
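
For concreteness, here is the slope argument spelled out, treating prefill as a per-token linear term plus a quadratic attention term (the constants are whatever the hardware and serving stack give you; this is a back-of-the-envelope model, not a claim about any provider's implementation):

```latex
T_{\text{prefill}}(n) \;\approx\; A\,n + B\,n^{2}
\qquad\Longrightarrow\qquad
\frac{\mathrm{d}\log T_{\text{prefill}}}{\mathrm{d}\log n}
  \;=\; \frac{A\,n + 2B\,n^{2}}{A\,n + B\,n^{2}} \;\in\; (1,\,2)
```

The slope starts near 1 while the linear term dominates and climbs toward 2 as attention takes over; fitted exponents well below 1, as in tab. 2, are something this simple model cannot produce.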

fig. 2 TTFT floor vs context — seven models, four orders of magnitude (log-log · min over repeats)

Each line is the per-cell minimum of first_content_delta_ms across prompt shapes. gemini-3.1-flash-lite stays under 5 s at ~870 k input tokens; gpt-4.1-nano exceeds 23 s at half that context. None of the seven curves bends like a quadratic.

Fit a power law C · n^α to the floor of each curve and read off the exponent.

tab. 2 Fitted scaling exponent of min TTFT, by model.
model α (all data) α (≥ 10 K tokens) character
gemini-3.1-flash-lite 0.15 0.29 remarkably flat
claude-haiku-4-5 0.19 0.58 smooth, sub-linear
gpt-5.4-mini 0.28 0.69 smooth, sub-linear
gemini-2.5-flash 0.34 0.70 smooth, sub-linear
gemini-3-flash-preview 0.31 0.73 sub-linear, late break
gpt-4o-mini 0.34 0.84 step-laden
gpt-4.1-nano 0.40 1.02 linear-or-worse at top end
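
The fit itself is nothing fancy: a linear regression in log space. A minimal sketch, assuming numpy and per-model arrays of input token counts and floor TTFTs (names are illustrative):

```python
import numpy as np

def fit_scaling_exponent(n_tokens: np.ndarray, ttft_ms: np.ndarray) -> tuple[float, float]:
    """Fit TTFT ≈ C * n^alpha by least squares in log-log space; return (alpha, C).

    Assumes ttft_ms is already the per-bucket floor (minimum over repeats),
    so the fit tracks the capability curve rather than queueing noise.
    """
    alpha, log_c = np.polyfit(np.log(n_tokens), np.log(ttft_ms), deg=1)
    return float(alpha), float(np.exp(log_c))

# The right-hand column of tab. 2 restricts the fit to large contexts, e.g.:
# mask = n_tokens >= 10_000
# alpha_large, _ = fit_scaling_exponent(n_tokens[mask], ttft_ms[mask])
```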

Gemini 3.1 Flash Lite walks from 204 input tokens to 866 K input tokens — a factor of 4,200 in context — while the wall-time floor goes from only 0.7 s to 5 s. Seven times more latency for four thousand times more context. GPT-4.1 Nano exceeds 23 s at half that context.

03 Providers are doing decode very differently

The textbook says decode cost per token should stay near flat, or rise a touch as the prefix grows. We saw models with decode costs that rise significantly or even fall.

fig. 3 decode ms per output token vs context (linear y · mean per cell · all prompt shapes)
tab. 3 Decode ms/token at representative prompt sizes
model tiny 128 KB 256 KB 601 KB 1 MB
gemini-3.1-flash-lite 4.6 4.7 4.8 3.3 3.3
gemini-3-flash-preview 7.1 8.5 8.0 12.5 11.7
claude-haiku-4-5 11.8 11.1 11.0 12.5
gpt-5.4-mini 7.1 9.1 8.9 31.8 108.4
gpt-4.1-nano 14.1 12.0 18.3 17.7 67.5
gpt-4o-mini 18.5 31.9 34.6
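
For reference, the decode number in tab. 3 is just the streamed tail divided by the number of inter-token gaps. A minimal sketch, assuming you kept the TTFT, the total wall time, and the reported output token count for each call:

```python
def decode_ms_per_token(ttft_ms: float, total_ms: float, output_tokens: int) -> float:
    """Mean inter-token latency over the decode phase of one streamed call."""
    if output_tokens < 2:
        return float("nan")  # need at least two output tokens to observe a decode interval
    return (total_ms - ttft_ms) / (output_tokens - 1)
```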

04 Gemini Flash Lite goes the wrong way

A query with 144 k input tokens is faster than one with 62 k by a good margin. 2.3x more input tokens will give you a 1.5x faster response. Reproducibly!

fig. 4 Gemini Lite negative-scaling zone, TTFT and total latency (song-lyrics prompt · min per token bucket)

Breaking it down by stage, both inferred prefill and decode times drop around the same threshold. The simplest story is a routing transition to different hardware somewhere near the 100 k-token band.

fig. 5 Gemini Lite decode cost vs input tokens (song-lyrics prompt · median · p10–p90 bars)
fig. 6 prefill throughput — KB/sec at the latency floor (log-log · prompt bytes / min TTFT)

It's almost like Google rewards and OpenAI punishes large context sizes.

05 Tokens are not created equal

Ending on a practical note. Each provider uses the same tokenizer across all of its models we tested, but across providers tokens are not apples-to-apples: the same text that costs n tokens on OpenAI costs roughly 14% more tokens on Anthropic, which most people don't account for in their cost math.

fig. 7 bytes per token by provider family and content type (avg at >100 KB prompts)

As expected, 4 characters per token is a good estimate for English text. With our random prompt (hex-encoded random bytes), the ratio dropped to roughly 1 character per token, so keep that in mind if you're sending high-entropy content.
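
If you want to sanity-check the bytes-per-token numbers on the OpenAI side yourself, tiktoken makes it a one-liner; Anthropic and Google counts have to come from their own token-counting endpoints, so treat this as just the OpenAI half of the comparison, and the encoding name here is an assumption about which tokenizer your target model uses:

```python
import os
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # a recent OpenAI encoding; check your model's actual tokenizer

def bytes_per_token(text: str) -> float:
    return len(text.encode("utf-8")) / len(enc.encode(text))

english = "The quick brown fox jumps over the lazy dog. " * 1000
high_entropy = os.urandom(20_000).hex()  # hex-encoded random bytes, like our random prompt

print(bytes_per_token(english))       # roughly 4 bytes per token for English prose
print(bytes_per_token(high_entropy))  # noticeably lower for high-entropy content
```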