Claude and I made 2,000 API calls to nine small closed-weight models across three providers, with prompt sizes ranging from 100 tokens to 1M tokens. We ended up discovering some interesting things about how providers scale inference, or fail to, in 2026.
You can view all the measurements in the interactive viewer. Code and raw dataset on GitHub.
People quote first-token latency numbers as if a model has a latency. It has a curve, and the curves cross. Choose wisely.
gpt-4.1-nano wins on tiny, sub-second queries; gemini-3.1-flash-lite wins on large queries above 600 KB (~150K tokens). Ranks below are by first-token latency, with milliseconds in parentheses.
| model | tiny (<1 KB) | 64 KB | 256 KB | 601 KB | 1 MB |
|---|---|---|---|---|---|
| gpt-4.1-nano | #1 (176 ms) | #2 (359) | #2 (882) | #3 (1,730) | #5 (4,876) |
| gpt-5.4-mini | #2 (233) | #1 (349) | #1 (779) | #2 (1,376) | #4 (3,353) |
| gemini-2.5-flash | #5 (289) | #3 (590) | #6 (2,290) | #5 (2,482) | #6 (4,650) |
| claude-haiku-4-5 | #7 (391) | #4 (608) | #4 (1,241) | #4 (2,229) | rejected |
| gemini-3.1-flash-lite | #8 (461) | #5 (733) | #3 (1,181) | #1 (912) | #1 (1,732) |
| gemini-3-flash-preview | #9 (492) | #6 (911) | #5 (1,268) | #6 (2,127) | #3 (2,861) |
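If you want to act on the crossover, the natural pattern is a size-based router. Here is a minimal sketch in Python; the model names come from the table above, but the byte thresholds are illustrative guesses read off that table, not tuned values from the benchmark.

```python
def pick_model(prompt: str) -> str:
    """Choose a model by prompt size, using rough crossovers from the table above.

    Thresholds are illustrative; re-measure for your own workload.
    """
    size_bytes = len(prompt.encode("utf-8"))
    if size_bytes < 64 * 1024:
        return "gpt-4.1-nano"           # fastest on tiny prompts
    if size_bytes < 512 * 1024:
        return "gpt-5.4-mini"           # leads through the mid-size band
    return "gemini-3.1-flash-lite"      # flattest curve at very large contexts
```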
Plot the minimum first-token latency (excluding network) against input
size, log-log. A textbook dense-attention model has O(n²) prefill
compute, so at large n you'd expect a slope of at least 1: every
doubling of context should cost at least double the prefill wall time,
approaching 4× once the quadratic attention term dominates. What we
actually see is a much flatter curve. Even above 100K tokens of
context, where opaque provider overheads become negligible, prefill
still scales sub-linearly.
Each line is the per-cell minimum of
first_content_delta_ms across prompt shapes.
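For reference, a stripped-down way to measure a first-content delta with a streaming call looks like the sketch below (OpenAI's Python SDK shown; the other providers' streaming clients are analogous). It is not the benchmark harness: among other things, it does not subtract the network round-trip.

```python
import time
from openai import OpenAI

client = OpenAI()

def first_content_delta_ms(model: str, prompt: str) -> float:
    """Milliseconds from sending the request to the first streamed chunk with text."""
    t0 = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return (time.perf_counter() - t0) * 1000.0
    raise RuntimeError("stream ended without any content")
```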
gemini-3.1-flash-lite stays under 5 s at ~870 k input tokens;
gpt-4.1-nano exceeds 23 s at half that context. None of the seven
curves bends like a quadratic.
Fit a power law C · n^α to the floor of each
curve and read off the exponent.
| model | α (all data) | α (≥ 10 k tok) | character |
|---|---|---|---|
| gemini-3.1-flash-lite | 0.15 | 0.29 | remarkably flat |
| claude-haiku-4-5 | 0.19 | 0.58 | smooth, sub-linear |
| gpt-5.4-mini | 0.28 | 0.69 | smooth, sub-linear |
| gemini-2.5-flash | 0.34 | 0.70 | smooth, sub-linear |
| gemini-3-flash-preview | 0.31 | 0.73 | sub-linear, late break |
| gpt-4o-mini | 0.34 | 0.84 | step-laden |
| gpt-4.1-nano | 0.40 | 1.02 | linear-or-worse at top end |
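For concreteness, the fit is just ordinary least squares in log-log space, where the slope is α. A minimal sketch; the sample numbers below are made-up stand-ins for one model's floor, not values from the dataset.

```python
import numpy as np

def fit_power_law(tokens, latency_ms):
    """Fit latency ≈ C * n**alpha; returns (alpha, C)."""
    log_n = np.log(np.asarray(tokens, dtype=float))
    log_t = np.log(np.asarray(latency_ms, dtype=float))
    alpha, log_c = np.polyfit(log_n, log_t, 1)  # slope is the exponent
    return alpha, float(np.exp(log_c))

# Illustrative floor: (input tokens, minimum first-token latency in ms)
tokens   = [200, 2_000, 16_000, 64_000, 150_000, 450_000, 870_000]
floor_ms = [460, 520, 640, 730, 900, 1_300, 1_730]
alpha, c = fit_power_law(tokens, floor_ms)
print(f"alpha = {alpha:.2f}")
```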
Gemini 3.1 Flash Lite walks from 204 input tokens to 866 k input
tokens, a factor of 4,200 in context, for only 0.7 s → 5 s in wall
time. Seven times more latency for four thousand times more context.
GPT-4.1 Nano exceeds 23 s at half that context.
The textbook says decode time per token should be near flat, or rise a touch as the prefix grows. We saw models whose decode costs rise significantly, and some where they actually fall. Milliseconds per output token, by input size:
| model | tiny | 128 KB | 256 KB | 601 KB | 1 MB |
|---|---|---|---|---|---|
| gemini-3.1-flash-lite | 4.6 | 4.7 | 4.8 | 3.3 | 3.3 |
| gemini-3-flash-preview | 7.1 | 8.5 | 8.0 | 12.5 | 11.7 |
| claude-haiku-4-5 | 11.8 | 11.1 | 11.0 | 12.5 | — |
| gpt-5.4-mini | 7.1 | 9.1 | 8.9 | 31.8 | 108.4 |
| gpt-4.1-nano | 14.1 | 12.0 | 18.3 | 17.7 | 67.5 |
| gpt-4o-mini | 18.5 | 31.9 | 34.6 | — | — |
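One way to derive a per-token decode figure from the same streamed responses: time from the first content chunk to the last, divided by the number of tokens generated in between. A sketch with hypothetical inputs; the output-token count would come from the response's usage accounting.

```python
def decode_ms_per_token(first_chunk_ts: float, last_chunk_ts: float,
                        output_tokens: int) -> float:
    """Average inter-token decode time, excluding prefill / time-to-first-token.

    first_chunk_ts / last_chunk_ts: perf_counter() readings taken when the
    first and last content chunks arrive; output_tokens: completion tokens
    reported by the API's usage accounting.
    """
    if output_tokens < 2:
        raise ValueError("need at least two output tokens")
    return (last_chunk_ts - first_chunk_ts) * 1000.0 / (output_tokens - 1)
```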
A query with 144 k input tokens is faster than one with
62 k by a good margin. 2.3x more input tokens will give you
a 1.5x faster response. Reproducibly!
Breaking it down by stage, both inferred prefill and decode times drop
around the same threshold. The simplest story is a routing transition to
different hardware somewhere near the 100 k-token band.
It's almost like Google rewards and OpenAI punishes large context sizes.
Ending on a practical note. Each provider uses the same tokenizer across all the models we tested, but across providers tokens are not apples-to-apples: the same text costs roughly 14% more tokens on Anthropic than on OpenAI, which most people don't account for in their cost math.
As expected, 4 chars per token is a good estimate for English text.
With our random prompt (hex-encoded random bytes), the ratio was
closer to 1, so keep that in mind if you're sending high-entropy
content.
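If you want to sanity-check chars-per-token on your own payloads, a quick local measurement with tiktoken (o200k_base, the encoding behind recent OpenAI models) is enough to see the English-vs-high-entropy gap; other providers' tokenizers will give different ratios, so for cross-provider counts use each provider's own token-counting endpoint.

```python
import os
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

english = "The quick brown fox jumps over the lazy dog. " * 200
high_entropy = os.urandom(4096).hex()  # hex-encoded random bytes, like the benchmark's random prompt

for name, text in [("english", english), ("hex random", high_entropy)]:
    n_tokens = len(enc.encode(text))
    print(f"{name:>11}: {len(text) / n_tokens:.2f} chars/token")
```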