| Variant | MMLU-Pro | GPQA Diamond | Aider Polyglot | LMArena Text (Elo) |
|---|---|---|---|---|
| Gemma 4 31B IT Thinking | 77.3 | 65.0 | 74.9 | 1452 |
| Gemma 4 26B A4B IT Thinking | 75.2 | 62.5 | 68.6 | 1441 |
| Gemma 4 4B IT Thinking | 58.1 | 40.9 | 37.3 | 1337 |
## What these numbers suggest

### The larger models are the headline
The published results make clear that the strongest benchmark story sits at the larger end of the family, particularly the 31B and 26B A4B variants.
### Efficiency is part of the pitch
The attention was not only about absolute quality; it was also about how much performance Google says it can deliver relative to model size, and how practical that makes the models for local use.
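One way to make that efficiency claim concrete is to normalise a benchmark score by the parameters actually exercised at inference time. The sketch below does this for the MMLU-Pro column from the table above; note that the active-parameter counts are assumptions (in particular, reading "26B A4B" as a sparse model with roughly 4B active parameters), not published figures.

```python
# Hedged sketch, not an official metric: benchmark points per billion
# *active* parameters, using the MMLU-Pro scores from the table above.
# Active-parameter counts are assumptions: "26B A4B" is read as ~4B
# active parameters; the dense variants use their full size.

def points_per_active_b(score: float, active_params_b: float) -> float:
    """Benchmark score divided by billions of active parameters."""
    return score / active_params_b

variants = {
    "Gemma 4 31B IT Thinking":     (77.3, 31.0),
    "Gemma 4 26B A4B IT Thinking": (75.2, 4.0),  # assumed ~4B active
    "Gemma 4 4B IT Thinking":      (58.1, 4.0),
}

for name, (score, active_b) in variants.items():
    print(f"{name}: {points_per_active_b(score, active_b):.2f} pts per active-B")
```

Under these assumed counts, the A4B variant would come out far ahead on this ratio, which is the shape of the pitch; the ratio itself says nothing about latency, memory footprint, or quality on your actual tasks.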
### Smaller models still matter
The smaller variants are not trying to win the same conversation. Their role is different: they make the family more accessible and more device-friendly.
## What benchmarks do not settle

### They do not fully measure fit
A model can post strong benchmark scores and still be the wrong choice for your device, your patience, or your day-to-day tasks.
### They do not remove trade-offs
Benchmarks rarely tell a reader how comfortable a model feels in daily use, how consistent it is across mixed tasks, or how much setup burden it adds to normal life.
If the numbers catch your attention, that is fair. They are a reason to take Gemma 4 seriously, not a reason to assume the largest model is automatically the right fit for you.
## How to use this page
- Use it to understand why Gemma 4 entered the conversation so quickly.
- Use it to separate the family’s strongest performance story from the simpler edge-oriented story.
- Use it as context before reading the comparison pages, not as the final answer by itself.