Selected benchmark signals from official launch materials
Variant                     | MMLU-Pro | GPQA Diamond | Aider Polyglot | LMArena Text
Gemma 4 31B IT Thinking     | 77.3     | 65.0         | 74.9           | 1452
Gemma 4 26B A4B IT Thinking | 75.2     | 62.5         | 68.6           | 1441
Gemma 4 4B IT Thinking      | 58.1     | 40.9         | 37.3           | 1337

What these numbers suggest

The larger models are the headline

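The published results put the strongest benchmark story at the larger end of the family. The 31B variant leads on all four benchmarks listed, and the 26B A4B variant trails it by roughly 2 to 6 points on MMLU-Pro, GPQA Diamond, and Aider Polyglot, and by 11 points on the LMArena Text rating.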

Efficiency is part of the pitch

The attention was not only about absolute quality. It was also about how much performance Google says it can deliver relative to model size, and how practical that makes local use.
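One rough way to look at performance relative to size is to divide a score by the parameter count. The sketch below does that with the MMLU-Pro and GPQA Diamond figures from the table above. The sizes in billions are read off the variant names and are an assumption, not official figures, and the helper names (`points_per_billion`, `retained_share`) are made up for illustration. Treat the ratio as a crude signal, not an efficiency metric that Google publishes.

```python
# Scores are copied from the table on this page.
# size_b (billions of parameters) is read off the variant name: an assumption,
# not an official figure.
SCORES = {
    "Gemma 4 31B IT Thinking": {"size_b": 31, "mmlu_pro": 77.3, "gpqa_diamond": 65.0},
    "Gemma 4 26B A4B IT Thinking": {"size_b": 26, "mmlu_pro": 75.2, "gpqa_diamond": 62.5},
    "Gemma 4 4B IT Thinking": {"size_b": 4, "mmlu_pro": 58.1, "gpqa_diamond": 40.9},
}


def points_per_billion(variant: str, benchmark: str) -> float:
    """Benchmark score divided by the assumed size in billions of parameters."""
    row = SCORES[variant]
    return row[benchmark] / row["size_b"]


def retained_share(small: str, large: str, benchmark: str) -> float:
    """Fraction of the larger variant's score that the smaller variant keeps."""
    return SCORES[small][benchmark] / SCORES[large][benchmark]


if __name__ == "__main__":
    for name in SCORES:
        ratio = points_per_billion(name, "mmlu_pro")
        print(f"{name}: {ratio:.2f} MMLU-Pro points per billion parameters")

    share = retained_share(
        "Gemma 4 4B IT Thinking", "Gemma 4 31B IT Thinking", "mmlu_pro"
    )
    print(f"4B keeps {share:.0%} of the 31B MMLU-Pro score")
```

On these numbers the 4B variant keeps about three quarters of the 31B MMLU-Pro score at roughly an eighth of the assumed size. A ratio like this says nothing about memory use, speed, or quality on your own tasks.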

Smaller models still matter

The smaller variants are not competing on the same terms. Their role is different: they make the family more accessible and more device-friendly.

What benchmarks do not settle

They do not fully measure fit

A model can post strong benchmark scores and still be the wrong choice for your device, your patience, or your day-to-day tasks.

They do not remove trade-offs

Benchmarks rarely tell a reader how comfortable a model is to use, how consistent it stays across mixed tasks, or how much setup burden it adds to everyday work.

How to read the benchmark story

If the numbers catch your attention, that is fair. They are a reason to take Gemma 4 seriously, not a reason to assume the largest model is automatically the right fit for you.

How to use this page

  • Use it to understand why Gemma 4 entered the conversation so quickly.
  • Use it to separate the family’s strongest performance story from the simpler edge-oriented story.
  • Use it as context before reading the comparison pages, not as the final answer by itself.
