Fast answer by device
Phone, browser, edge device
Start with E2B or E4B. These small models offer a 128K context window with native audio support, and they are the official mobile and edge line.
Laptop or small desktop
Start with E4B if you want the fastest local success path. Move to 26B A4B only when you can tolerate more memory pressure in exchange for stronger reasoning.
Consumer GPU
26B A4B is the main balancing point. It is the official “good place to start” model for many tasks and the most realistic step up from E4B.
Workstation or server
31B is the quality-first dense model. Use it when you are optimizing for coding, reasoning depth, and long-context work instead of low setup cost.
| Model | BF16 | SFP8 | Q4_0 | When to pick it first |
|---|---|---|---|---|
| E2B | 9.6 GB | 4.6 GB | 3.2 GB | Smallest edge-first path when you care most about reach and responsiveness. |
| E4B | 15 GB | 7.5 GB | 5 GB | Safest local starting point for many laptop-class and quick-evaluation setups. |
| 26B A4B | 48 GB | 25 GB | 15.6 GB | Officially recommended “good place to start” for many tasks when you want stronger output without jumping straight to 31B. |
| 31B | 58.3 GB | 30.4 GB | 17.4 GB | Dense flagship for quality-first local reasoning, coding, and fine-tuning. |
These numbers cover the static model weights only. The official docs also warn that context window, KV cache, runtime overhead, and fine-tuning all push actual VRAM usage higher.
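To get an intuition for why the KV cache matters so much at 128K context, here is a rough back-of-the-envelope sketch. The layer and head counts below are placeholders for illustration, not published Gemma 4 architecture details:

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context: int, bytes_per_elem: int = 2) -> float:
    """Rough KV-cache size: two tensors (K and V) per layer, per token,
    stored at 16-bit precision by default."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1024**3

# Hypothetical transformer: 48 layers, 8 KV heads, head dim 128, full 128K context.
print(round(kv_cache_gib(layers=48, kv_heads=8, head_dim=128, context=128_000), 1))
# prints 23.4
```

Even with made-up but plausible numbers, the cache alone can rival the quantized weights in size, which is why the docs insist the table above is a floor, not a budget.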
Decision rules that hold up in practice
Choose E2B when device reach comes first
Use E2B when the real constraint is not prestige but whether the model can run at all on mobile, edge, or browser-oriented hardware.
Choose E4B when you want the quickest local win
E4B is the easiest model to recommend to people who want to get Gemma 4 running today in LM Studio or Ollama and then decide whether to scale up.
Choose 26B A4B when you want the real Gemma 4 step-up
The Gemma Get Started guide says 26B A4B is a good place to start for many tasks. That makes it the default recommendation for users who can afford the memory and want a serious first impression.
Choose 31B when output quality is the priority
31B is the better fit when you are optimizing for stronger reasoning, coding assistance, and long-context work rather than lighter hardware fit.
Choose by workflow, not by ego
If you mainly want quick local help, a smaller model you actually use every day beats a bigger model you only tolerate in benchmarks.
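The decision rules above can be collapsed into a small lookup. The device classes and priority labels are this guide's shorthand, not official SKU names:

```python
def pick_gemma4_model(device: str, priority: str = "balance") -> str:
    """Encode the decision rules above: device reach first, then quality
    only when the hardware can pay for it."""
    if device in {"phone", "browser", "edge"}:
        # E2B maximizes reach; E4B is the safer default when memory allows.
        return "E2B" if priority == "reach" else "E4B"
    if device == "laptop":
        # E4B is the quickest local win; 26B A4B trades memory for reasoning.
        return "26B A4B" if priority == "quality" else "E4B"
    if device in {"consumer-gpu", "workstation"}:
        # 26B A4B is the balancing point; 31B is the quality-first pick.
        return "31B" if priority == "quality" else "26B A4B"
    raise ValueError(f"unknown device class: {device}")

print(pick_gemma4_model("laptop"))                      # prints E4B
print(pick_gemma4_model("workstation", "quality"))      # prints 31B
```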
Quantization changes the answer
The same model can move from impossible to practical once you choose a lighter precision. That is why the Q4_0 column matters as much as the BF16 column.
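You can see the size of that effect directly from the table above. Taking the BF16 and Q4_0 columns as given:

```python
# (BF16 GB, Q4_0 GB) pairs copied from the weights table above.
sizes = {
    "E2B": (9.6, 3.2),
    "E4B": (15.0, 5.0),
    "26B A4B": (48.0, 15.6),
    "31B": (58.3, 17.4),
}

for name, (bf16, q4) in sizes.items():
    print(f"{name}: Q4_0 is {bf16 / q4:.1f}x smaller than BF16")
```

Roughly a 3x reduction across the line, which is what turns 31B from a server-only model into something a 24 GB consumer GPU can at least load.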
Best runtime by model
E2B / E4B
Best for AI Edge, mobile experiments, LM Studio, and fast Ollama runs. These are the variants to reach for when your priority is low friction and local responsiveness.
26B A4B
Best for Ollama, hosted testing in AI Studio, and stronger local workflows on consumer GPUs or small servers. This is the model to test first when you want more than a toy setup.
31B
Best for workstations, hosted evaluation, and serious coding or document workflows where quality is the explicit goal and you can pay the memory cost.
Gemini API access path
If you want hosted evaluation before local deployment, use AI Studio or the Gemini API path first, then come back to the local runtimes once you know which model quality tier you need.
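For the hosted route, the Gemini API's `generateContent` REST call takes a simple JSON body. A minimal sketch of the request shape is below; the model id is a placeholder, so check AI Studio for the exact Gemma 4 model string before using it:

```python
import json

# Placeholder model id -- confirm the real Gemma 4 string in AI Studio.
MODEL = "gemma-4-26b-a4b"
URL = (
    "https://generativelanguage.googleapis.com/v1beta/models/"
    f"{MODEL}:generateContent"
)

# Minimal generateContent request body.
payload = {"contents": [{"parts": [{"text": "Summarize this README in three bullets."}]}]}
body = json.dumps(payload)

# POST `body` to URL with your key in the `x-goog-api-key` header
# (or as a `?key=` query parameter) and a JSON content type.
```

Once the hosted answers tell you which quality tier you actually need, swap the model string rather than rewriting the request.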