Fast answer by device

Phone, browser, edge device

Start with E2B or E4B. These are the small models, with 128K context and native audio support, and they form the official mobile and edge line.

Laptop or small desktop

Start with E4B if you want the fastest local success path. Move to 26B A4B only when you can tolerate more memory pressure for stronger reasoning.

Consumer GPU

26B A4B is the main balancing point. It is the official “good place to start” model for many tasks and the most realistic step up from E4B.

Workstation or server

31B is the quality-first dense model. Use it when you are optimizing for coding, reasoning depth, and long-context work instead of low setup cost.

Official Gemma 4 inference memory requirements
| Model   | BF16    | SFP8    | Q4_0    | Best first interpretation |
| ------- | ------- | ------- | ------- | ------------------------- |
| E2B     | 9.6 GB  | 4.6 GB  | 3.2 GB  | Smallest edge-first path when you care most about reach and responsiveness. |
| E4B     | 15 GB   | 7.5 GB  | 5 GB    | Safest local starting point for many laptop-class and quick-evaluation setups. |
| 31B     | 58.3 GB | 30.4 GB | 17.4 GB | Dense flagship for quality-first local reasoning, coding, and fine-tuning. |
| 26B A4B | 48 GB   | 25 GB   | 15.6 GB | Officially recommended "good place to start" for many tasks when you want stronger output without jumping straight to 31B. |
Read the memory table correctly

These numbers cover the static model weights only. The official docs also warn that context window, KV cache, runtime overhead, and fine-tuning all push actual VRAM usage higher.
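That warning can be made concrete with rough arithmetic. The layer count, KV-head count, and head dimension below are illustrative assumptions, not published Gemma 4 specs; the point is only that the KV cache at full 128K context can dwarf the quantized weights.

```python
def estimate_vram_gb(weights_gb, n_layers, n_kv_heads, head_dim,
                     context_len, kv_bytes=2, overhead_gb=1.0):
    """Rough total: static weights + KV cache + a flat runtime overhead.

    Architecture numbers are placeholders; read the real values from the
    model config of whichever checkpoint you actually run.
    """
    # KV cache stores two tensors (K and V) per layer, per token.
    kv_total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * kv_bytes
    return weights_gb + kv_total_bytes / 1024**3 + overhead_gb

# E4B at Q4_0 (5 GB of weights per the table) with a hypothetical
# 30-layer, 8-KV-head, 128-dim attention stack at full 128K context:
print(round(estimate_vram_gb(5.0, 30, 8, 128, 128_000), 1))  # → 20.6
```

Even with invented architecture numbers, the shape of the result holds: at long context, the cache, not the weights, decides whether the model fits.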

Decision rules that hold up in practice

Choose E2B when device reach comes first

Use E2B when the real constraint is not prestige but whether the model can run at all on mobile, edge, or browser-oriented hardware.

Choose E4B when you want the quickest local win

E4B is the easiest model to recommend to people who want to get Gemma 4 running today in LM Studio or Ollama and then decide whether to scale up.

Choose 26B A4B when you want the real Gemma 4 step-up

The Gemma Get Started guide says 26B A4B is a good place to start for many tasks. That makes it the default recommendation for users who can afford the memory and want a serious first impression.

Choose 31B when output quality is the priority

31B is the better fit when you are optimizing for stronger reasoning, coding assistance, and long-context work rather than lighter hardware fit.

Choose by workflow, not by ego

If you mainly want quick local help, a smaller model you actually use every day beats a bigger model you only tolerate in benchmarks.

Quantization changes the answer

The same model can move from impossible to practical once you choose a lighter precision. That is why the Q4_0 column matters as much as the BF16 column.
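One way to see how much precision buys you is to divide each quantized size by the BF16 size, which implies an effective bits-per-weight figure. A sketch using the table's own numbers; note that Q4_0 lands near 5 effective bits rather than exactly 4, since block scales and some tensors kept at higher precision add overhead.

```python
# BF16 is 16 bits per weight; scale the quantized size by the same ratio.
def effective_bits(bf16_gb, quant_gb, bf16_bits=16):
    return bf16_bits * quant_gb / bf16_gb

for model, (bf16, q4) in {"E4B": (15, 5), "31B": (58.3, 17.4),
                          "26B A4B": (48, 15.6)}.items():
    print(model, round(effective_bits(bf16, q4), 1), "bits/weight")
```

Roughly a 3x reduction across the line, which is exactly the gap between "does not fit" and "fits with room for context".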

Best runtime by model

E2B / E4B

Best for AI Edge, mobile experiments, LM Studio, and fast Ollama runs. These are the variants to reach for when your priority is low friction and local responsiveness.

26B A4B

Best for Ollama, hosted testing in AI Studio, and stronger local workflows on consumer GPUs or small servers. This is the model to test first when you want more than a toy setup.

31B

Best for workstations, hosted evaluation, and serious coding or document workflows where quality is the explicit goal and you can pay the memory cost.

Gemini API access path

If you want hosted evaluation before local deployment, use AI Studio or the Gemini API path first, then come back to the local runtimes once you know which model quality tier you need.
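A minimal sketch of the hosted-first path, built on the Gemini API's REST `generateContent` endpoint. The model identifier below is an assumption for illustration; check the AI Studio model list for the exact name of the tier you want to evaluate.

```python
import json

# Root of the Gemini API's public REST surface.
API_ROOT = "https://generativelanguage.googleapis.com/v1beta"

def build_request(model, prompt):
    """Assemble the URL and JSON body for a generateContent call."""
    url = f"{API_ROOT}/models/{model}:generateContent"
    payload = {"contents": [{"parts": [{"text": prompt}]}]}
    return url, json.dumps(payload)

# "gemma-4-e4b-it" is a hypothetical identifier used only as a placeholder.
url, body = build_request("gemma-4-e4b-it", "Summarize this document.")
# POST `body` to `url` with your API key in the x-goog-api-key header,
# then compare answers across tiers before committing to a local setup.
```

Running the same prompt against two tiers this way tells you whether the quality gap justifies the local memory cost before you download anything.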

Read next