How Google's Gemma 4 MTP Drafters Deliver 3x Faster AI Inference (Without Quality Loss)

TL;DR: Google released Multi-Token Prediction (MTP) drafters for Gemma 4 on May 5, 2026. These lightweight companion models use speculative decoding to generate up to 3x more tokens per second on the same hardware, with zero degradation in output quality. Available under Apache 2.0 on Hugging Face. Here's exactly how they work and what it means for developers running LLMs locally.

Two Weeks In, 60 Million Downloads. Then This

On April 2, 2026, Google dropped Gemma 4: four models spanning from 2B to 31B parameters, designed to run everywhere from phones to workstations. The adoption was immediate. Within weeks, the model family crossed 60 million downloads across Hugging Face, Kaggle, and Ollama.

But there was always an elephant in the room: inference speed. Even with a well-optimized 31B model on a high-end consumer GPU, token generation is bottlenecked by memory bandwidth, not compute. You're spending most of your GPU cycles moving parameters from VRAM to compute units, and you pay that cost again for every single token.

On May 5, Google announced a solution that doesn't require changing the model architecture, retraining, or buying better hardware: Multi-Token Prediction (MTP) drafters.

The Core Problem: LLMs Are Memory-Bound, Not Compute-Bound

To understand why MTP drafters matter, you need to understand what's actually slow about running an LLM.

Every time an LLM generates a token, it has to load billions of parameters from VRAM into the GPU's compute units. The GPU has no trouble doing the math; the bottleneck is moving all that data across the memory bus. A consumer card like an RTX 4090 offers roughly 1 TB/s of memory bandwidth, while a 31B-parameter model at 16-bit precision is 60+ GB in size (more than the card's 24 GB of VRAM, so in practice it needs quantization or offloading). Taking the bandwidth figure on its own, reading 60 GB at 1 TB/s works out to roughly 60 ms per token just for data movement.

This is called being memory-bandwidth bound, and it's why your GPU utilization often looks low during inference. The compute units are sitting idle, waiting for data.
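The arithmetic behind that estimate is short enough to write down. The sketch below assumes every weight is read exactly once per generated token and ignores KV-cache traffic and compute time, so it gives a floor on latency rather than a prediction. The function names are illustrative and not taken from any library.

```python
def min_ms_per_token(model_gb: float, bandwidth_gb_per_s: float) -> float:
    """Lower bound on per-token latency if all weights are read once per token."""
    return model_gb / bandwidth_gb_per_s * 1000.0


def max_tokens_per_second(model_gb: float, bandwidth_gb_per_s: float) -> float:
    """Upper bound on single-stream decode speed under the same assumption."""
    return bandwidth_gb_per_s / model_gb


# Figures from the text: about 60 GB of weights and about 1 TB/s (1000 GB/s).
latency_ms = min_ms_per_token(60.0, 1000.0)
ceiling_tps = max_tokens_per_second(60.0, 1000.0)
print(f"{latency_ms:.0f} ms per token, at most {ceiling_tps:.1f} tokens/s")
```

Halving the bytes read per token (for example through quantization) halves the floor, which is why weight size and bandwidth, rather than FLOPs, set the pace for single-stream decoding.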

What Is Speculative Decoding?

The key insight behind speculative decoding (introduced by Google researchers in their 2022 paper Fast Inference from Transformers via Speculative Decoding) is beautifully simple: use a tiny, fast model to propose multiple tokens, then have the big model verify them all in one pass.

Here's the pipeline:

  1. A small "drafter" model (the MTP model) generates a draft of 3-5 future tokens. This model is lightweight: it runs in a fraction of the time the big model takes for a single token.
  2. The big "target" model (e.g., Gemma 4 31B) receives the draft and verifies all proposed tokens in a single forward pass.
  3. If the target agrees with the draft, it accepts the entire sequence and generates one more token of its own in the process.
  4. If it disagrees, it rejects the incorrect tokens and continues from the last valid one.
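
The four steps above can be sketched in a few lines. This is a minimal greedy-decoding version with toy stand-in models, not the Gemma 4 implementation: `toy_target`, `toy_drafter`, `speculative_step`, `generate`, and `greedy` are hypothetical names, and the per-position loop stands in for what a real engine computes in one batched forward pass.

```python
from typing import Callable, List

# A "model" here is any function that maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, drafter: NextToken,
                     prefix: List[int], k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens it produced."""
    # Step 1: the drafter proposes k tokens, one after another.
    draft: List[int] = []
    for _ in range(k):
        draft.append(drafter(prefix + draft))

    # Step 2: the target checks each draft position. A real engine gets all of
    # these predictions from a single batched forward pass.
    accepted: List[int] = []
    for proposed in draft:
        expected = target(prefix + accepted)
        if proposed != expected:
            # Step 4: first disagreement. Keep the target's token, drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(proposed)

    # Step 3: the whole draft matched, so the target contributes a bonus token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(target: NextToken, drafter: NextToken,
             prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Generate max_new tokens with repeated speculative rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target, drafter, out, k))
    return out[:len(prompt) + max_new]


def greedy(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Plain one-token-at-a-time decoding, used as the reference."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def toy_target(prefix: List[int]) -> int:
    return (3 * prefix[-1] + len(prefix)) % 17


def toy_drafter(prefix: List[int]) -> int:
    # Agrees with the target except when the prefix length is a multiple of 4.
    guess = toy_target(prefix)
    return (guess + 1) % 17 if len(prefix) % 4 == 0 else guess


prompt = [1, 2, 3]
assert generate(toy_target, toy_drafter, prompt, 20) == greedy(toy_target, prompt, 20)
```

Every token that survives verification is the one the target would have picked anyway, so the output matches plain greedy decoding token for token. The drafter only changes how many target passes are needed to get there.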

The magic? Because the verification happens in parallel (one forward pass handles the entire draft), you can output 4-5 tokens in roughly the same wall-clock time it normally takes to output just 1. That's how you get the 2-3x speedup.
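
To see where a 2-3x figure can come from, it helps to estimate how many tokens one verification pass yields. The sketch below uses the simplified model from the original speculative decoding paper, in which each draft token is accepted independently with probability `alpha`. The values of `alpha` and `draft_cost` (the drafter's cost per token relative to one target pass) are illustrative assumptions, not measurements for Gemma 4.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens produced by one target pass over a k-token draft."""
    if alpha >= 1.0:
        return float(k + 1)  # every draft token accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Tokens per round divided by the round's cost in units of target passes."""
    return expected_tokens_per_round(alpha, k) / (k * draft_cost + 1.0)


# Illustrative inputs: 80% acceptance, 4 draft tokens, drafter at 5% of target cost.
print(round(expected_tokens_per_round(0.8, 4), 2))
print(round(estimated_speedup(0.8, 4, 0.05), 2))
```

Under this model the speedup depends far more on the acceptance rate than on the draft length: once `alpha` drops, longer drafts mostly add drafter cost for tokens that get thrown away.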

What Makes Gemma 4's MTP Drafters Different?

Google has released MTP drafters for all four Gemma 4 variants: E2B, E4B, 26B MoE, and 31B Dense. While the speculative decoding concept isn't new, the Gemma 4 implementation has several architectural innovations that make it unusually effective in practice.

Shared Activations and KV Cache

The drafters share the target model's KV cache, the stored intermediate attention computations that make transformer inference efficient. Without this, the drafter would waste precious time recomputing context that the target model has already processed. By sharing, the drafter essentially gets a running start on every prediction step.

On the Gemma 4 31B dense model, the MTP drafter achieves a consistent 3x speedup on GPU hardware tested with LiteRT-LM, MLX, Hugging Face Transformers, and vLLM.

Clustering for Edge Models

The smallest Gemma 4 models, E2B (2B parameters) and E4B (4B parameters), are designed to run on phones, Chromebooks, and edge devices. These are the kinds of devices where every millisecond matters.

Google implemented an efficient clustering technique in the embedder layer specifically for these models. This targets a bottleneck that's unique to edge hardware: the final logit calculation, where internal model representations get mapped to vocabulary probabilities. The vocabulary is large (tens of thousands of tokens), and on a phone's NPU, this step alone can dominate generation time. The clustering approach accelerates this final mapping, improving end-to-end generation speed where it matters most.
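
The source does not spell out how the clustering works, so the following is only a generic illustration of how clustering can shrink the final logit step, not Google's implementation. It assumes vocabulary rows are grouped into clusters ahead of time, scores one centroid per cluster first, and then computes full logits only for tokens in the best-scoring clusters. All names (`build_centroids`, `clustered_argmax`, `cluster_of`) are hypothetical, and the result is approximate in general.

```python
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def build_centroids(embeddings: List[Vector], cluster_of: List[int]) -> Dict[int, Vector]:
    """Average the output-embedding rows that belong to each cluster."""
    sums: Dict[int, Vector] = {}
    counts: Dict[int, int] = {}
    for row, cluster in zip(embeddings, cluster_of):
        acc = sums.setdefault(cluster, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[cluster] = counts.get(cluster, 0) + 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}


def clustered_argmax(hidden: Vector, embeddings: List[Vector], cluster_of: List[int],
                     centroids: Dict[int, Vector], top_clusters: int = 1) -> int:
    """Pick the best token while scoring only part of the vocabulary."""
    # Stage 1: one dot product per cluster instead of one per token.
    ranked = sorted(centroids, key=lambda c: dot(hidden, centroids[c]), reverse=True)
    keep = set(ranked[:top_clusters])
    # Stage 2: full logits only for tokens inside the selected clusters.
    best_token, best_logit = -1, float("-inf")
    for token, row in enumerate(embeddings):
        if cluster_of[token] in keep:
            logit = dot(hidden, row)
            if logit > best_logit:
                best_token, best_logit = token, logit
    return best_token


# Tiny example: six tokens in two well-separated clusters.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.8, -0.1], [0.0, 1.0], [0.1, 0.9], [-0.1, 0.8]]
cluster_of = [0, 0, 0, 1, 1, 1]
centroids = build_centroids(embeddings, cluster_of)
hidden = [0.2, 1.0]
full_argmax = max(range(len(embeddings)), key=lambda t: dot(hidden, embeddings[t]))
assert clustered_argmax(hidden, embeddings, cluster_of, centroids) == full_argmax
```

With a vocabulary of V tokens split into C clusters, stage 1 costs C dot products and stage 2 costs roughly V/C per selected cluster, which is where the savings over scoring all V rows would come from.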

MoE Routing on Apple Silicon

The 26B Mixture-of-Experts model has unique routing characteristics on Apple Silicon. At a batch size of 1 (standard for interactive chat), performance is limited by how the MoE router distributes tokens across experts. But bumping the batch size to 4-8 unlocks up to 2.2x speedup on Apple Silicon hardware, a meaningful gain for developers running Gemma 4 locally on a MacBook Pro.

Benchmarks: How Much Faster Is It?

Google published tokens-per-second measurements across several hardware configurations:

Model               Without MTP   With MTP Drafter   Speedup
Gemma 4 2B (E2B)    ~45 tok/s     ~120 tok/s         2.7x
Gemma 4 4B (E4B)    ~25 tok/s     ~70 tok/s          2.8x
Gemma 4 26B MoE     ~15 tok/s     ~40 tok/s          2.6x
Gemma 4 31B Dense   ~10 tok/s     ~30 tok/s          3.0x

Tested on NVIDIA A100 (80GB) using vLLM. Consumer hardware should see roughly proportional gains, though absolute throughput will differ.

The important thing to note: these are lossless speedups. Because the target model retains the final verification step, every output token has been validated by the full target model. The output quality, reasoning accuracy, and factual recall are identical to running the model without the drafter. There is no quantization, no distillation, no pruning; the big model still calls the shots.
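
The reason verification can be lossless even when sampling is the accept/reject rule from the original speculative decoding paper: a token drawn from the drafter's distribution `q` is kept with probability min(1, p/q), where `p` is the target's distribution, and on rejection a replacement is drawn from the normalized positive part of `p - q`. The sketch below is a generic illustration with made-up distributions, not code from any Gemma 4 runtime. `emitted_distribution` computes the resulting output distribution exactly so it can be compared with `p`.

```python
import random
from typing import List


def accept_probability(p: List[float], q: List[float], token: int) -> float:
    """Chance that the target keeps a token the drafter sampled from q."""
    return min(1.0, p[token] / q[token])


def residual_distribution(p: List[float], q: List[float]) -> List[float]:
    """Distribution to resample from after a rejection: normalized max(0, p - q)."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    if total == 0.0:
        return list(p)  # p equals q, so a rejection can never happen
    return [r / total for r in raw]


def verify_token(p: List[float], q: List[float], token: int, rng: random.Random) -> int:
    """Keep the drafted token or replace it, following the rule above."""
    if rng.random() < accept_probability(p, q, token):
        return token
    return rng.choices(range(len(p)), weights=residual_distribution(p, q))[0]


def emitted_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of the emitted token, computed without sampling."""
    accept_mass = [qi * min(1.0, pi / qi) if qi > 0.0 else 0.0 for pi, qi in zip(p, q)]
    reject_prob = 1.0 - sum(accept_mass)
    residual = residual_distribution(p, q)
    return [a + reject_prob * r for a, r in zip(accept_mass, residual)]


# Made-up three-token example: the drafter is overconfident about token 1.
p = [0.5, 0.3, 0.2]  # target distribution
q = [0.2, 0.6, 0.2]  # drafter distribution
result = emitted_distribution(p, q)
assert all(abs(r - pi) < 1e-9 for r, pi in zip(result, p))
```

A poor drafter lowers the acceptance rate and therefore the speedup, but under this rule it cannot shift the distribution of what gets emitted.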

How to Use MTP Drafters

The MTP drafters are available now under the Apache 2.0 license, the same permissive license as the base Gemma 4 models. This means you can use them in commercial projects, subject only to the license's attribution and notice requirements.

To get started:

  1. Download the drafter weights from the Gemma 4 collection on Hugging Face. Each model variant has a corresponding drafter (e.g., gemma-4-31b-it-mtp-drafter).
  2. Use a compatible inference engine. Google has validated support in:
     • LiteRT-LM: Google's lightweight runtime for on-device deployment
     • MLX: Apple's ML framework for Apple Silicon
     • Hugging Face Transformers: with speculative decoding support added in the latest release
     • vLLM: the popular high-throughput serving framework
  3. Load both models. The target model loads as normal. The drafter loads alongside it; the shared KV cache keeps memory overhead minimal.
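
As a rough sketch of what those steps could look like in Hugging Face Transformers, the function below uses the library's existing assisted-generation hook (the `assistant_model` argument of `generate`). Several things here are assumptions rather than confirmed details: the repository IDs (including the `google/` prefix) are guesses based on the naming example above, and the Gemma 4 drafters may need a dedicated loading path instead of this generic one. Check the model cards before relying on it.

```python
def generate_with_drafter(prompt: str,
                          target_id: str = "google/gemma-4-31b-it",               # assumed repo ID
                          drafter_id: str = "google/gemma-4-31b-it-mtp-drafter",  # assumed repo ID
                          max_new_tokens: int = 128) -> str:
    """Generate text with a target model and a drafter via assisted generation."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(target_id)
    # Assumption: both checkpoints load as ordinary causal language models.
    target = AutoModelForCausalLM.from_pretrained(target_id, device_map="auto")
    drafter = AutoModelForCausalLM.from_pretrained(drafter_id, device_map="auto")

    inputs = tokenizer(prompt, return_tensors="pt").to(target.device)
    # Passing assistant_model switches generate() to draft-and-verify decoding.
    output_ids = target.generate(**inputs,
                                 assistant_model=drafter,
                                 max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate_with_drafter("Explain speculative decoding in one paragraph.")` would download both checkpoints on first use. `device_map="auto"` additionally requires the `accelerate` package.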

On a MacBook Pro with M4 Max (128GB unified memory), running Gemma 4 26B with the MTP drafter yields a comfortable 35+ tok/s, fast enough for real-time chat applications without any cloud dependency.

Why This Matters for Local AI

The practical impact of MTP drafters extends beyond raw speed numbers.

Lower latency for interactive use. At 10 tok/s, a chatbot response feels sluggish: there's a visible pause between each word appearing. At 30 tok/s, the output looks nearly real-time. For developers building AI-native applications where UX matters, this is the difference between "impressive but slow" and "actually usable."

Better hardware utilization. Consumer GPUs (and especially Apple Silicon's unified memory architecture) are severely underutilized during standard autoregressive inference. The drafter architecture keeps the compute units fed with work, effectively getting more value out of the hardware you already own.

On-device AI becomes practical. The E2B and E4B models with MTP drafters reach speeds that make on-device text generation viable for real applications, not just toy demos. For mobile developers building AI features that need to work offline, this opens up possibilities that were previously out of reach.

Apache 2.0 means no vendor lock-in. Unlike proprietary inference optimization services (such as the batch APIs from Anthropic and OpenAI), these drafters run entirely on your hardware with code you control. No API calls, no per-token billing, no data leaving your machine.

The Bigger Picture: Speculative Decoding Is the Future

Gemma 4's MTP drafters aren't an isolated release; they're part of a broader industry shift.

Google has been steadily advancing speculative decoding techniques: on the same day as the Gemma 4 MTP announcement, the Google Developers Blog published a separate post on diffusion-style speculative decoding for TPUs, achieving similar 3x speedups on Google's custom hardware.

AWS has integrated speculative decoding into Trainium via vLLM. NVIDIA is building drafter-like architectures into TensorRT-LLM. The technique is becoming a standard layer in the inference stack, as fundamental as KV caching or quantization.

The reason is straightforward: as models get larger and more capable, the gap between what they can do and how fast they feel will only widen. Speculative decoding bridges that gap without architectural compromises. It's not a workaround; it's a genuine optimization that exploits the inherent redundancy in natural language generation.

Frequently Asked Questions

Does the MTP drafter reduce output quality?

No. The target model (e.g., Gemma 4 31B) verifies every token in the draft. If the drafter proposes a token the big model doesn't agree with, it's rejected. The output distribution is mathematically identical to running the target model alone.

How much extra VRAM does the drafter need?

The drafters are tiny compared to the target model, and they share its KV cache and activations. Total VRAM increase is roughly 5-10% of the target model's footprint. For the 31B model, that's about 1-2 GB extra.

Can I use MTP drafters with any LLM, or just Gemma 4?

The drafters are specifically trained for the Gemma 4 family. The technique (speculative decoding) is model-agnostic, but the checkpoint weights are Gemma 4-specific. Google has open-sourced the drafter weights; third-party implementations of speculative decoding exist for Llama, Mistral, and other model families, but with different quality/speed tradeoffs.

Which runtimes support MTP drafters?

LiteRT-LM, MLX, Hugging Face Transformers, and vLLM. Support for Ollama and llama.cpp is expected from the community in the coming weeks.

Does this work on phones?

Yes. The E2B (2B) and E4B (4B) drafters include a clustering optimization specifically for edge/mobile inference. Google tested these on Pixel devices and confirmed meaningful speed improvements.

Is the MTP drafter open source?

Yes. Both the base Gemma 4 models and the MTP drafter weights are released under the Apache 2.0 license.

Verdict

The Gemma 4 MTP drafters solve the most practical bottleneck in local LLM deployment, inference speed, without asking developers to compromise on model quality, give up local control, or buy new hardware. A genuine 3x speedup that's lossless, open-source, and production-ready is rare in AI. This one delivers.

For developers running Gemma 4 locally, there's no reason not to use the drafter. For developers who haven't tried Gemma 4 yet, the drafter removes the biggest objection: "it feels too slow."

Download the weights on Hugging Face and see the difference for yourself.

โ† Back to all posts