DiffusionGemma: 1,000+ Tokens/Sec Open-Weights Text Gen

June 15, 2026·AI Newsroom·15 min read·8 sources


DiffusionGemma social card illustration — Image: Google blog / DiffusionGemma announcement (June 10, 2026)

On June 10, 2026, Google DeepMind released DiffusionGemma, an open-weights model that abandons the single-token autoregressive (AR) pattern and adopts a text-diffusion approach: instead of generating one token at a time, it denoises 256-token blocks in parallel, reaching 1,000+ tokens/sec on a single NVIDIA H100, roughly 4x faster than an equivalent autoregressive model in the same single-user regime (Google blog, June 10, 2026; NVIDIA blog, June 10, 2026). It’s a 26 billion total parameter, 3.8 billion active release, based on the Gemma 4 Mixture-of-Experts (MoE) architecture, released under the Apache 2.0 license with day-one support in vLLM, Hugging Face Transformers, Unsloth, NVIDIA NeMo, and TensorRT-LLM with NVFP4 format on Blackwell GPUs (Google blog, June 10, 2026; Hugging Face model card). Google explicitly labels it as “experimental”: quality is below that of the standard autoregressive Gemma 4 (Gemma 4 AR below), because speed is the point, not peak quality.

What happened

The June 10, 2026 release. The post on Google’s official blog, authored by Research Scientists Brendan O’Donoghue and Sebastian Flennerhag, describes DiffusionGemma as an “experimental open model that explores text diffusion, an exceptionally fast approach to text generation” (Google blog, June 10, 2026). The Google DeepMind product page confirms “an experimental open model that explores an exceptionally fast approach to text generation” and explicitly notes “DiffusionGemma’s overall output quality is lower than standard Gemma 4” (DeepMind, DiffusionGemma model page; Google blog, June 10, 2026). Weights are available on Hugging Face under google/diffusiongemma-26B-A4B-it with apache-2.0 license explicitly declared in the model card (Hugging Face model card). The release also includes an nvidia/diffusiongemma-26B-A4B-it-NVFP4 variant optimized with NVFP4 4-bit floating-point on Blackwell GPUs (NVIDIA, NVFP4 build).

The architecture. DiffusionGemma inherits the backbone of Gemma 4 26B A4B (released April 2026) — a Mixture-of-Experts with 8 active experts out of 128 total plus 1 shared, for 25.2 billion total parameters and 3.8 billion active at each step — and combines it with a diffusion head that denoises 256-token blocks in parallel instead of predicting one token at a time (Google blog, June 10, 2026; Hugging Face model card; SiliconANGLE, June 10, 2026). The model is multimodal in input — text, images, video — but generates only text output (Hugging Face model card). Context length reaches 256K tokens, with a 1024-token sliding window. It’s an encoder-decoder architecture: the encoder handles prompt prefill and produces the KV cache, the decoder applies bidirectional attention on the generation canvas, accessing context via cross-attention (Hugging Face model card). Compared to Gemma 4 AR, the attention mechanism is no longer purely causal — “The new attention module also reviews the text that follows a given word,” writes SiliconANGLE (SiliconANGLE, June 10, 2026).
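To make "denoising a block in parallel" concrete, here is a minimal toy sketch of generic confidence-based unmasking on a 256-slot canvas. It is not DiffusionGemma's sampler (the model card's Entropy Bound sampler is not described in the sources), and `toy_model`, the step count of 4, and the equal-share commit schedule are all assumptions made for illustration.

```python
import random

MASK = None  # marks a canvas position that has not been denoised yet


def toy_model(canvas, target, rng):
    """Hypothetical stand-in for a denoiser.

    Proposes (token, confidence) for every masked slot in one "forward pass".
    A real model would predict from context; this toy reads `target` and
    attaches a random confidence.
    """
    return {i: (target[i], rng.random()) for i, tok in enumerate(canvas) if tok is MASK}


def denoise_block(target, steps=4, seed=0):
    """Fill a fully masked canvas in `steps` parallel passes."""
    rng = random.Random(seed)
    canvas = [MASK] * len(target)
    filled_after_step = []
    for step in range(steps):
        proposals = toy_model(canvas, target, rng)
        if not proposals:
            break
        # Commit an equal share of the remaining slots, most confident first.
        k = -(-len(proposals) // (steps - step))  # ceiling division
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in best:
            canvas[i] = proposals[i][0]
        filled_after_step.append(sum(tok is not MASK for tok in canvas))
    return canvas, filled_after_step


target_block = list(range(256))  # 256 dummy token ids, matching the block size
canvas, progress = denoise_block(target_block, steps=4)
print(progress)  # slots filled after each of the 4 passes
```

The point of the sketch is the pass count: the canvas is completed in 4 model calls here, where one-token-at-a-time decoding would need 256. How many passes a real sampler needs, and what each pass costs, is exactly what the benchmarks and throughput figures measure.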

The speed numbers. On NVIDIA GPUs, declared numbers are:

Hardware	Tokens/sec	Source
NVIDIA H100 (single)	1,000+ (FP8, low batch size)	Hugging Face model card; Google blog, June 10, 2026
NVIDIA RTX 5090	700+ (quantized)	Google blog, June 10, 2026; SiliconANGLE, June 10, 2026
NVIDIA DGX Spark (GB10)	150	NVIDIA blog, June 10, 2026
NVIDIA DGX Station	2,000	NVIDIA blog, June 10, 2026

The Hugging Face model card more precisely reports “per user generation speeds exceeding 1100 tokens per second in low batch size settings (H100, FP8)” (Hugging Face model card). NVIDIA cites “roughly 4× faster than an equivalent autoregressive model running in the same single-user regime” (NVIDIA blog, June 10, 2026). Google uses the formulation “up to 4× faster” and specifies: “The throughput advantage is strongest at low-to-medium batch sizes on a single accelerator” (Google blog, June 10, 2026). In high-QPS cloud serving regime, Google continues, “autoregressive models can be deployed to saturate compute efficiently, so DiffusionGemma’s parallel decoding offers diminishing returns and can result in higher serving costs” — an important caveat for anyone choosing the architecture based on load profile.
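A rough way to read these figures is as time per 256-token block. The sketch below is plain arithmetic on the vendor-declared numbers from the table. It treats "1,000+" and "700+" as lower bounds, and the implied AR baseline simply divides by the "roughly 4×" factor, which the sources state only for the single-user regime.

```python
BLOCK_TOKENS = 256  # block size stated in the announcement

# Vendor-declared throughput (tokens/sec), copied from the table above.
claimed_tokens_per_sec = {
    "H100 (FP8, low batch)": 1000,
    "RTX 5090 (quantized)": 700,
    "DGX Spark (GB10)": 150,
    "DGX Station": 2000,
}


def block_seconds(tokens_per_sec, block_tokens=BLOCK_TOKENS):
    """Seconds to produce one full block at a given sustained throughput."""
    return block_tokens / tokens_per_sec


def implied_ar_tokens_per_sec(tokens_per_sec, speedup=4.0):
    """AR throughput implied by a claimed speedup factor (an assumption, not a measurement)."""
    return tokens_per_sec / speedup


for hardware, tps in claimed_tokens_per_sec.items():
    print(f"{hardware}: {block_seconds(tps):.3f} s per block, "
          f"implied AR baseline {implied_ar_tokens_per_sec(tps):.0f} tok/s")
```

These are steady-state estimates only: they ignore prompt prefill and assume the declared throughput holds for your batch size and precision.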

The day-one ecosystem. The release is accompanied by integrations in: Hugging Face Transformers (model loadable with DiffusionGemmaForBlockDiffusion), vLLM (with day-zero serving support, “with integration supported by Red Hat” per Google blog; see Red Hat AI collection), MLX (for Apple Silicon), Unsloth (for fine-tuning, with a tutorial for solving Sudoku — a task where AR models struggle because “each token depends on future tokens”), Hackable Diffusion (modular JAX toolbox for composability), and NVIDIA NeMo (Google blog, June 10, 2026). For cloud deployment there are Gemini Enterprise Agent Platform Model Garden and NVIDIA NIM (Google blog, June 10, 2026; NVIDIA NIM container). llama.cpp is coming “soon” (Google blog, June 10, 2026).

TensorRT-LLM and NVFP4. The NVIDIA partnership brings TensorRT-LLM optimization to the hardware stack: “Native support for NVFP4 (4-bit floating-point) accelerates compute throughput, allowing the model to run at faster speeds with near-lossless accuracy” (Google blog, June 10, 2026). NVFP4 is a 4-bit floating-point format native to Blackwell GPUs. It roughly halves weight memory relative to 8-bit formats such as FP8 and accelerates compute, with quality degradation described as “near-lossless” (DeepMind, DiffusionGemma model page).

The benchmarks. The Hugging Face model card publishes a comparison table between DiffusionGemma 26B A4B (instruction-tuned, with the recommended Entropy Bound sampler) and Gemma 4 26B A4B autoregressive. Numbers declared by Google (not independent benchmarks) are as follows (Hugging Face model card):

Benchmark	DiffusionGemma	Gemma 4 AR	Δ
MMLU Pro	77.6%	82.6%	−5.0 pp
AIME 2026 (no tools)	69.1%	88.3%	−19.2 pp
LiveCodeBench v6	69.1%	77.1%	−8.0 pp
Codeforces ELO	1,429	1,718	−289 ELO
GPQA Diamond	73.2%	82.3%	−9.1 pp
Tau2 (average over 3)	56.2%	68.2%	−12.0 pp
HLE no tools	11.0%	8.7%	+2.3 pp
HLE with search	11.9%	17.2%	−5.3 pp
BigBench Extra Hard	47.6%	64.8%	−17.2 pp
MMMLU	81.5%	86.3%	−4.8 pp
MMMU Pro (vision)	54.3%	73.8%	−19.5 pp
OmniDocBench 1.5 (edit distance, ↓)	0.319	0.149	+0.170 (worse)
MATH-Vision	70.5%	82.4%	−11.9 pp
MedXPertQA MM	49.0%	58.1%	−9.1 pp
MRCR v2 8 needle 128k	32.0%	44.1%	−12.1 pp

The pattern is clear: DiffusionGemma loses on nearly every benchmark, by margins ranging from about 5 points (MMLU Pro, MMMLU) up to 19.5 points (MMMU Pro) and 19.2 points (AIME 2026). The only exception is HLE no tools (+2.3 pp), but that result should be read with caution because the absolute scores are very low (8.7% vs 11.0%). On coding and mathematical reasoning, the delta is large: −8.0 pp on LiveCodeBench v6, −19.2 pp on AIME 2026, −289 ELO on Codeforces. On long-context tasks, the loss is clear: −12.1 pp on MRCR v2 8 needle 128k. On document parsing (OmniDocBench 1.5), the edit distance is more than double that of Gemma 4 AR (0.319 vs 0.149; lower is better).
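The deltas in the table can be recomputed directly. The short script below copies the self-reported scores from the model card table, flags which metrics are lower-is-better, and counts where DiffusionGemma trails. The helper names are illustrative, not from any source.

```python
# (benchmark, DiffusionGemma, Gemma 4 AR, higher_is_better), copied from the table above.
rows = [
    ("MMLU Pro", 77.6, 82.6, True),
    ("AIME 2026 (no tools)", 69.1, 88.3, True),
    ("LiveCodeBench v6", 69.1, 77.1, True),
    ("Codeforces ELO", 1429, 1718, True),
    ("GPQA Diamond", 73.2, 82.3, True),
    ("Tau2 (average over 3)", 56.2, 68.2, True),
    ("HLE no tools", 11.0, 8.7, True),
    ("HLE with search", 11.9, 17.2, True),
    ("BigBench Extra Hard", 47.6, 64.8, True),
    ("MMMLU", 81.5, 86.3, True),
    ("MMMU Pro (vision)", 54.3, 73.8, True),
    ("OmniDocBench 1.5 (edit distance)", 0.319, 0.149, False),
    ("MATH-Vision", 70.5, 82.4, True),
    ("MedXPertQA MM", 49.0, 58.1, True),
    ("MRCR v2 8 needle 128k", 32.0, 44.1, True),
]


def delta(dg, ar):
    """Signed difference DiffusionGemma minus Gemma 4 AR, rounded to kill float noise."""
    return round(dg - ar, 3)


def is_behind(dg, ar, higher_is_better):
    """True when DiffusionGemma is worse on this metric."""
    return dg < ar if higher_is_better else dg > ar


behind = [name for name, dg, ar, hib in rows if is_behind(dg, ar, hib)]
ahead = [name for name, dg, ar, hib in rows if not is_behind(dg, ar, hib)]
omnidoc_ratio = 0.319 / 0.149  # edit distance ratio, lower is better

for name, dg, ar, _ in rows:
    print(f"{name}: {delta(dg, ar):+}")
print(f"behind on {len(behind)} of {len(rows)}; ahead on: {ahead}")
print(f"OmniDocBench edit distance ratio: {omnidoc_ratio:.2f}x")
```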

Why it matters

1. Text-diffusion becomes realistic at 26B parameters. Text-diffusion isn’t a new idea (it has been an open research direction for years), but bringing it to a 26B open-weights model, with measured quality, distributed inference support, and TensorRT-LLM NVFP4 builds, is a leap. Until now, the text-diffusion frontier was either in the lab (Gemini Diffusion, proprietary and accessible via limited API) or in smaller models. DiffusionGemma is the first open-weights text-diffusion model at this scale (Google blog, June 10, 2026; SiliconANGLE, June 10, 2026). For anyone working on end-to-end latency, this is the first time parallel decoding is available locally with a documented and reproducible architecture.

2. It shifts the latency calculation for agentic tools and editors with auto-completion. The explicit target is “speed-critical, interactive local workflows such as in-line editing, rapid iteration, and generating non-linear text structures” (Google blog, June 10, 2026). In single-user regime, the bottleneck for AR LLMs is memory bandwidth: each decoding step re-reads the active weights to emit a single token, so the GPU spends most of its time waiting on memory rather than doing math. Text-diffusion shifts the workload from memory-bound to compute-bound: “DiffusionGemma’s design plays directly to the GPU’s strengths” (NVIDIA blog, June 10, 2026). For anyone building agent loops, real-time completions, IDE plugins, code infilling, or inline editing, the practical effect is that the time to a complete 256-token block drops by up to roughly 4x, per the vendor figures above.

3. Bidirectional attention enables non-linear patterns that AR can’t handle. “Generating 256 tokens in parallel with each forward pass allows every token to attend to all others. This provides significant advantages for non-linear domains such as in-line editing, code infilling, amino acid sequences or mathematical graphs” (Google blog, June 10, 2026). Google cites the Sudoku case: Unsloth fine-tuned DiffusionGemma to solve Sudoku — a task where autoregressive models struggle because each token depends on future tokens. With bidirectional attention, the causal constraint disappears: every token sees all others contextually. This makes text-diffusion particularly suited for code infilling, inline editing, structured form completion, table filling — tasks where an AR model must simulate the effect of future tokens before emitting them.
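The difference between causal and bidirectional attention comes down to which positions each token is allowed to see. The sketch below builds both masks for a small canvas in plain Python. It illustrates the general idea only and says nothing about DiffusionGemma's actual attention implementation, which also involves cross-attention to the encoded prompt.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position on the canvas."""
    return [[True] * n for _ in range(n)]


def visible_pairs(mask):
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


n = 4
for name, mask in (("causal", causal_mask(n)), ("bidirectional", bidirectional_mask(n))):
    print(name)
    for row in mask:
        print("  " + " ".join("x" if allowed else "." for allowed in row))

# In a causal mask the first position never sees the last one;
# in a bidirectional mask it does, which is what infilling needs.
print(causal_mask(n)[0][n - 1], bidirectional_mask(n)[0][n - 1])
```

For an infilling task, the token in the middle of a gap needs the text on both sides of it, which the causal mask hides by construction.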

4. It fits on consumer GPUs, in quantized regime. The MoE model activates 3.8 billion parameters at each step. Quantized, it “fits comfortably within 18GB VRAM limits of high-end dedicated consumer GPUs” per the Google blog (Google blog, June 10, 2026), or “24GB VRAM limits of a consumer NVIDIA RTX 5090 or 4090” per the DeepMind page (DeepMind, DiffusionGemma model page). The discrepancy between 18 GB (Google blog) and 24 GB (DeepMind) likely depends on quantization level and runtime overhead, so users must measure on their own setup. On an RTX 5090 (32 GB) or RTX 4090 (24 GB), the quantized model runs; for GPUs with less than 16 GB VRAM, the options are more aggressive quantization or waiting for llama.cpp support. “No cloud, no per-token cost,” writes NVIDIA (NVIDIA blog, June 10, 2026).
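A back-of-the-envelope estimate of weight memory helps frame the 18 GB and 24 GB figures. The sketch below multiplies the 25.2 billion total parameters cited from the model card by the bits per weight. It deliberately ignores KV cache, activations, quantization scale factors, and runtime overhead, so real usage will be higher.

```python
TOTAL_PARAMS = 25.2e9  # total parameter count cited from the model card


def weight_gb(params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return params * bits_per_weight / 8 / 1e9


VRAM_LIMITS_GB = (18, 24, 32)  # Google blog figure, DeepMind figure, RTX 5090

for label, bits in (("4-bit", 4), ("8-bit", 8), ("16-bit", 16)):
    gb = weight_gb(TOTAL_PARAMS, bits)
    fits = [limit for limit in VRAM_LIMITS_GB if gb <= limit]
    print(f"{label}: {gb:.1f} GB of weights, fits under {fits or 'none of the limits'}")
```

On this estimate only a roughly 4-bit quantization leaves room under either limit, with the remaining headroom going to cache and activations. That is consistent with the two sources quoting different thresholds for the same quantized scenario.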

5. It’s explicitly an experimental release, and quality shows it. Google doesn’t hide the trade-off: “DiffusionGemma’s overall output quality is lower than standard Gemma 4. For applications that demand maximum quality, we recommend deploying standard Gemma 4” (Google blog, June 10, 2026). Model card numbers confirm: −19.2 pp on AIME 2026, −8.0 pp on LiveCodeBench v6, −289 ELO on Codeforces, −12.1 pp on MRCR v2 long-context. For tasks requiring reasoning, complex code, or fidelity on long documents, Gemma 4 AR is still the right choice. DiffusionGemma is the right choice where end-to-end latency is the metric that matters — and where output can be validated, refined, or combined with an AR model in a hybrid pipeline.

What to watch

Independent benchmarks on SWE-bench, LiveCodeBench v6, MMLU, GPQA, AIME. Numbers cited above are from Google’s model card, not independent benchmarks. For serious production adoption, third-party runs (Stanford CRFM, Hugging Face Open LLM Leaderboard, Artificial Analysis, Aider benchmark, SWE-bench Verified, Vellum AI) are needed to confirm deltas and — above all — measure real throughput under load. Monitor in the 30-90 days post-release.
Generated code quality on real tasks (code infilling, refactoring, agent loop). −289 ELO on Codeforces is a strong signal. For a model marketed as experimental, measuring quality on practical coding tasks (HumanEval+, MBPP+, BigCodeBench) is the first adoption filter. If code quality holds only on simple tasks, DiffusionGemma is an IDE plugin toy and nothing more.
Hugging Face TGI integration and vLLM serving stability. vLLM has day-zero support; Hugging Face TGI doesn’t. Production serving requires runtime stability, KV cache management for block-autoregressive decoding, and NVFP4 compatibility across vendors. Monitor issues on vllm-project/vllm and huggingface/text-generation-inference for the next 30 days.
Blackwell NVFP4 TensorRT-LLM support and the “near-lossless” claim. The Google blog cites “near-lossless accuracy” for NVFP4, and near-lossless isn’t lossless. For deployments using NVFP4 on Blackwell, measuring quality degradation (especially on reasoning and long-context benchmarks) is essential before moving production loads. The real quality loss will depend on the workload.
When llama.cpp arrives. “llama.cpp support is arriving soon” (Google blog, June 10, 2026). llama.cpp enables CPU inference and a much wider range of consumer hardware (Mac M-series, integrated GPUs, embedded systems). When llama.cpp supports DiffusionGemma, deployment on laptops and non-NVIDIA hardware becomes realistic — an important jump for real adoption.
Fine-tuning for specific use cases: how much quality improves. Google cites the Unsloth Sudoku tutorial. “You can improve DiffusionGemma’s performance on specific tasks through fine-tuning” (Google blog, June 10, 2026). The question: does the quality gap versus Gemma 4 AR close with targeted fine-tuning? If yes, adoption for vertical use cases (code infilling in IDE, completion in CRM, structured data extraction) becomes viable.
Does an open-weights text-diffusion competitor at ≥26B emerge? The category was just born at this scale. If Meta, Mistral, Alibaba (Qwen), or DeepSeek release a text-diffusion open-weights model in the next 6 months, the space fills up. If not, DiffusionGemma is Google’s first-mover advantage in a market segment that counts.
Reports on hallucinations and safety. The model card cites “significant improvements over previous Gemma models” on Google’s internal safety evaluations (Hugging Face model card). That’s a manufacturer claim. Independent benchmarks (HarmBench, AdvBench, third-party red teams) will arrive — monitor.

Risks and caveats

“Experimental” isn’t a throwaway word. Google labels the model as “experimental” in both the blog post title and the DeepMind product page. Quality is explicitly declared lower than Gemma 4 AR. “For applications that demand maximum quality, we recommend deploying standard Gemma 4” (Google blog, June 10, 2026). The article avoids formulations like “production-ready” or “alternative to Gemma 4”: the reality is that it’s a model for specific use cases, not a drop-in replacement.
Model card benchmarks are self-reported. The comparison table with Gemma 4 AR comes from the Hugging Face model card compiled by Google. They’re not independent benchmarks (Stanford CRFM, Open LLM Leaderboard, Artificial Analysis). The article cites the numbers but flags the source: for serious adoption, they need to be replicated on realistic workloads. At the time of publication, no independent third-party benchmarks on DiffusionGemma are available.
The 18 GB vs 24 GB VRAM discrepancy between Google blog and DeepMind. The Google blog cites “18GB VRAM limits of high-end dedicated consumer GPUs”; the DeepMind page cites “24GB VRAM limits of a consumer NVIDIA RTX 5090 or 4090”. They’re different numbers for the same scenario. The gap likely depends on the quantization scheme, runtime overhead (KV cache, activations), or different headroom thresholds; it cannot be a BF16 figure, since 25.2 billion parameters at 16 bits need roughly 50 GB for weights alone. The article cites both and flags the discrepancy: measuring on your own setup is the rule.
The speed advantage collapses in high-QPS cloud serving. Google itself warns: “In high-QPS cloud serving, autoregressive models can be deployed to saturate compute efficiently, so DiffusionGemma’s parallel decoding offers diminishing returns and can result in higher serving costs” (Google blog, June 10, 2026). For anyone thinking of using DiffusionGemma as a production serving model at scale, the choice needs rethinking: the advantage is in single-user / low-batch / local inference, not concurrent cloud serving.
MoE model complicates deployment. 26B total parameters with 3.8B active means only 3.8B are used in each step, but routing changes from token to token, so serving requires either keeping all 26B resident in VRAM or managing complex expert offloading. For deployment on low-VRAM GPUs, dynamic expert offloading adds latency that can erode the advantage. Measure case by case.
−289 ELO on Codeforces and −19.2 pp on AIME 2026. These numbers describe reasoning quality. For tasks requiring long logical chains, symbolic math, or complex problem-solving, DiffusionGemma is objectively weaker than Gemma 4 AR. The article doesn’t propagate the framing “DiffusionGemma is good for everything”: for those tasks, AR remains the right choice.
No Apple Silicon claim. Google explicitly warns: “unified-memory architectures like those in Apple Silicon Macs — which are often memory-bandwidth-bound rather than compute-bound during inference — may not see the same acceleration over autoregressive models like Gemma 4” (Google blog, June 10, 2026, footnote 1). On MacBook Pro M-series, acceleration may not arrive — and MLX is the cited workaround, but with caveats.
No independent benchmarks at time of publication. The article doesn’t extrapolate numbers into claims beyond the model card. Real quality for specific use cases (production code agent, CRM structured extraction, IDE completion) requires internal testing on representative workloads.
The 1,100+ tokens/sec figure is FP8 on H100 at low batch size. “Per user generation speeds exceeding 1100 tokens per second in low batch size settings (H100, FP8)” (Hugging Face model card). The article cites the number, but clarifies: it applies to a specific regime. In BF16, at larger batch sizes, or on different hardware, the number changes.
The model is multimodal in input, not output. DiffusionGemma accepts text + images + video in input but generates only text. The article avoids formulations like “fully multimodal model” that would be misleading. It’s a text-generation model with vision-language input capabilities, not a model that generates images or video.

What to do

For anyone building agentic tools with low-latency loops (IDE plugin, code agent, auto-completion editor). DiffusionGemma is a candidate to try in shadow mode: run DiffusionGemma in parallel to your production AR model on a sample of real requests, measure the latency delta (time-to-block, time-to-first-useful-token) and quality delta (pass rate on internal test suites, completion acceptance, code review acceptance). If quality holds on your workload and latency drops 3-4x, it’s a serious candidate for completion use cases, not for long reasoning tasks. vLLM, Unsloth, and Hugging Face Transformers day-one integrations lower experimentation cost. Start with the FP8 configuration on an H100, or with the nvidia/diffusiongemma-26B-A4B-it-NVFP4 build on a Blackwell GPU such as the RTX 5090, since NVFP4 is native to Blackwell (NVIDIA blog, June 10, 2026).
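A shadow-mode comparison can be sketched in a few lines. In the harness below, `production_fn`, `candidate_fn`, and `accept_fn` are hypothetical callables you would supply (for example thin wrappers around your serving endpoints and your internal test suite). Nothing here is a real DiffusionGemma or vLLM API, and the stub models exist only to make the example runnable.

```python
import statistics
import time


def timed(fn, request):
    """Run fn(request) and return (output, elapsed seconds)."""
    start = time.perf_counter()
    output = fn(request)
    return output, time.perf_counter() - start


def shadow_compare(requests, production_fn, candidate_fn, accept_fn):
    """Run both models on the same non-empty request sample.

    Returns median latency and acceptance rate per model. In real shadow
    mode only the production output would be returned to the user.
    """
    stats = {"production": {"lat": [], "ok": 0}, "candidate": {"lat": [], "ok": 0}}
    for request in requests:
        for name, fn in (("production", production_fn), ("candidate", candidate_fn)):
            output, elapsed = timed(fn, request)
            stats[name]["lat"].append(elapsed)
            stats[name]["ok"] += bool(accept_fn(request, output))
    n = len(requests)
    return {
        name: {
            "median_latency_s": statistics.median(s["lat"]),
            "acceptance_rate": s["ok"] / n,
        }
        for name, s in stats.items()
    }


# Stub models for illustration: the candidate only handles short inputs correctly.
report = shadow_compare(
    requests=["ab", "abcdef"],
    production_fn=lambda r: r.upper(),
    candidate_fn=lambda r: r.upper() if len(r) < 5 else r,
    accept_fn=lambda request, output: output == request.upper(),
)
print(report)
```

Medians are used rather than means so that a single slow request does not dominate a small sample. For a production decision you would also want tail percentiles and a much larger sample than two requests.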

For those building AR + diffusion hybrid pipelines. The most promising pattern isn’t replacing Gemma 4 AR with DiffusionGemma, but using both: DiffusionGemma for fast first drafts (inline completion, code infilling, structured fill), Gemma 4 AR for refinement on requests requiring reasoning. Text-diffusion is complementary to AR, not substitutive, on real workloads. A two-stage draft-and-refine architecture can leverage the best of both: the speed of the first and the quality of the second.
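The two-stage pattern can be expressed as a small router. This is a minimal sketch under stated assumptions: `draft_fn` stands in for a fast diffusion model, `refine_fn` for a slower AR model, and `needs_refine` for whatever validation or routing rule you trust. All three are hypothetical placeholders, shown here with trivial stubs.

```python
def draft_and_refine(prompt, draft_fn, refine_fn, needs_refine):
    """Return (text, stage): the fast draft when it passes, otherwise the refined version."""
    draft = draft_fn(prompt)
    if needs_refine(prompt, draft):
        return refine_fn(prompt, draft), "refined"
    return draft, "draft"


# Stubs for illustration only: drafts ending in "?" are treated as unresolved.
def fast_draft(prompt):
    return prompt + "?" if "prove" in prompt else prompt + " done"


def slow_refine(prompt, draft):
    return draft.rstrip("?") + " (checked)"


def looks_unresolved(prompt, draft):
    return draft.endswith("?")


print(draft_and_refine("fill form", fast_draft, slow_refine, looks_unresolved))
print(draft_and_refine("prove lemma", fast_draft, slow_refine, looks_unresolved))
```

The design only pays off if `needs_refine` is cheap and reliable. If most requests end up refined, the pipeline costs more than calling the AR model directly.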

For those seeking low-cost local inference on consumer GPUs. With an RTX 5090 (32 GB) or RTX 4090 (24 GB), the model runs quantized. “No cloud, no per-token cost” (NVIDIA blog, June 10, 2026). For single-user workloads (IDE completion, local agent, prototyping) the math is: zero token cost, a 256-token block in roughly 0.37 s or less at the 700+ tokens/sec quoted for the RTX 5090, and quality below Gemma 4 AR but acceptable for completion use cases. When llama.cpp arrives, deployment opens up to Mac M-series and embedded systems.

For those doing search & retrieval on long documents. The 256K context is wide, but −12.1 pp on MRCR v2 8 needle 128k signals that long-document retrieval suffers. For a RAG system on real documents, don’t replace a validated AR model with DiffusionGemma. Keep it for use cases where generation of contiguous blocks matters more than retrieval on long texts.

For investors and teams evaluating Google’s strategic positioning. The move to release a text-diffusion open-weights model at 26B before competitors (Meta, Mistral, Alibaba) is a first-mover claim. If text-diffusion adoption grows in the next 6-12 months, Google has the default in the open ecosystem. Monitor: (a) Hugging Face adoption (downloads, derivatives, fine-tunes); (b) issue and PR count on vLLM and HF Transformers for DiffusionGemma; (c) third-party tool integrations (Cursor, Continue.dev, Cline, agent frameworks). If Hugging Face shows a surge of derivatives and fine-tunes, the adoption signal is confirmed.

For those following open-weights security. The model card reports improvements over previous Gemma models on Google’s internal safety evaluations (Hugging Face model card). That’s a manufacturer claim. For responsible AI in production, monitor independent benchmarks (HarmBench, AdvBench) and third-party red team reports that will arrive. Harmful content generation is explicitly listed as a risk in the model card.

Verdict

On June 10, 2026, Google DeepMind released DiffusionGemma: 26B total parameters with 3.8B active, Gemma 4 MoE, 256 tokens denoised in parallel, 1,000+ tokens/sec on a single H100, ~4x faster than an equivalent AR model in single-user regime, Apache 2.0 license, day-one support in vLLM, Hugging Face Transformers, Unsloth, NeMo, TensorRT-LLM NVFP4 on Blackwell. It’s the first open-weights text-diffusion model at this scale, and Google explicitly labels it “experimental”, with quality below Gemma 4 AR on nearly every benchmark (up to −19.2 pp on AIME 2026, −289 ELO on Codeforces, −12.1 pp on long-context). The usage map is clear: where end-to-end latency is the metric that matters (IDE completion, code infilling, single-user agent loops, local inference on consumer GPUs), DiffusionGemma is a serious candidate. Where reasoning quality or long-document fidelity is the metric that matters, Gemma 4 AR remains the right choice. The throughput advantage collapses in high-QPS cloud serving, where AR saturates compute better. Google’s strategic move is to position itself first in a category just born at this scale, and the combination of open weights under Apache 2.0, optimized TensorRT-LLM builds, and a day-one ecosystem (vLLM, HF, Unsloth) is a barrier to entry for competitors. For the next wave of agentic tools, the game plays out in how quickly DiffusionGemma closes the quality gap with fine-tuning, and in when llama.cpp enables inference on Mac and non-NVIDIA hardware.