Nemotron-Labs-Diffusion-Image: Advancing Masked Discrete Diffusion for High-Resolution Image Synthesis

Shufan Li1,2, Greg Heinrich2, Hanrong Ye2, Yonggan Fu2, Aditya Grover1, Jan Kautz2, Pavlo Molchanov2
1UCLA 2NVIDIA
Generation demo 1 Generation demo 2 Generation demo 3

NL-Diffusion-Image generates high-resolution images through iterative masked discrete diffusion, progressively unmasking image tokens from a fully masked sequence into a coherent scene.

Abstract

We present NL-Diffusion-Image, a text-to-image generation model that advances masked discrete diffusion for high-resolution image synthesis. Unlike continuous diffusion models that operate in latent space, NL-Diffusion-Image treats image generation as iterative unmasking of discrete visual tokens, enabling parallel bidirectional decoding throughout the denoising process. We introduce an improved training objective, the Generalized Cross-Entropy (GCE) objective, which stabilizes training and improves sample quality across a wide range of resolutions. Our model supports flexible few-step generation with controllable speed-quality tradeoffs and enables token-level image editing without additional fine-tuning. Extensive benchmarks show that NL-Diffusion-Image achieves state-of-the-art performance among discrete diffusion models while remaining highly competitive with leading continuous diffusion approaches.

Model Architecture

Model architecture diagram

NL-Diffusion-Image adopts a Transformer-based architecture operating on discrete visual token sequences. An image is encoded into a grid of discrete tokens using a VQ-based tokenizer. The masked diffusion model then learns to predict the clean tokens from a corrupted (partially masked) sequence, conditioned on the text prompt via cross-attention. During inference, the model iteratively unmasks tokens in parallel over multiple diffusion steps, progressing from a fully masked sequence to a complete image.
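
The inference loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_predictor` is a hypothetical stand-in for the text-conditioned Transformer (it returns random tokens and confidences), the cosine masking schedule and confidence-based selection are assumptions borrowed from common masked-token samplers, and `MASK`, the vocabulary size, and the sequence length are illustrative values.

```python
import math
import random

MASK = -1  # hypothetical sentinel id for a masked position


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for the Transformer: one (token, confidence) pair per position.

    A real model would condition on the visible tokens and the text prompt;
    this toy version ignores both and samples at random.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_unmask(num_tokens, num_steps, vocab_size=16, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens  # start from a fully masked sequence
    for step in range(1, num_steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # One forward pass predicts every position in parallel.
        preds = toy_predictor(tokens, vocab_size, rng)
        # Assumed cosine schedule: how many tokens stay masked after this step.
        keep_masked = int(num_tokens * math.cos(math.pi / 2 * step / num_steps))
        num_to_reveal = max(1, len(masked) - keep_masked)
        # Commit the most confident predictions among the masked positions.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:num_to_reveal]:
            tokens[i] = preds[i][0]
    return tokens


image_tokens = iterative_unmask(num_tokens=64, num_steps=8)
```

In a full pipeline the resulting token grid would be passed to the VQ decoder to produce pixels; that stage is omitted here.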

GCE Training Objective

GCE objective illustration

A key contribution of this work is the Generalized Cross-Entropy (GCE) objective, which improves upon the standard masked language modeling loss used in prior discrete diffusion models. By down-weighting easy predictions and focusing training on harder, more uncertain tokens, GCE provides better gradient signal throughout the denoising trajectory and leads to substantially improved generation quality without increasing training cost.
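
This page does not state the GCE formula, so the sketch below should not be read as the authors' objective. It only illustrates the general idea of down-weighting easy predictions, using a focal-style weight `(1 - p) ** gamma` as an assumed stand-in. The function name, the `gamma` parameter, and the input layout are all hypothetical.

```python
import math


def weighted_masked_ce(probs_of_target, is_masked, gamma=2.0):
    """Mean focal-style cross-entropy over masked positions (illustrative only).

    probs_of_target[i]: model probability assigned to the true token at position i.
    is_masked[i]:       True if position i was masked and therefore contributes.
    gamma:              assumed focusing parameter; gamma = 0 recovers the
                        standard masked cross-entropy.
    """
    losses = []
    for p, m in zip(probs_of_target, is_masked):
        if not m:
            continue  # unmasked positions carry no loss
        weight = (1.0 - p) ** gamma  # near 0 for confident (easy) predictions
        losses.append(-weight * math.log(p))
    return sum(losses) / len(losses) if losses else 0.0


# An easy token (p = 0.9) and a hard token (p = 0.1), both masked.
standard = weighted_masked_ce([0.9, 0.1], [True, True], gamma=0.0)
focused = weighted_masked_ce([0.9, 0.1], [True, True], gamma=2.0)
```

With this weighting the hard token dominates the loss, since the easy token's contribution is scaled by `(1 - 0.9) ** 2 = 0.01`.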

Few-Step Generation

Few-step generation examples

NL-Diffusion-Image supports flexible few-step generation, allowing users to trade quality for speed by reducing the number of denoising steps. Even with as few as 8 steps, the model produces highly coherent and visually appealing images, while 64 steps yield the best quality. This flexibility makes the model practical for a range of latency-sensitive applications.
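
To make the step-count trade-off concrete, the sketch below computes how many tokens must be committed per step for a given step budget. It assumes a uniform split and a hypothetical 1024-token image (for example a 32×32 grid); the page does not specify the actual schedule or token count, and `reveal_counts` is an illustrative helper.

```python
def reveal_counts(num_tokens, num_steps):
    """Split num_tokens across num_steps as evenly as possible."""
    base, extra = divmod(num_tokens, num_steps)
    # The first `extra` steps reveal one additional token so the counts
    # always sum to num_tokens.
    return [base + (1 if step < extra else 0) for step in range(num_steps)]


for steps in (8, 64):
    counts = reveal_counts(1024, steps)
    print(steps, "steps ->", counts[0], "tokens committed per step")
# 8 steps -> 128 tokens committed per step
# 64 steps -> 16 tokens committed per step
```

Under these assumptions, an 8-step run commits eight times as many tokens per forward pass as a 64-step run, so each decision is made with less already-revealed context. That is one plausible intuition for why more steps tend to give higher quality.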

Token-Level Image Editing

Token editing examples

Because NL-Diffusion-Image operates on discrete visual tokens, it naturally supports token-level image editing without any additional fine-tuning. Given an input image and an editing instruction, the model selectively remasks and regenerates only the relevant tokens, preserving the remainder of the image. This enables precise local edits such as object replacement, style transfer, and background modification while maintaining global consistency.
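
The remask-and-regenerate idea can be sketched on a small token grid. This is an illustration under assumptions: the page does not say how the tokens to edit are selected, so a rectangular region is used here, and `predict_token` is a hypothetical callback standing in for the instruction-conditioned model.

```python
MASK = -1  # hypothetical sentinel id for a masked position


def remask_region(grid, top, left, height, width):
    """Return a copy of the token grid with a rectangular region masked."""
    edited = [row[:] for row in grid]  # copy so the input grid is not mutated
    for r in range(top, top + height):
        for c in range(left, left + width):
            edited[r][c] = MASK
    return edited


def regenerate(grid, predict_token):
    """Fill masked positions only; every unmasked token is kept unchanged."""
    return [
        [predict_token(r, c) if t == MASK else t for c, t in enumerate(row)]
        for r, row in enumerate(grid)
    ]


original = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
masked = remask_region(original, top=0, left=1, height=2, width=2)
edited = regenerate(masked, predict_token=lambda r, c: 0)
# edited == [[1, 0, 0], [4, 0, 0], [7, 8, 9]]
```

Tokens outside the remasked region pass through untouched, which is what preserves the rest of the image. A real edit would fill the region with the iterative unmasking loop rather than a single call per position.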

Speed Comparison

Speed comparison animation

Compared to autoregressive and continuous diffusion baselines, NL-Diffusion-Image achieves significantly lower wall-clock generation time thanks to its parallel unmasking strategy. At each step the model predicts all masked image tokens in parallel rather than one token at a time, resulting in sub-second generation at 256×256 resolution and competitive throughput at 1024×1024 on modern hardware.

Benchmarks

Benchmark results

We evaluate NL-Diffusion-Image on standard text-to-image benchmarks including GenEval and MJHQ. Our model achieves state-of-the-art performance among discrete diffusion approaches and remains competitive with leading continuous diffusion models, demonstrating that masked discrete diffusion is a viable and efficient paradigm for high-resolution image synthesis.

BibTeX

@article{li2026nemotron,
  title   = {Nemotron-Labs-Diffusion-Image: Advancing Masked Discrete Diffusion
             for High-Resolution Image Synthesis},
  author  = {Li, Shufan and Heinrich, Greg and Ye, Hanrong and Fu, Yonggan and
             Grover, Aditya and Kautz, Jan and Molchanov, Pavlo},
  journal = {arXiv preprint arXiv:2606.29814},
  year    = {2026}
}