LaViDa: A Large Diffusion Language Model for Multimodal Understanding

1UCLA 2Panasonic AI Research 3Salesforce Research 4Adobe Research

LaViDa is a multimodal discrete diffusion model (DM) for vision-language tasks.

Abstract

Modern Vision-Language Models (VLMs) can solve a wide range of tasks requiring visual reasoning. In real-world scenarios, desirable properties for VLMs include fast inference and controllable generation (e.g., constraining outputs to adhere to a desired format). However, existing autoregressive (AR) VLMs like LLaVA struggle in these aspects. Discrete diffusion models (DMs) offer a promising alternative, enabling parallel decoding for faster inference and bidirectional context for controllable generation through text infilling. While effective in language-only settings, DMs' potential for multimodal tasks is underexplored. We introduce LaViDa, a family of VLMs built on DMs. We build LaViDa by equipping DMs with a vision encoder and jointly fine-tuning the combined parts for multimodal instruction following. To address the challenges encountered, LaViDa incorporates novel techniques such as complementary masking for effective training, prefix KV cache for efficient inference, and timestep shifting for high-quality sampling. Experiments show that LaViDa achieves competitive or superior performance to AR VLMs on multimodal benchmarks such as MMMU, while offering unique advantages of DMs, including a flexible speed-quality tradeoff, controllability, and bidirectional reasoning. On COCO captioning, LaViDa surpasses Open-LLaVa-Next-8B by +4.1 CIDEr with a 1.92x speedup. On bidirectional tasks, it achieves a +59% improvement on Constrained Poem Completion. These results demonstrate LaViDa as a strong alternative to AR VLMs. Code and models will be released in the camera-ready version.

Model Architecture

LaViDa's model architecture follows a design similar to that of common AR VLMs like LLaVA. It consists of a vision encoder and a diffusion language model, connected by an MLP projection network.

Vision Encoder: The input image is resized into multiple views, encoded by a vision encoder, pooled, and projected into a compact visual context sequence.

Diffusion Language Model: A non-causal Transformer processes the visual context, prompt, and masked response to generate the final output using a diffusion-based decoding objective.
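
The two components can be pictured with a minimal PyTorch sketch of the forward pass. The module interfaces (`vision_encoder`, `diffusion_lm`), the two-layer MLP projector, and the per-view pooling shown here are illustrative assumptions, not LaViDa's released implementation.

```python
import torch
import torch.nn as nn

class LaViDaSketch(nn.Module):
    """Illustrative forward pass: multi-view vision encoding -> pooling ->
    MLP projection -> non-causal diffusion language model."""

    def __init__(self, vision_encoder, diffusion_lm, vis_dim, lm_dim):
        super().__init__()
        self.vision_encoder = vision_encoder      # ViT-style image encoder (assumed)
        self.projector = nn.Sequential(           # MLP connector between modalities
            nn.Linear(vis_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim)
        )
        self.diffusion_lm = diffusion_lm          # non-causal Transformer (assumed)

    def forward(self, image_views, prompt_ids, noisy_response_ids, t):
        # Encode every resized view: (B, V, 3, H, W) -> (B, V, N, D_v)
        B, V = image_views.shape[:2]
        feats = self.vision_encoder(image_views.flatten(0, 1))
        feats = feats.view(B, V, *feats.shape[1:])

        # Pool each view's token grid into a compact visual context
        # (collapsed to one token per view here purely for brevity), then project.
        visual_ctx = self.projector(feats.mean(dim=2))        # (B, V, D_lm)

        # The diffusion LM attends bidirectionally over
        # [visual context | prompt | masked response] and predicts the clean
        # response tokens at noise level t.
        return self.diffusion_lm(visual_ctx, prompt_ids, noisy_response_ids, t)
```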

Technical Design

LaViDa improves the efficiency and quality of diffusion-based vision-language modeling through complementary masking and a cacheable inference scheme.

Complementary Masking: To ensure all tokens contribute to training and improve gradient alignment with vision features, LaViDa uses two disjoint masked sequences per sample, boosting sample efficiency and encoder supervision.
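
A minimal sketch of this scheme, assuming a placeholder `MASK_ID` and a fixed 50/50 split (in training the masking ratio would follow the sampled diffusion timestep): the two masks are exact complements, so every response token is supervised in exactly one of the two copies.

```python
import torch

def complementary_masks(response_len, mask_ratio=0.5, device="cpu"):
    """Return two boolean masks over the response whose masked positions are
    disjoint and together cover every position."""
    perm = torch.randperm(response_len, device=device)
    cut = int(mask_ratio * response_len)
    mask_a = torch.zeros(response_len, dtype=torch.bool, device=device)
    mask_a[perm[:cut]] = True          # positions masked in copy A
    mask_b = ~mask_a                   # the complement: masked in copy B
    return mask_a, mask_b

# Usage: build two noisy copies of the same response and train on both,
# so no token is wasted and the vision encoder receives gradients from all of them.
MASK_ID = 0                            # placeholder mask-token id (assumption)
response = torch.tensor([11, 42, 7, 99, 3, 18])
m_a, m_b = complementary_masks(len(response))
noisy_a = torch.where(m_a, torch.full_like(response, MASK_ID), response)
noisy_b = torch.where(m_b, torch.full_like(response, MASK_ID), response)
```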

Prefix-DLM Inference: During inference, LaViDa gradually unmasks tokens in discrete steps while caching visual and prompt representations using a prefix-style attention mask, significantly speeding up decoding compared to standard diffusion models.
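
The decoding loop below sketches this scheme. `model.encode_prefix` and `model.denoise_step` are assumed interfaces (not LaViDa's released API), and the confidence-based unmasking rule is a generic discrete-diffusion sampler; LaViDa-specific details such as timestep shifting are omitted.

```python
import torch

@torch.no_grad()
def prefix_dlm_decode(model, visual_ctx, prompt_ids, gen_len, num_steps, mask_id):
    """Unmask a fully masked response over `num_steps` steps while reusing a
    KV cache computed once for the image + prompt prefix."""
    # 1) Encode the clean prefix once; its keys/values are reused at every step.
    prefix_cache = model.encode_prefix(visual_ctx, prompt_ids)

    # 2) Start from an all-[MASK] response canvas.
    tokens = torch.full((prompt_ids.shape[0], gen_len), mask_id,
                        dtype=torch.long, device=prompt_ids.device)

    tokens_per_step = max(1, gen_len // num_steps)
    for step in range(num_steps):
        # Predict every response position in parallel, conditioned on the cache.
        logits = model.denoise_step(tokens, prefix_cache, step)
        conf, pred = logits.softmax(-1).max(-1)

        # Only still-masked positions are candidates for unmasking.
        conf = conf.masked_fill(tokens != mask_id, float("-inf"))
        remaining = int((tokens == mask_id).sum(-1).max())
        k = remaining if step == num_steps - 1 else min(tokens_per_step, remaining)
        if k == 0:
            break

        # Commit the k most confident predictions this step.
        idx = conf.topk(k, dim=-1).indices
        tokens.scatter_(1, idx, pred.gather(1, idx))
    return tokens
```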

Highlights

LaViDa provides better control and faster decoding than autoregressive (AR) vision-language models through flexible generation and tunable speed–quality tradeoffs.

Controllable Generation: LaViDa variants accurately follow structural constraints (e.g., poem lines) and allow flexible token allocation, unlike AR models, which struggle to satisfy such constraints.
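
The mechanism behind this is text infilling over a partially fixed canvas: constraint tokens are pinned in place and only the blank positions are masked for the model to fill via bidirectional denoising. A small illustrative sketch with hypothetical token ids:

```python
import torch

MASK_ID = 0                               # placeholder mask-token id (assumption)

def build_constrained_canvas(template_ids, mask_id=MASK_ID):
    """Positions holding a real token stay clamped at every decoding step;
    positions equal to `mask_id` are left for the diffusion model to fill."""
    tokens = template_ids.clone()
    fixed = tokens != mask_id             # constraint positions, never resampled
    return tokens, fixed

# Example: a poem-completion template where each line's ending token is pinned
# and the rest is left blank for bidirectional infilling.
template = torch.tensor([MASK_ID, MASK_ID, MASK_ID, 1043,    # line 1 must end with token 1043
                         MASK_ID, MASK_ID, MASK_ID, 2871])   # line 2 must end with token 2871
canvas, fixed = build_constrained_canvas(template)
# During decoding, the `fixed` positions are copied back after every step, so
# the final output is guaranteed to respect the structural constraint.
```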

Speed–Quality Tradeoff: By adjusting the number of diffusion steps, LaViDa enables dynamic balancing between inference latency and output quality on tasks like image captioning.
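
As a rough illustration of this knob, with a fixed answer length the step count directly sets how many tokens are committed per step (reusing the hypothetical `prefix_dlm_decode` sketch above); fewer steps trade output quality for latency.

```python
# Fewer steps -> more tokens unmasked per step -> faster but coarser output.
gen_len = 64
for num_steps in (8, 16, 32, 64):
    print(f"{num_steps:>2} steps -> {gen_len // num_steps} tokens per step")
    # out = prefix_dlm_decode(model, visual_ctx, prompt_ids,
    #                         gen_len=gen_len, num_steps=num_steps, mask_id=MASK_ID)
```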

More Use Cases

LaViDa offers unparalleled flexibility in controllable generation. We showcase creative applications such as structured data extraction, text editing, and script writing.