Abstract
Research on DLMs explores their scaling behavior under different noise types, revealing that uniform diffusion is more data-efficient but requires more parameters for compute-efficient training compared to masked diffusion.
Modern LLM pre-training consumes vast amounts of compute and training data, making the scaling behavior, or scaling laws, of different models a key distinguishing factor. Discrete diffusion language models (DLMs) have been proposed as an alternative to autoregressive language models (ALMs). However, their scaling behavior has not yet been fully explored, with prior work suggesting that they require more data and compute to match the performance of ALMs. We study the scaling behavior of DLMs under different noise types by smoothly interpolating between masked and uniform diffusion while paying close attention to crucial hyperparameters such as batch size and learning rate. Our experiments reveal that the scaling behavior of DLMs strongly depends on the noise type and differs considerably from that of ALMs. While all noise types converge to similar loss values in compute-bound scaling, we find that uniform diffusion requires more parameters and less data for compute-efficient training compared to masked diffusion, making it a promising candidate in data-bound settings. We scale our uniform diffusion model up to 10B parameters trained for 10^22 FLOPs, confirming the predicted scaling behavior and making it the largest publicly known uniform diffusion model to date.
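The abstract's central knob, interpolating between masked and uniform diffusion noise, can be made concrete with a short sketch. The function below is a hypothetical forward-noising step, not the paper's actual parameterization: the mixing weight `lam`, the schedule value `alpha_t`, and the per-token replacement rule are assumptions used only for illustration (a D3PM-style forward process where each corrupted position goes either to a `[MASK]` token or to a uniformly random token).

```python
# Hypothetical sketch: forward noising that interpolates between masked
# (absorbing-state) and uniform diffusion. All names and the mixing rule are
# illustrative assumptions, not the paper's parameterization.
import torch

def corrupt(tokens: torch.Tensor, alpha_t: float, lam: float,
            vocab_size: int, mask_id: int) -> torch.Tensor:
    """Corrupt a LongTensor of token ids at noise level 1 - alpha_t.

    alpha_t: probability of keeping the original token (from the noise schedule).
    lam:     0.0 -> pure masked diffusion, 1.0 -> pure uniform diffusion.
    """
    noisy = tokens.clone()
    # Decide per position whether it gets corrupted at this noise level.
    corrupted = torch.rand_like(tokens, dtype=torch.float) > alpha_t
    # Among corrupted positions, replace with a uniformly random token with
    # probability lam, otherwise with the [MASK] token.
    use_uniform = torch.rand_like(tokens, dtype=torch.float) < lam
    random_ids = torch.randint(0, vocab_size, tokens.shape, device=tokens.device)
    noisy[corrupted & use_uniform] = random_ids[corrupted & use_uniform]
    noisy[corrupted & ~use_uniform] = mask_id
    return noisy

# Example: corrupt a toy batch at a mid-schedule noise level, halfway between
# masked and uniform noise.
x = torch.randint(0, 1000, (2, 16))
x_t = corrupt(x, alpha_t=0.5, lam=0.5, vocab_size=1000, mask_id=1000)
```

As a rough sanity check on the reported budget: under the common C ≈ 6ND approximation, a 10B-parameter model trained for 10^22 FLOPs sees on the order of 1.7 × 10^11 (roughly 170B) training tokens.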
Community
We scale diffusion language models up to 3B parameters (masked and uniform diffusion) and 10B parameters (uniform diffusion), pre-trained with a pure diffusion objective (a mixture of unconditional and conditional examples) on Nemotron-CC.
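A minimal sketch of how a batch mixing unconditional and conditional diffusion examples might be formed; the 50/50 split, the fixed `prompt_len`, and the idea of keeping the prompt prefix uncorrupted in the conditional case are illustrative assumptions, not details taken from the paper or the released models.

```python
# Hypothetical sketch of a mixed unconditional/conditional diffusion batch.
import torch

def build_noisy_batch(tokens, prompt_len, corrupt_fn, p_conditional=0.5):
    """tokens: (batch, seq) LongTensor; corrupt_fn: any forward-noising function."""
    noisy = corrupt_fn(tokens)
    # Decide per example whether it is treated as conditional (prompt given)
    # or unconditional (everything noised).
    conditional = torch.rand(tokens.size(0)) < p_conditional
    # For conditional examples, restore the clean prompt prefix so the model
    # learns to denoise the continuation given uncorrupted context.
    noisy[conditional, :prompt_len] = tokens[conditional, :prompt_len]
    return noisy, conditional

# Example usage with a toy corruption function that randomly masks tokens.
mask_id = 1000
toy_corrupt = lambda t: torch.where(torch.rand_like(t, dtype=torch.float) < 0.5,
                                    torch.full_like(t, mask_id), t)
x = torch.randint(0, 1000, (4, 32))
x_t, is_cond = build_noisy_batch(x, prompt_len=8, corrupt_fn=toy_corrupt)
```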
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Diffusion Language Models are Super Data Learners (2025)
- Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training (2025)
- Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective (2025)
- Variational Masked Diffusion Models (2025)
- Guided Transfer Learning for Discrete Diffusion Models (2025)
- CDLM: Consistency Diffusion Language Models For Faster Sampling (2025)
- Simple Denoising Diffusion Language Models (2025)