Abstract
Research on DLMs explores their scaling behavior under different noise types, revealing that uniform diffusion is more data-efficient but requires more parameters for compute-efficient training compared to masked diffusion.
Modern LLM pre-training consumes vast amounts of compute and training data, making the scaling behavior, or scaling laws, of different models a key distinguishing factor. Discrete diffusion language models (DLMs) have been proposed as an alternative to autoregressive language models (ALMs). However, their scaling behavior has not yet been fully explored, with prior work suggesting that they require more data and compute to match the performance of ALMs. We study the scaling behavior of DLMs under different noise types by smoothly interpolating between masked and uniform diffusion while paying close attention to crucial hyperparameters such as batch size and learning rate. Our experiments reveal that the scaling behavior of DLMs strongly depends on the noise type and differs considerably from that of ALMs. While all noise types converge to similar loss values in compute-bound scaling, we find that uniform diffusion requires more parameters and less data for compute-efficient training compared to masked diffusion, making it a promising candidate in data-bound settings. We scale our uniform diffusion model up to 10B parameters trained for 10^22 FLOPs, confirming the predicted scaling behavior and making it the largest publicly known uniform diffusion model to date.
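The abstract's central knob, interpolating between masked and uniform diffusion noise, can be made concrete with a short sketch. The function below is a hypothetical forward-noising step, not the paper's actual parameterization: the mixing weight `lam`, the schedule value `alpha_t`, and the per-token replacement rule are assumptions used only for illustration (a D3PM-style forward process where each corrupted position goes either to a `[MASK]` token or to a uniformly random token).

```python
# Hypothetical sketch: forward noising that interpolates between masked
# (absorbing-state) and uniform diffusion. All names and the mixing rule are
# illustrative assumptions, not the paper's parameterization.
import torch

def corrupt(tokens: torch.Tensor, alpha_t: float, lam: float,
            vocab_size: int, mask_id: int) -> torch.Tensor:
    """Corrupt a LongTensor of token ids at noise level 1 - alpha_t.

    alpha_t: probability of keeping the original token (from the noise schedule).
    lam:     0.0 -> pure masked diffusion, 1.0 -> pure uniform diffusion.
    """
    noisy = tokens.clone()
    # Decide per position whether it gets corrupted at this noise level.
    corrupted = torch.rand_like(tokens, dtype=torch.float) > alpha_t
    # Among corrupted positions, replace with a uniformly random token with
    # probability lam, otherwise with the [MASK] token.
    use_uniform = torch.rand_like(tokens, dtype=torch.float) < lam
    random_ids = torch.randint(0, vocab_size, tokens.shape, device=tokens.device)
    noisy[corrupted & use_uniform] = random_ids[corrupted & use_uniform]
    noisy[corrupted & ~use_uniform] = mask_id
    return noisy

# Example: corrupt a toy batch at a mid-schedule noise level, halfway between
# masked and uniform noise.
x = torch.randint(0, 1000, (2, 16))
x_t = corrupt(x, alpha_t=0.5, lam=0.5, vocab_size=1000, mask_id=1000)
```

As a rough sanity check on the reported budget: under the common C ≈ 6ND approximation, a 10B-parameter model trained for 10^22 FLOPs sees on the order of 1.7 × 10^11 (roughly 170B) training tokens.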
Community
We scale diffusion language models up to 3B parameters (masked and uniform diffusion) and 10B parameters (uniform diffusion), pre-trained with a pure diffusion objective (a mixture of unconditional and conditional examples) on Nemotron-CC.
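A minimal sketch of how a batch mixing unconditional and conditional diffusion examples might be formed; the 50/50 split, the fixed `prompt_len`, and the idea of keeping the prompt prefix uncorrupted in the conditional case are illustrative assumptions, not details taken from the paper or the released models.

```python
# Hypothetical sketch of a mixed unconditional/conditional diffusion batch.
import torch

def build_noisy_batch(tokens, prompt_len, corrupt_fn, p_conditional=0.5):
    """tokens: (batch, seq) LongTensor; corrupt_fn: any forward-noising function."""
    noisy = corrupt_fn(tokens)
    # Decide per example whether it is treated as conditional (prompt given)
    # or unconditional (everything noised).
    conditional = torch.rand(tokens.size(0)) < p_conditional
    # For conditional examples, restore the clean prompt prefix so the model
    # learns to denoise the continuation given uncorrupted context.
    noisy[conditional, :prompt_len] = tokens[conditional, :prompt_len]
    return noisy, conditional

# Example usage with a toy corruption function that randomly masks tokens.
mask_id = 1000
toy_corrupt = lambda t: torch.where(torch.rand_like(t, dtype=torch.float) < 0.5,
                                    torch.full_like(t, mask_id), t)
x = torch.randint(0, 1000, (4, 32))
x_t, is_cond = build_noisy_batch(x, prompt_len=8, corrupt_fn=toy_corrupt)
```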
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Diffusion Language Models are Super Data Learners (2025)
- Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training (2025)
- Principled RL for Diffusion LLMs Emerges from a Sequence-Level Perspective (2025)
- Variational Masked Diffusion Models (2025)
- Guided Transfer Learning for Discrete Diffusion Models (2025)
- CDLM: Consistency Diffusion Language Models For Faster Sampling (2025)
- Simple Denoising Diffusion Language Models (2025)