|
|
--- |
|
|
license: mit |
|
|
tags: |
|
|
- text-to-image |
|
|
- diffusion |
|
|
- multi-expert |
|
|
- dit |
|
|
- laion |
|
|
- distributed |
|
|
- decentralized |
|
|
- flow-matching |
|
|
--- |
|
|
|
|
|
<img src="images/bagel_labs_logo.png" alt="Bagel Labs" height="28" style="margin-bottom: 20px;"/> |
|
|
|
|
|
<h1 style="font-size: 28px; margin-bottom: 20px;">Paris: A Decentralized Trained Open-Weight Diffusion Model</h1> |
|
|
|
|
|
<a href="https://huggingface.co/bageldotcom/paris" target="_blank"> |
|
|
<img src="https://img.shields.io/badge/🤗_DOWNLOAD_MODEL_WEIGHTS-FFD21E?style=for-the-badge&logoColor=000000" alt="Download Model Weights" height="40"> |
|
|
</a> |
|
|
<a href="https://github.com/bageldotcom/paris" target="_blank"> |
|
|
<img src="https://img.shields.io/badge/⭐_STAR_ON_GITHUB-100000?style=for-the-badge&logo=github&logoColor=white" alt="Star on GitHub" height="40"> |
|
|
</a> |
|
|
<a href="https://github.com/bageldotcom/paris/blob/main/paper.pdf" target="_blank"> |
|
|
<img src="https://img.shields.io/badge/📄_READ_PAPER-FF6B6B?style=for-the-badge&logoColor=white" alt="Read Technical Report" height="40"> |
|
|
</a> |
|
|
|
|
|
<div style="margin-top: 20px;"></div> |
|
|
|
|
|
The world's first open-weight diffusion model trained entirely through decentralized computation. Paris consists of 8 expert diffusion models (129M–605M parameters each) trained in complete isolation, with no gradient, parameter, or intermediate-activation synchronization. This zero-communication design sidesteps the parallelization bottlenecks of conventional distributed training while using 14× less data and 16× less compute than prior decentralized baselines. [Read our technical report](https://github.com/bageldotcom/paris/blob/main/paper.pdf) to learn more.
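
For a quick start, the released weights can be fetched directly from the Hugging Face Hub. The snippet below is a minimal sketch using `huggingface_hub`; only the repo id comes from this card, and nothing is assumed about the file layout inside the repository.

```python
# Minimal sketch: download the released Paris weights from the Hugging Face Hub.
# Only the repo id is taken from this card; the snapshot contents are whatever
# the repository actually ships.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="bageldotcom/paris")
print(f"Paris weights downloaded to: {local_dir}")
```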
|
|
|
|
|
# Key Characteristics |
|
|
|
|
|
- 8 independently trained expert diffusion models (605M parameters each, 4.84B total) |
|
|
- No gradient synchronization, parameter sharing, or activation exchange among nodes during training |
|
|
- Lightweight transformer router (~129M parameters) for dynamic expert selection |
|
|
- 11M LAION-Aesthetic images across 120 A40 GPU-days |
|
|
- 14× less training data than prior decentralized baselines |
|
|
- 16× less compute than prior decentralized baselines |
|
|
- Competitive generation quality (FID 12.45 with the DiT-XL/2 experts)
|
|
- Open weights for research and commercial use under MIT license |
|
|
|
|
|
--- |
|
|
|
|
|
# Examples |
|
|
|
|
|
 |
|
|
|
|
|
*Text-conditioned image generation samples using Paris across diverse prompts and visual styles* |
|
|
|
|
|
--- |
|
|
|
|
|
# Architecture Details |
|
|
|
|
|
| Component | Specification | |
|
|
|-----------|--------------| |
|
|
| **Model Scale** | DiT-XL/2 | |
|
|
| **Parameters per Expert** | 605M | |
|
|
| **Total Expert Parameters** | 4.84B (8 experts) | |
|
|
| **Router Parameters** | ~129M | |
|
|
| **Hidden Dimensions** | 1152 | |
|
|
| **Transformer Layers** | 28 | |
|
|
| **Attention Heads** | 16 | |
|
|
| **Patch Size** | 2×2 (latent space) | |
|
|
| **Latent Resolution** | 32×32×4 | |
|
|
| **Image Resolution** | 256×256 | |
|
|
| **Text Conditioning** | CLIP ViT-L/14 | |
|
|
| **VAE** | sd-vae-ft-mse (8× downsampling) | |
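
For reference, the table above maps onto a simple configuration object, summarized in the sketch below. The class and field names are illustrative only and do not reflect the repository's actual code; the CLIP and VAE Hub ids are the standard public checkpoints for the components named in the table.

```python
from dataclasses import dataclass

@dataclass
class ParisExpertConfig:
    """Illustrative summary of the DiT-XL/2 expert settings from the table above."""
    hidden_dim: int = 1152                 # transformer width
    num_layers: int = 28                   # transformer depth
    num_heads: int = 16                    # attention heads
    patch_size: int = 2                    # 2x2 patches over the latent grid
    latent_shape: tuple = (4, 32, 32)      # VAE latent: channels x height x width
    image_resolution: int = 256            # pixel-space output size
    text_encoder: str = "openai/clip-vit-large-patch14"   # CLIP ViT-L/14
    vae: str = "stabilityai/sd-vae-ft-mse"                 # 8x downsampling VAE
    num_experts: int = 8                   # independently trained experts
    params_per_expert: str = "605M"        # 4.84B total across experts
```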
|
|
|
|
|
--- |
|
|
|
|
|
# Training Approach |
|
|
|
|
|
Paris implements fully decentralized training where: |
|
|
|
|
|
- Each expert trains independently on a semantically coherent data partition (DINOv2-based clustering) |
|
|
- No gradient synchronization, parameter sharing, or activation exchange between experts during training |
|
|
- Experts trained asynchronously across AWS, GCP, local clusters, and Runpod instances at different speeds |
|
|
- Router trained post-hoc on full dataset for expert selection during inference |
|
|
- Complete computational independence eliminates requirements for specialized interconnects (InfiniBand, NVLink) |
|
|
|
|
|
 |
|
|
|
|
|
*Paris training phase showing complete asynchronous isolation across heterogeneous compute clusters. Unlike traditional parallelization strategies (Data/Pipeline/Model Parallelism), Paris requires zero communication during training.* |
|
|
|
|
|
This zero-communication approach enables training on fragmented compute resources without specialized interconnects, eliminating the dedicated GPU cluster requirement of traditional diffusion model training. |
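
As a rough illustration of the data-partitioning step described above, the sketch below embeds images with DINOv2, clusters the embeddings with k-means, and assigns one cluster per expert. The model id, the use of the CLS token, and the k-means settings are assumptions for illustration, not the exact pipeline used for Paris.

```python
# Assumed sketch of DINOv2-based semantic partitioning: embed images, cluster
# the embeddings, and give each expert one cluster to train on in isolation.
import glob
import torch
from PIL import Image
from sklearn.cluster import KMeans
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
dinov2 = AutoModel.from_pretrained("facebook/dinov2-base").eval()

@torch.no_grad()
def embed(paths):
    feats = []
    for path in paths:
        inputs = processor(images=Image.open(path).convert("RGB"), return_tensors="pt")
        feats.append(dinov2(**inputs).last_hidden_state[:, 0])  # CLS token as image feature
    return torch.cat(feats).numpy()

image_paths = sorted(glob.glob("laion_subset/*.jpg"))  # placeholder path to a local image shard
features = embed(image_paths)
cluster_ids = KMeans(n_clusters=8, random_state=0).fit_predict(features)  # one cluster per expert
# Each expert then trains only on its own cluster, with no cross-node communication.
```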
|
|
|
|
|
**Comparison with Traditional Parallelization** |
|
|
|
|
|
| **Strategy** | **Synchronization** | **Straggler Impact** | **Topology Requirements** | |
|
|
|--------------|---------------------|---------------------|---------------------------| |
|
|
| Data Parallel | Periodic all-reduce | Slowest worker blocks iteration | Latency-sensitive cluster | |
|
|
| Model Parallel | Sequential layer transfers | Slowest layer blocks pipeline | Linear pipeline | |
|
|
| Pipeline Parallel | Stage-to-stage per microbatch | Bubble overhead from slowest stage | Linear pipeline | |
|
|
| **Paris** | **No synchronization** | **No blocking** | **Arbitrary** | |
|
|
|
|
|
--- |
|
|
|
|
|
|
|
|
# Routing Strategies
|
|
|
|
|
- **`top-1`** (default): Single best expert per step. Fastest inference, competitive quality. |
|
|
- **`top-2`**: Weighted ensemble of top-2 experts. Often best quality, 2× inference cost. |
|
|
- **`full-ensemble`**: All 8 experts weighted by router. Highest compute (8× cost). |
|
|
|
|
|
 |
|
|
|
|
|
*Multi-expert inference pipeline showing router-based expert selection and three different routing strategies: Top-1 (fastest), Top-2 (best quality), and Full Ensemble (highest compute).* |
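
The three strategies above differ only in how many router-ranked experts contribute to each denoising step. The function below is a schematic of that selection logic; `router_logits`, `expert_outputs`, and the shapes involved are hypothetical placeholders rather than the repository's API, and in practice only the selected experts would actually be run.

```python
import torch

def combine_experts(router_logits: torch.Tensor, expert_outputs: torch.Tensor,
                    strategy: str = "top-1") -> torch.Tensor:
    """Blend per-expert noise predictions for one denoising step.

    router_logits:  (num_experts,) router scores for the current input.
    expert_outputs: (num_experts, C, H, W) stacked expert predictions.
    """
    probs = torch.softmax(router_logits, dim=-1)
    if strategy == "top-1":
        return expert_outputs[probs.argmax()]          # single best expert (1x cost)
    if strategy == "top-2":
        w, idx = probs.topk(2)                         # two best experts (2x cost)
        w = w / w.sum()
        return (w[:, None, None, None] * expert_outputs[idx]).sum(dim=0)
    return (probs[:, None, None, None] * expert_outputs).sum(dim=0)  # full ensemble (8x cost)

# Example with random tensors standing in for real router scores and expert outputs.
logits, outputs = torch.randn(8), torch.randn(8, 4, 32, 32)
sample = combine_experts(logits, outputs, strategy="top-2")
```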
|
|
|
|
|
--- |
|
|
|
|
|
# Performance Metrics |
|
|
|
|
|
**Multi-Expert vs. Monolithic on LAION-Art (DiT-B/2)** |
|
|
|
|
|
| **Inference Strategy** | **FID-50K ↓** | |
|
|
|------------------------|---------------| |
|
|
| Monolithic (single model) | 29.64 | |
|
|
| Paris Top-1 | 30.60 | |
|
|
| **Paris Top-2** | **22.60** | |
|
|
| Paris Full Ensemble | 47.89 | |
|
|
|
|
|
*Top-2 routing achieves 7.04 FID improvement over monolithic baseline, validating that targeted expert collaboration outperforms both single models and naive ensemble averaging.* |
|
|
|
|
|
--- |
|
|
|
|
|
# Training Details |
|
|
|
|
|
**Hyperparameters (DiT-XL/2)** |
|
|
|
|
|
| **Parameter** | **Value** | |
|
|
|---------------|-----------| |
|
|
| Dataset | LAION-Aesthetic (11M images) | |
|
|
| Clustering | DINOv2 semantic features | |
|
|
| Batch Size | 16 per expert (effective 32 with 2-step accumulation) | |
|
|
| Learning Rate | 2e-5 (AdamW, no scheduling) | |
|
|
| Training Steps | ~120k total across experts (asynchronous) | |
|
|
| EMA Decay | 0.9999 | |
|
|
| Mixed Precision | FP16 with automatic loss scaling | |
|
|
| Conditioning | AdaLN-Single (23% parameter reduction) | |
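
Putting the expert-side hyperparameters together, one training step looks roughly like the sketch below (assuming a CUDA device). The model, dataloader, and loss are stand-ins; only the optimizer choice, learning rate, gradient accumulation, EMA decay, and mixed-precision settings come from the table above.

```python
import copy
import torch

# Stand-ins so the sketch is self-contained; the real `expert` is a DiT-XL/2
# model and `diffusion_loss` is the actual diffusion / flow-matching objective.
expert = torch.nn.Linear(4 * 32 * 32, 4 * 32 * 32).cuda()
ema_expert = copy.deepcopy(expert)
def diffusion_loss(model, latents, text_emb):
    return model(latents.flatten(1)).pow(2).mean()
loader = [(torch.randn(16, 4, 32, 32).cuda(), torch.randn(16, 768).cuda()) for _ in range(4)]

optimizer = torch.optim.AdamW(expert.parameters(), lr=2e-5)        # AdamW, no LR scheduling
scaler = torch.cuda.amp.GradScaler()                               # FP16 automatic loss scaling
ACCUM_STEPS, EMA_DECAY = 2, 0.9999                                 # effective batch 32, EMA 0.9999

for step, (latents, text_emb) in enumerate(loader):                # batch size 16 per expert
    with torch.cuda.amp.autocast():
        loss = diffusion_loss(expert, latents, text_emb) / ACCUM_STEPS
    scaler.scale(loss).backward()
    if (step + 1) % ACCUM_STEPS == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
        with torch.no_grad():                                      # EMA update of expert weights
            for p_ema, p in zip(ema_expert.parameters(), expert.parameters()):
                p_ema.mul_(EMA_DECAY).add_(p, alpha=1 - EMA_DECAY)
```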
|
|
|
|
|
**Router Training** |
|
|
|
|
|
| **Parameter** | **Value** | |
|
|
|---------------|-----------| |
|
|
| Architecture | DiT-B (smaller than experts) | |
|
|
| Batch Size | 64 with 4-step accumulation (effective 256) | |
|
|
| Learning Rate | 5e-5 with cosine annealing (25 epochs) | |
|
|
| Loss | Cross-entropy on cluster assignments | |
|
|
| Training | Post-hoc on full dataset | |
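
Because the router is trained post-hoc as a classifier over cluster ids, its loop is plain cross-entropy with a cosine schedule, as sketched below. The tiny linear backbone, synthetic data, and AdamW optimizer choice are placeholders and assumptions; only the batch size, learning rate, schedule, and loss come from the table above.

```python
import torch

NUM_EXPERTS, EPOCHS = 8, 25

# Placeholder backbone and data; the real router is a DiT-B-scale transformer
# trained to predict each training sample's DINOv2 cluster assignment.
router = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(4 * 32 * 32, NUM_EXPERTS))
loader = [(torch.randn(64, 4, 32, 32), torch.randint(0, NUM_EXPERTS, (64,))) for _ in range(8)]

optimizer = torch.optim.AdamW(router.parameters(), lr=5e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)  # cosine over 25 epochs
criterion = torch.nn.CrossEntropyLoss()                 # cross-entropy on cluster assignments

for epoch in range(EPOCHS):
    for latents, cluster_ids in loader:                 # batch 64 (accumulated to 256 in practice)
        loss = criterion(router(latents), cluster_ids)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    scheduler.step()
```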
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
# Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{jiang2025paris, |
|
|
title={Paris: A Decentralized Trained Open-Weight Diffusion Model}, |
|
|
author={Jiang, Zhiying and Seraj, Raihan and Villagra, Marcos and Roy, Bidhan}, |
|
|
year={2025}, |
|
|
eprint={2510.03434}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.GR}, |
|
|
url={https://arxiv.org/abs/2510.03434} |
|
|
} |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
# License |
|
|
|
|
|
MIT License – Open for research and commercial use. |
|
|
|
|
|
Made with ❤️ by <a href="https://twitter.com/bageldotcom" target="_blank"><img src="https://img.shields.io/badge/Bagel_Labs-1DA1F2?style=for-the-badge&logo=twitter&logoColor=white" alt="Follow Bagel Labs on Twitter" height="28"></a> |