---
license: mit
tags:
- text-to-image
- diffusion
- multi-expert
- dit
- laion
- distributed
- decentralized
- flow-matching
---
<img src="images/bagel_labs_logo.png" alt="Bagel Labs" height="28" style="margin-bottom: 20px;"/>
<h1 style="font-size: 28px; margin-bottom: 20px;">Paris: A Decentralized Trained Open-Weight Diffusion Model</h1>
<a href="https://huggingface.co/bageldotcom/paris" target="_blank">
<img src="https://img.shields.io/badge/🤗_DOWNLOAD_MODEL_WEIGHTS-FFD21E?style=for-the-badge&logoColor=000000" alt="Download Model Weights" height="40">
</a>
<a href="https://github.com/bageldotcom/paris" target="_blank">
<img src="https://img.shields.io/badge/⭐_STAR_ON_GITHUB-100000?style=for-the-badge&logo=github&logoColor=white" alt="Star on GitHub" height="40">
</a>
<a href="https://github.com/bageldotcom/paris/blob/main/paper.pdf" target="_blank">
<img src="https://img.shields.io/badge/📄_READ_PAPER-FF6B6B?style=for-the-badge&logoColor=white" alt="Read Technical Report" height="40">
</a>
<div style="margin-top: 20px;"></div>
The world's first open-weight diffusion model trained entirely through decentralized computation. Paris consists of 8 expert diffusion models (129M–605M parameters each) trained in complete isolation, with no synchronization of gradients, parameters, or intermediate activations, achieving better parallelism efficiency than traditional distributed training while using 14× less data and 16× less compute than prior baselines. [Read our technical report](https://github.com/bageldotcom/paris/blob/main/paper.pdf) to learn more.
# Key Characteristics
- 8 independently trained expert diffusion models (605M parameters each, 4.84B total)
- No gradient synchronization, parameter sharing, or activation exchange among nodes during training
- Lightweight transformer router (~129M parameters) for dynamic expert selection
- Trained on 11M LAION-Aesthetic images in ~120 A40 GPU-days
- 14× less training data than prior decentralized baselines
- 16× less compute than prior decentralized baselines
- Competitive generation quality (FID 12.45 with DiT-XL/2 experts)
- Open weights for research and commercial use under MIT license
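
The released checkpoints can be fetched directly from this repository with `huggingface_hub`. A minimal sketch is below; the file layout inside the repo (expert and router checkpoint names) is an assumption, so inspect the downloaded directory before loading anything.

```python
# Minimal sketch: download the Paris checkpoints from the Hub.
# Assumption: expert and router weights live as files inside this repo;
# check the downloaded directory for the exact layout.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="bageldotcom/paris")
print(local_dir)  # local path containing the downloaded checkpoint files
```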
---
# Examples
![Paris Generation Examples](images/generated_images.png)
*Text-conditioned image generation samples using Paris across diverse prompts and visual styles*
---
# Architecture Details
| Component | Specification |
|-----------|--------------|
| **Model Scale** | DiT-XL/2 |
| **Parameters per Expert** | 605M |
| **Total Expert Parameters** | 4.84B (8 experts) |
| **Router Parameters** | ~129M |
| **Hidden Dimensions** | 1152 |
| **Transformer Layers** | 28 |
| **Attention Heads** | 16 |
| **Patch Size** | 2×2 (latent space) |
| **Latent Resolution** | 32×32×4 |
| **Image Resolution** | 256×256 |
| **Text Conditioning** | CLIP ViT-L/14 |
| **VAE** | sd-vae-ft-mse (8× downsampling) |
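
The VAE and text encoder in the table are standard public checkpoints, so the latent and conditioning stack can be sketched with `diffusers` and `transformers`. This illustrates only the 256×256 → 32×32×4 encoding and the CLIP ViT-L/14 conditioning; the Paris expert and router modules are not shown, and their loading code is repository-specific.

```python
# Sketch of the latent/conditioning stack implied by the table (experts not shown).
import torch
from diffusers import AutoencoderKL
from transformers import CLIPTextModel, CLIPTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# VAE: sd-vae-ft-mse with 8x spatial downsampling (256x256 RGB -> 32x32x4 latents)
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").to(device)

# Text conditioning: CLIP ViT-L/14
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(device)

prompt = "a watercolor painting of the Eiffel Tower at dawn"
tokens = tokenizer(prompt, padding="max_length", max_length=77,
                   truncation=True, return_tensors="pt").to(device)
text_emb = text_encoder(**tokens).last_hidden_state          # (1, 77, 768)

# Encode an image batch into the latent space the experts operate in.
images = torch.randn(1, 3, 256, 256, device=device)          # placeholder; real inputs are normalized to [-1, 1]
latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor
print(latents.shape)                                          # torch.Size([1, 4, 32, 32])
```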
---
# Training Approach
Paris implements fully decentralized training where:
- Each expert trains independently on a semantically coherent data partition produced by DINOv2-based clustering (see the sketch below)
- No gradient synchronization, parameter sharing, or activation exchange between experts during training
- Experts trained asynchronously across AWS, GCP, local clusters, and Runpod instances at different speeds
- Router trained post-hoc on full dataset for expert selection during inference
- Complete computational independence eliminates requirements for specialized interconnects (InfiniBand, NVLink)
![Training Architecture](images/training_architecture.png)
*Paris training phase showing complete asynchronous isolation across heterogeneous compute clusters. Unlike traditional parallelization strategies (Data/Pipeline/Model Parallelism), Paris requires zero communication during training.*
This zero-communication approach enables training on fragmented compute resources, eliminating the dedicated, tightly interconnected GPU cluster that traditional diffusion model training requires.
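
A rough sketch of the data-partitioning step described above, assuming off-the-shelf DINOv2 features from `transformers` and k-means from scikit-learn; the paper's exact feature-extraction and clustering settings may differ.

```python
# Hypothetical sketch: split the dataset into 8 semantic clusters with DINOv2 features.
import torch
from transformers import AutoImageProcessor, AutoModel
from sklearn.cluster import KMeans

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
dino = AutoModel.from_pretrained("facebook/dinov2-base").eval()

@torch.no_grad()
def embed(pil_images):
    """Return one DINOv2 [CLS] embedding per image."""
    inputs = processor(images=pil_images, return_tensors="pt")
    return dino(**inputs).last_hidden_state[:, 0]             # (N, 768)

# `dataset_images` is a placeholder for the image collection, embedded in batches.
features = embed(dataset_images)
labels = KMeans(n_clusters=8, random_state=0).fit_predict(features.numpy())

# Expert k then trains only on the images with labels == k,
# with no communication with the other experts.
```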
**Comparison with Traditional Parallelization**
| **Strategy** | **Synchronization** | **Straggler Impact** | **Topology Requirements** |
|--------------|---------------------|---------------------|---------------------------|
| Data Parallel | Periodic all-reduce | Slowest worker blocks iteration | Latency-sensitive cluster |
| Model Parallel | Sequential layer transfers | Slowest layer blocks pipeline | Linear pipeline |
| Pipeline Parallel | Stage-to-stage per microbatch | Bubble overhead from slowest stage | Linear pipeline |
| **Paris** | **No synchronization** | **No blocking** | **Arbitrary** |
---
# Routing Strategies
- **`top-1`** (default): Single best expert per step. Fastest inference, competitive quality.
- **`top-2`**: Weighted ensemble of top-2 experts. Often best quality, 2× inference cost.
- **`full-ensemble`**: All 8 experts weighted by router. Highest compute (8× cost).
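
The three strategies differ only in how expert predictions are combined at each denoising step. A toy sketch is below; `experts`, `router_logits`, and the call signature `expert(x_t, t, text_emb)` are illustrative stand-ins, not the repository API.

```python
# Illustrative routing sketch (names and call signatures are assumptions).
import torch
import torch.nn.functional as F

def route_prediction(experts, router_logits, x_t, t, text_emb, strategy="top-1"):
    """Combine expert denoising predictions according to the routing strategy."""
    probs = F.softmax(router_logits, dim=-1)                  # (num_experts,)

    if strategy == "top-1":                                    # single best expert
        k = int(torch.argmax(probs))
        return experts[k](x_t, t, text_emb)

    if strategy == "top-2":                                    # weighted pair of experts
        vals, idx = torch.topk(probs, k=2)
        w = vals / vals.sum()                                  # renormalize the two weights
        return sum(w[i] * experts[int(idx[i])](x_t, t, text_emb) for i in range(2))

    # full-ensemble: all experts, weighted by the router distribution
    return sum(probs[k] * expert(x_t, t, text_emb) for k, expert in enumerate(experts))
```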
![Paris Inference Pipeline](images/paris_inference.png)
*Multi-expert inference pipeline showing router-based expert selection and three different routing strategies: Top-1 (fastest), Top-2 (best quality), and Full Ensemble (highest compute).*
---
# Performance Metrics
**Multi-Expert vs. Monolithic on LAION-Art (DiT-B/2)**
| **Inference Strategy** | **FID-50K ↓** |
|------------------------|---------------|
| Monolithic (single model) | 29.64 |
| Paris Top-1 | 30.60 |
| **Paris Top-2** | **22.60** |
| Paris Full Ensemble | 47.89 |
*Top-2 routing achieves a 7.04-point FID improvement over the monolithic baseline, validating that targeted expert collaboration outperforms both a single model and naive ensemble averaging.*
---
# Training Details
**Hyperparameters (DiT-XL/2)**
| **Parameter** | **Value** |
|---------------|-----------|
| Dataset | LAION-Aesthetic (11M images) |
| Clustering | DINOv2 semantic features |
| Batch Size | 16 per expert (effective 32 with 2-step accumulation) |
| Learning Rate | 2e-5 (AdamW, no scheduling) |
| Training Steps | ~120k total across experts (asynchronous) |
| EMA Decay | 0.9999 |
| Mixed Precision | FP16 with automatic loss scaling |
| Conditioning | AdaLN-Single (23% parameter reduction) |
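
As a concrete reading of the table, here is a condensed per-expert training step: AdamW at 2e-5 with no scheduler, FP16 with loss scaling, 2-step gradient accumulation, and an EMA shadow copy with decay 0.9999. `expert`, `ema_expert`, `diffusion_loss`, and `loader` are placeholders for the repository's actual modules and data pipeline.

```python
# Hypothetical per-expert training step matching the hyperparameter table above.
import torch

opt = torch.optim.AdamW(expert.parameters(), lr=2e-5)         # no LR scheduling
scaler = torch.cuda.amp.GradScaler()                          # FP16 automatic loss scaling
ACCUM_STEPS = 2                                               # batch 16 -> effective batch 32

for step, (latents, text_emb) in enumerate(loader):           # this expert's cluster only
    with torch.cuda.amp.autocast():
        loss = diffusion_loss(expert, latents, text_emb) / ACCUM_STEPS
    scaler.scale(loss).backward()

    if (step + 1) % ACCUM_STEPS == 0:
        scaler.step(opt)
        scaler.update()
        opt.zero_grad(set_to_none=True)

        # EMA update (decay 0.9999) on the shadow copy used for sampling
        with torch.no_grad():
            for p, ema_p in zip(expert.parameters(), ema_expert.parameters()):
                ema_p.mul_(0.9999).add_(p, alpha=1 - 0.9999)
```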
**Router Training**
| **Parameter** | **Value** |
|---------------|-----------|
| Architecture | DiT-B (smaller than experts) |
| Batch Size | 64 with 4-step accumulation (effective 256) |
| Learning Rate | 5e-5 with cosine annealing (25 epochs) |
| Loss | Cross-entropy on cluster assignments |
| Training | Post-hoc on full dataset |
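
A minimal sketch of the post-hoc router training described above: the router is supervised with cross-entropy against the cluster assignments, with cosine annealing over 25 epochs. The optimizer choice (AdamW) is an assumption, the 4-step gradient accumulation is omitted for brevity, and `router` and `train_loader` are placeholders.

```python
# Hypothetical post-hoc router training loop (cross-entropy on cluster assignments).
# Assumptions: AdamW optimizer; 4-step gradient accumulation omitted for brevity.
import torch
import torch.nn.functional as F

EPOCHS = 25
opt = torch.optim.AdamW(router.parameters(), lr=5e-5)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=EPOCHS)

for epoch in range(EPOCHS):
    for latents, t, text_emb, cluster_id in train_loader:     # full dataset, all clusters
        logits = router(latents, t, text_emb)                 # (B, 8) expert scores
        loss = F.cross_entropy(logits, cluster_id)            # supervise with cluster labels
        opt.zero_grad(set_to_none=True)
        loss.backward()
        opt.step()
    sched.step()                                              # cosine annealing per epoch
```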
---
# Citation
```bibtex
@misc{jiang2025paris,
title={Paris: A Decentralized Trained Open-Weight Diffusion Model},
author={Jiang, Zhiying and Seraj, Raihan and Villagra, Marcos and Roy, Bidhan},
year={2025},
eprint={2510.03434},
archivePrefix={arXiv},
primaryClass={cs.GR},
url={https://arxiv.org/abs/2510.03434}
}
```
---
# License
MIT License – Open for research and commercial use.
Made with ❤️ by <a href="https://twitter.com/bageldotcom" target="_blank"><img src="https://img.shields.io/badge/Bagel_Labs-1DA1F2?style=for-the-badge&logo=twitter&logoColor=white" alt="Follow Bagel Labs on Twitter" height="28"></a>