---
license: mit
tags:
- text-to-image
- diffusion
- multi-expert
- dit
- laion
- distributed
- decentralized
- flow-matching
---
<img src="images/bagel_labs_logo.png" alt="Bagel Labs" height="28" style="margin-bottom: 20px;"/>
<h1 style="font-size: 28px; margin-bottom: 20px;">Paris: A Decentralized Trained Open-Weight Diffusion Model</h1>
<a href="https://huggingface.co/bageldotcom/paris" target="_blank">
<img src="https://img.shields.io/badge/🤗_DOWNLOAD_MODEL_WEIGHTS-FFD21E?style=for-the-badge&logoColor=000000" alt="Download Model Weights" height="40">
</a>
<a href="https://github.com/bageldotcom/paris" target="_blank">
<img src="https://img.shields.io/badge/⭐_STAR_ON_GITHUB-100000?style=for-the-badge&logo=github&logoColor=white" alt="Star on GitHub" height="40">
</a>
<a href="https://github.com/bageldotcom/paris/blob/main/paper.pdf" target="_blank">
<img src="https://img.shields.io/badge/📄_READ_PAPER-FF6B6B?style=for-the-badge&logoColor=white" alt="Read Technical Report" height="40">
</a>
<div style="margin-top: 20px;"></div>
The world's first open-weight diffusion model trained entirely through decentralized computation. Paris consists of 8 expert diffusion models (129M–605M parameters each) trained in complete isolation, with no synchronization of gradients, parameters, or intermediate activations, achieving better parallelism efficiency than traditional distributed training while using 14× less data and 16× less compute than prior baselines. [Read our technical report](https://github.com/bageldotcom/paris/blob/main/paper.pdf) to learn more.
# Key Characteristics
- 8 independently trained expert diffusion models (605M parameters each, 4.84B total)
- No gradient synchronization, parameter sharing, or activation exchange among nodes during training
- Lightweight transformer router (~129M parameters) for dynamic expert selection
- Trained on 11M LAION-Aesthetic images in ~120 A40 GPU-days
- 14× less training data than prior decentralized baselines
- 16× less compute than prior decentralized baselines
- Competitive generation quality (FID 12.45 with DiT-XL/2 experts)
- Open weights for research and commercial use under MIT license
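
The released checkpoints can be fetched directly from this repository with `huggingface_hub`. A minimal sketch is below; the file layout inside the repo (expert and router checkpoint names) is an assumption, so inspect the downloaded directory before loading anything.

```python
# Minimal sketch: download the Paris checkpoints from the Hub.
# Assumption: expert and router weights live as files inside this repo;
# check the downloaded directory for the exact layout.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="bageldotcom/paris")
print(local_dir)  # local path containing the downloaded checkpoint files
```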
---
# Examples
![Paris Generation Examples](images/generated_images.png)
*Text-conditioned image generation samples using Paris across diverse prompts and visual styles*
---
# Architecture Details
| Component | Specification |
|-----------|--------------|
| **Model Scale** | DiT-XL/2 |
| **Parameters per Expert** | 605M |
| **Total Expert Parameters** | 4.84B (8 experts) |
| **Router Parameters** | ~129M |
| **Hidden Dimensions** | 1152 |
| **Transformer Layers** | 28 |
| **Attention Heads** | 16 |
| **Patch Size** | 2×2 (latent space) |
| **Latent Resolution** | 32×32×4 |
| **Image Resolution** | 256×256 |
| **Text Conditioning** | CLIP ViT-L/14 |
| **VAE** | sd-vae-ft-mse (8× downsampling) |
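
The VAE and text encoder in the table are standard public checkpoints, so the latent and conditioning stack can be sketched with `diffusers` and `transformers`. This illustrates only the 256×256 → 32×32×4 encoding and the CLIP ViT-L/14 conditioning; the Paris expert and router modules are not shown, and their loading code is repository-specific.

```python
# Sketch of the latent/conditioning stack implied by the table (experts not shown).
import torch
from diffusers import AutoencoderKL
from transformers import CLIPTextModel, CLIPTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# VAE: sd-vae-ft-mse with 8x spatial downsampling (256x256 RGB -> 32x32x4 latents)
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").to(device)

# Text conditioning: CLIP ViT-L/14
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(device)

prompt = "a watercolor painting of the Eiffel Tower at dawn"
tokens = tokenizer(prompt, padding="max_length", max_length=77,
                   truncation=True, return_tensors="pt").to(device)
text_emb = text_encoder(**tokens).last_hidden_state          # (1, 77, 768)

# Encode an image batch into the latent space the experts operate in.
images = torch.randn(1, 3, 256, 256, device=device)          # placeholder; real inputs are normalized to [-1, 1]
latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor
print(latents.shape)                                          # torch.Size([1, 4, 32, 32])
```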
---
# Training Approach
Paris implements fully decentralized training where:
- Each expert trains independently on a semantically coherent data partition produced by DINOv2-based clustering (see the sketch below)
- No gradient synchronization, parameter sharing, or activation exchange between experts during training
- Experts trained asynchronously across AWS, GCP, local clusters, and Runpod instances at different speeds
- Router trained post-hoc on full dataset for expert selection during inference
- Complete computational independence eliminates requirements for specialized interconnects (InfiniBand, NVLink)
![Training Architecture](images/training_architecture.png)
*Paris training phase showing complete asynchronous isolation across heterogeneous compute clusters. Unlike traditional parallelization strategies (Data/Pipeline/Model Parallelism), Paris requires zero communication during training.*
This zero-communication approach enables training on fragmented compute resources, eliminating the dedicated, tightly interconnected GPU cluster that traditional diffusion model training requires.
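
A rough sketch of the data-partitioning step described above, assuming off-the-shelf DINOv2 features from `transformers` and k-means from scikit-learn; the paper's exact feature-extraction and clustering settings may differ.

```python
# Hypothetical sketch: split the dataset into 8 semantic clusters with DINOv2 features.
import torch
from transformers import AutoImageProcessor, AutoModel
from sklearn.cluster import KMeans

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
dino = AutoModel.from_pretrained("facebook/dinov2-base").eval()

@torch.no_grad()
def embed(pil_images):
    """Return one DINOv2 [CLS] embedding per image."""
    inputs = processor(images=pil_images, return_tensors="pt")
    return dino(**inputs).last_hidden_state[:, 0]             # (N, 768)

# `dataset_images` is a placeholder for the image collection, embedded in batches.
features = embed(dataset_images)
labels = KMeans(n_clusters=8, random_state=0).fit_predict(features.numpy())

# Expert k then trains only on the images with labels == k,
# with no communication with the other experts.
```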
**Comparison with Traditional Parallelization**
| **Strategy** | **Synchronization** | **Straggler Impact** | **Topology Requirements** |
|--------------|---------------------|---------------------|---------------------------|
| Data Parallel | Periodic all-reduce | Slowest worker blocks iteration | Latency-sensitive cluster |
| Model Parallel | Sequential layer transfers | Slowest layer blocks pipeline | Linear pipeline |
| Pipeline Parallel | Stage-to-stage per microbatch | Bubble overhead from slowest stage | Linear pipeline |
| **Paris** | **No synchronization** | **No blocking** | **Arbitrary** |
---
# Routing Strategies
- **`top-1`** (default): Single best expert per step. Fastest inference, competitive quality.
- **`top-2`**: Weighted ensemble of top-2 experts. Often best quality, 2× inference cost.
- **`full-ensemble`**: All 8 experts weighted by router. Highest compute (8× cost).
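
The three strategies differ only in how expert predictions are combined at each denoising step. A toy sketch is below; `experts`, `router_logits`, and the call signature `expert(x_t, t, text_emb)` are illustrative stand-ins, not the repository API.

```python
# Illustrative routing sketch (names and call signatures are assumptions).
import torch
import torch.nn.functional as F

def route_prediction(experts, router_logits, x_t, t, text_emb, strategy="top-1"):
    """Combine expert denoising predictions according to the routing strategy."""
    probs = F.softmax(router_logits, dim=-1)                  # (num_experts,)

    if strategy == "top-1":                                    # single best expert
        k = int(torch.argmax(probs))
        return experts[k](x_t, t, text_emb)

    if strategy == "top-2":                                    # weighted pair of experts
        vals, idx = torch.topk(probs, k=2)
        w = vals / vals.sum()                                  # renormalize the two weights
        return sum(w[i] * experts[int(idx[i])](x_t, t, text_emb) for i in range(2))

    # full-ensemble: all experts, weighted by the router distribution
    return sum(probs[k] * expert(x_t, t, text_emb) for k, expert in enumerate(experts))
```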
![Paris Inference Pipeline](images/paris_inference.png)
*Multi-expert inference pipeline showing router-based expert selection and three different routing strategies: Top-1 (fastest), Top-2 (best quality), and Full Ensemble (highest compute).*
---
# Performance Metrics
**Multi-Expert vs. Monolithic on LAION-Art (DiT-B/2)**
| **Inference Strategy** | **FID-50K ↓** |
|------------------------|---------------|
| Monolithic (single model) | 29.64 |
| Paris Top-1 | 30.60 |
| **Paris Top-2** | **22.60** |
| Paris Full Ensemble | 47.89 |
*Top-2 routing achieves a 7.04-point FID improvement over the monolithic baseline, validating that targeted expert collaboration outperforms both a single model and naive ensemble averaging.*
---
# Training Details
**Hyperparameters (DiT-XL/2)**
| **Parameter** | **Value** |
|---------------|-----------|
| Dataset | LAION-Aesthetic (11M images) |
| Clustering | DINOv2 semantic features |
| Batch Size | 16 per expert (effective 32 with 2-step accumulation) |
| Learning Rate | 2e-5 (AdamW, no scheduling) |
| Training Steps | ~120k total across experts (asynchronous) |
| EMA Decay | 0.9999 |
| Mixed Precision | FP16 with automatic loss scaling |
| Conditioning | AdaLN-Single (23% parameter reduction) |
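
As a concrete reading of the table, here is a condensed per-expert training step: AdamW at 2e-5 with no scheduler, FP16 with loss scaling, 2-step gradient accumulation, and an EMA shadow copy with decay 0.9999. `expert`, `ema_expert`, `diffusion_loss`, and `loader` are placeholders for the repository's actual modules and data pipeline.

```python
# Hypothetical per-expert training step matching the hyperparameter table above.
import torch

opt = torch.optim.AdamW(expert.parameters(), lr=2e-5)         # no LR scheduling
scaler = torch.cuda.amp.GradScaler()                          # FP16 automatic loss scaling
ACCUM_STEPS = 2                                               # batch 16 -> effective batch 32

for step, (latents, text_emb) in enumerate(loader):           # this expert's cluster only
    with torch.cuda.amp.autocast():
        loss = diffusion_loss(expert, latents, text_emb) / ACCUM_STEPS
    scaler.scale(loss).backward()

    if (step + 1) % ACCUM_STEPS == 0:
        scaler.step(opt)
        scaler.update()
        opt.zero_grad(set_to_none=True)

        # EMA update (decay 0.9999) on the shadow copy used for sampling
        with torch.no_grad():
            for p, ema_p in zip(expert.parameters(), ema_expert.parameters()):
                ema_p.mul_(0.9999).add_(p, alpha=1 - 0.9999)
```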
**Router Training**
| **Parameter** | **Value** |
|---------------|-----------|
| Architecture | DiT-B (smaller than experts) |
| Batch Size | 64 with 4-step accumulation (effective 256) |
| Learning Rate | 5e-5 with cosine annealing (25 epochs) |
| Loss | Cross-entropy on cluster assignments |
| Training | Post-hoc on full dataset |
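
A minimal sketch of the post-hoc router training described above: the router is supervised with cross-entropy against the cluster assignments, with cosine annealing over 25 epochs. The optimizer choice (AdamW) is an assumption, the 4-step gradient accumulation is omitted for brevity, and `router` and `train_loader` are placeholders.

```python
# Hypothetical post-hoc router training loop (cross-entropy on cluster assignments).
# Assumptions: AdamW optimizer; 4-step gradient accumulation omitted for brevity.
import torch
import torch.nn.functional as F

EPOCHS = 25
opt = torch.optim.AdamW(router.parameters(), lr=5e-5)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=EPOCHS)

for epoch in range(EPOCHS):
    for latents, t, text_emb, cluster_id in train_loader:     # full dataset, all clusters
        logits = router(latents, t, text_emb)                 # (B, 8) expert scores
        loss = F.cross_entropy(logits, cluster_id)            # supervise with cluster labels
        opt.zero_grad(set_to_none=True)
        loss.backward()
        opt.step()
    sched.step()                                              # cosine annealing per epoch
```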
---
# Citation
```bibtex
@misc{jiang2025paris,
title={Paris: A Decentralized Trained Open-Weight Diffusion Model},
author={Jiang, Zhiying and Seraj, Raihan and Villagra, Marcos and Roy, Bidhan},
year={2025},
eprint={2510.03434},
archivePrefix={arXiv},
primaryClass={cs.GR},
url={https://arxiv.org/abs/2510.03434}
}
```
---
# License
MIT License – Open for research and commercial use.
Made with ❤️ by <a href="https://twitter.com/bageldotcom" target="_blank"><img src="https://img.shields.io/badge/Bagel_Labs-1DA1F2?style=for-the-badge&logo=twitter&logoColor=white" alt="Follow Bagel Labs on Twitter" height="28"></a>