Linarix-v1: Text-to-Image
Linarix-v1 is a ~712M parameter text-to-image diffusion prototype model that generates 1024×1024px images from text prompts.
Instead of standard quadratic self-attention, it uses GatedDeltaNet-2 (GDN-2), a bidirectional Flash Linear Attention mixer with decoupled channel-wise erase/write gates, as the backbone of its transformer blocks. This keeps memory roughly flat with sequence length. Every 4th block adds a full SDPA layer with 2D RoPE for global spatial coherence, and both the image and cross-attention paths use Q/K RMSNorm to cap attention-logit growth.
Text conditioning uses Qwen3.5-2B (2048-dim embeddings, up to 320 tokens) read through an instruction prompt and chat template. Decoding uses the DC-AE f32c32 VAE with 32× spatial compression, producing 32×32 latents from 1024px images. Inference uses STORK-2 flow-matching sampling (Tan et al., 2025).
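The latent and token counts follow from the 32× compression and a patch size of 1 (one token per latent pixel). A minimal sketch of that arithmetic; the function name `latent_tokens` is illustrative and not part of the pipeline:

```python
def latent_tokens(image_px: int, vae_factor: int = 32, patch_size: int = 1) -> tuple[int, int]:
    """Return (latent side length, number of image tokens) for a square image."""
    if image_px % (vae_factor * patch_size) != 0:
        raise ValueError("image size must be divisible by vae_factor * patch_size")
    latent_side = image_px // vae_factor         # DC-AE f32: 32x spatial compression
    tokens_per_side = latent_side // patch_size  # patch size 1: one token per latent pixel
    return latent_side, tokens_per_side ** 2


for px in (512, 1024):
    side, tokens = latent_tokens(px)
    print(f"{px}px -> {side}x{side} latents, {tokens} tokens")
```

This reproduces the 256 tokens at 512px and 1024 tokens at 1024px quoted in the architecture table.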
Pre-trained on a ~5M-image 512px latent mix, then fine-tuned on ~1024px FineT2I latents.
Sample Outputs
Generated at 1024×1024px, STORK-2, 32 steps, CFG 4.0. Prompts (left to right, top to bottom):
- a misty pine forest at dawn, sunbeams through the trees
- a calm lake reflecting autumn mountains, golden hour
- a lavender field under a dramatic stormy sky
- a lighthouse on a rocky cliff above crashing waves
- a close-up portrait of a young woman, natural window light
- an old craftsman working at a wooden workbench, warm lamplight
- a red fox standing in fresh snow, looking at the camera
- a pair of swans on a still pond at sunrise
- a steam locomotive crossing a stone bridge
- a quiet cobblestone street in an old European town, evening
- a rustic bowl of strawberries on a linen cloth
Architecture
| Property | Value |
|---|---|
| Parameters | ~712M |
| Backbone | Bidirectional GatedDeltaNet-2 (Flash Linear Attention) |
| Depth | 24 layers (first 2 dual-stream) |
| Hidden dim | 896 |
| Heads | 14 |
| Image attention | Every 4th layer (full SDPA + 2D RoPE), Q/K RMSNorm |
| Cross attention | Sana-style, Q/K RMSNorm |
| Patch size | 1 (one token per latent pixel: 256 tokens @ 512px, 1024 tokens @ 1024px) |
| Text encoder | Qwen3.5-2B (Qwen/Qwen3.5-2B), 2048-dim, up to 320 tokens |
| VAE | DC-AE f32c32 (mit-han-lab/dc-ae-f32c32-sana-1.1-diffusers) |
| Sampler | STORK-2, 32 steps |
| Dtype | bfloat16 |
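To make the backbone choice concrete, the sketch below compares how many values a naive quadratic attention layer and a linear-attention layer hold per head. It assumes the head dimension is hidden dim / heads (896 / 14 = 64) and that the linear-attention state is a head_dim × head_dim matrix. Both are assumptions for illustration rather than details confirmed by this card, and fused SDPA kernels avoid materialising the full score matrix in practice.

```python
HIDDEN_DIM = 896
HEADS = 14
HEAD_DIM = HIDDEN_DIM // HEADS  # 64, assuming an even split across heads


def quadratic_scores_per_head(num_tokens: int) -> int:
    """Entries in a fully materialised tokens x tokens attention score matrix."""
    return num_tokens * num_tokens


def linear_state_per_head(head_dim: int = HEAD_DIM) -> int:
    """Entries in an assumed head_dim x head_dim recurrent state (independent of token count)."""
    return head_dim * head_dim


for tokens in (256, 1024, 4096):
    print(tokens, quadratic_scores_per_head(tokens), linear_state_per_head())
```

The score matrix grows with the square of the token count, while the assumed state size does not depend on it.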
Performance
Footprint is dominated by the text encoder; the denoiser itself is small (linear-attention backbone, near-flat with batch size). Approximate resident VRAM:
| Component | VRAM |
|---|---|
| DiT weights (EMA, bf16) | ~1.4 GB |
| Qwen3.5-2B text encoder | ~4.6 GB |
| DC-AE VAE | ~0.6 GB |
| All loaded (resident) | ~7 GB |
For best speed, install the fused linear-attention kernels (see below); without them the GDN mixer falls back to a slower torch reference path (same outputs, lower throughput). Pre-encoding prompts and keeping the text encoder off the GPU drops resident VRAM to roughly the DiT + VAE (~2 GB).
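The DiT figure is consistent with a simple parameters × bytes estimate. A rough sketch, assuming 2 bytes per bf16 parameter and counting weights only (activations, buffers, and allocator overhead are ignored); `weight_gb` is an illustrative helper:

```python
def weight_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# ~712M parameters stored in bf16
dit_gb = weight_gb(712e6)

# Per-component figures copied from the VRAM table in this section
components_gb = {"dit": 1.4, "text_encoder": 4.6, "vae": 0.6}
all_resident_gb = sum(components_gb.values())
without_text_encoder_gb = components_gb["dit"] + components_gb["vae"]

print(round(dit_gb, 2), round(all_resident_gb, 1), round(without_text_encoder_gb, 1))
```

The component sum lands slightly under the ~7 GB resident figure, which is plausible once runtime overhead is included.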
Usage
Requires diffusers >= 0.38.0; earlier versions have a trust_remote_code RCE (advisory). For production, pin a commit hash with revision= so the remote code can't change under you.
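One way to enforce the pinning advice is to refuse anything that is not a full commit hash before calling `from_pretrained`. The sketch below is an illustration under assumptions: `pinned_load_kwargs` and `load_pinned` are hypothetical helper names, and the hash you pass must come from this repo's commit history.

```python
import re

# A full git commit hash is 40 lowercase hex characters.
_COMMIT_RE = re.compile(r"[0-9a-f]{40}")


def pinned_load_kwargs(revision: str) -> dict:
    """Build from_pretrained kwargs that pin the repo to one full commit hash."""
    if not _COMMIT_RE.fullmatch(revision):
        raise ValueError("expected a full 40-character commit hash, not a branch or tag")
    return {
        "custom_pipeline": "pipeline_boomer",
        "trust_remote_code": True,
        "revision": revision,
    }


def load_pinned(revision: str):
    """Load the pipeline at a fixed commit (requires torch and diffusers)."""
    # Imported lazily so pinned_load_kwargs works without the heavy dependencies.
    import torch
    from diffusers import DiffusionPipeline

    return DiffusionPipeline.from_pretrained(
        "Akrao9/Linarix-v1",
        torch_dtype=torch.bfloat16,
        **pinned_load_kwargs(revision),
    )
```

Rejecting branch names such as `main` up front keeps a mutable reference from slipping into a production config.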
Install
pip install -U "diffusers>=0.38.0" transformers accelerate safetensors torchvision scipy
pip install git+https://github.com/fla-org/flash-linear-attention.git
pip install causal-conv1d # optional, enables the fused GDN fast path
Generate
import torch
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained(
    "Akrao9/Linarix-v1",
    custom_pipeline="pipeline_boomer",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to("cuda")
image = pipe("a lighthouse on a rocky cliff above crashing waves at golden hour")[0]
image.save("output.png")
Optional generation parameters:
image = pipe(
    "a quiet cobblestone street in an old European town, evening",
    steps=32,         # STORK-2 denoising steps
    cfg_scale=4.0,    # classifier-free guidance (4.0 recommended)
    cfg_rescale=0.5,  # reduces over-saturation / dark crush at higher CFG
    seed=42,
)[0]
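For intuition about how cfg_scale and cfg_rescale interact, here is a minimal pure-Python sketch of classifier-free guidance followed by the widely used guidance-rescale step. Whether the pipeline implements exactly this formulation is an assumption; `guided_prediction` is an illustrative name, and the pipeline's own code is the authority.

```python
from statistics import pstdev


def guided_prediction(cond, uncond, cfg_scale=4.0, cfg_rescale=0.5):
    """Combine conditional and unconditional predictions (flat lists of floats)."""
    # Classifier-free guidance: extrapolate away from the unconditional prediction.
    guided = [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]
    std_guided = pstdev(guided)
    if cfg_rescale == 0.0 or std_guided == 0.0:
        return guided
    # Rescale so the guided prediction has the same spread as the conditional one,
    # then blend the rescaled and raw guided predictions.
    factor = pstdev(cond) / std_guided
    rescaled = [g * factor for g in guided]
    return [cfg_rescale * r + (1.0 - cfg_rescale) * g for r, g in zip(rescaled, guided)]


cond = [1.0, -1.0, 2.0, -2.0]
uncond = [0.5, -0.5, 1.0, -1.0]
print(guided_prediction(cond, uncond, cfg_scale=4.0, cfg_rescale=0.0))
print(guided_prediction(cond, uncond, cfg_scale=4.0, cfg_rescale=0.5))
```

In this formulation, higher guidance inflates the spread of the prediction, and the rescale term pulls it back toward the spread of the conditional prediction, which is the usual explanation for reduced over-saturation.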
The transformer weights download from this repo. The DC-AE VAE and Qwen3.5-2B text encoder are fetched from their upstream Hugging Face repos on first use, so run hf auth login beforehand.
Batched inference
Pass a list of prompts to generate a batch in one call:
images = pipe(["a lighthouse above crashing waves",
               "a red fox in fresh snow",
               "a steam locomotive on a stone bridge"], cfg_scale=4.0)
images[0].save("a.png")
images[1].save("b.png")
For throughput, three things matter:
- Install the fused kernels. Without flash-linear-attention (and optionally flash-attn), the GDN mixer and the full-attention layers run slow, memory-heavy PyTorch fallbacks, so both speed and peak VRAM scale far worse than they should.
- VAE slicing is on by default (it decodes one image at a time), so batched-decode memory stays flat. Toggle the standard diffusers methods if you need to: pipe.enable_vae_slicing() / pipe.disable_vae_slicing(), and pipe.enable_vae_tiling() for very large images on low VRAM. These apply even before the VAE is lazily loaded.
- Keep components resident (.to("cuda")) for benchmarking; CPU offload adds per-call module-shuffle overhead that dominates small batches.
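When the prompt list is longer than one batch should be, it can be split into fixed-size chunks and fed through the pipeline one chunk at a time. A minimal sketch, assuming the pipeline returns an iterable with one image per prompt (as the batched call above suggests); `chunked` and `generate_in_batches` are illustrative helpers, not part of the pipeline:

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def generate_in_batches(pipe, prompts: Sequence[str], batch_size: int = 4, **kwargs) -> list:
    """Run `pipe` on each chunk of prompts and collect the images in prompt order."""
    images = []
    for batch in chunked(prompts, batch_size):
        # Assumption: pipe(list_of_prompts, ...) returns one image per prompt.
        images.extend(pipe(batch, **kwargs))
    return images


# Example (requires a loaded pipeline):
# images = generate_in_batches(pipe, prompts, batch_size=4, cfg_scale=4.0)
```

The right batch size depends on available VRAM and on whether the fused kernels are installed, so it is worth measuring rather than guessing.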
Prompting tips
- Use composed, descriptive prompts (scene + subject). Bare one-word prompts can produce duplicated subjects; adding composition resolves it.
- CFG 4.0 is the sweet spot. Higher (5+) over-saturates and hardens fine detail; lower softens prompt adherence.
- For exact object counts (e.g. "a pair"), do a quick 4 to 6 seed sweep and pick the one that lands the count; count is set early by the seed, not by guidance.
- For faces, prefer plain descriptors; words like "freckles" can render as skin blemishes. Three-quarter / profile poses are cleaner than direct frontal close-ups.
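The seed sweep suggested above can be scripted in a few lines. This sketch assumes only what the usage examples show: the pipeline accepts a `seed` keyword and returns something whose first element is the image. `seed_sweep` is an illustrative name.

```python
def seed_sweep(pipe, prompt, seeds=(0, 1, 2, 3, 4, 5), **kwargs):
    """Generate one image per seed and return them keyed by seed."""
    results = {}
    for seed in seeds:
        # Same prompt and settings each time; only the seed changes.
        results[seed] = pipe(prompt, seed=seed, **kwargs)[0]
    return results


# Example (requires a loaded pipeline):
# for seed, image in seed_sweep(pipe, "a pair of swans on a still pond at sunrise",
#                               cfg_scale=4.0).items():
#     image.save(f"swans_seed{seed}.png")
```

Reviewing the saved images side by side makes it easy to pick the seed with the right count, then reuse that seed for later variations.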
Capabilities and limitations
Strong:
- Landscapes, natural environments, atmospheric and scenic scenes
- Rigid man-made structures: lighthouses, locomotives, stone bridges, and architecture render coherently
- Portraits with composed prompts: clean faces and skin
- Still life and subject-in-scene composition
Works, with care:
- Frontal close-up faces: good, but eyes/iris are the remaining frontier; keep CFG ≤ 4 and avoid blemish-trigger words
- Exact object counts: "a pair" may render three; resolve with a seed sweep
- Dense, multi-attribute prompts: spatially-localized attributes render well; occasional minor blemishes on small repeated elements
Less reliable:
- Very fine detail (iris, small faces), legible text in images
- Precise counting of repeated subjects
Other notes:
- Not safety filtered; outputs may reflect biases in the training data
- Maximum tested resolution: 1024×1024px
Acknowledgements
- GatedDeltaNet-2: linear-attention backbone, via flash-linear-attention
- STORK-2: inference sampling (Tan et al., 2025)
- DC-AE: latent autoencoder (mit-han-lab/dc-ae-f32c32-sana-1.1-diffusers)
- Qwen3.5-2B: text encoder (Qwen/Qwen3.5-2B)
- Flow-matching with logit-normal timestep sampling and flow shift 3.0 (SANA-style)
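For readers unfamiliar with the last item, the sketch below shows one common reading of logit-normal timestep sampling and a flow shift of 3.0. It is an illustration under assumptions: the mean and standard deviation of the underlying normal, the convention that t = 1 is pure noise, and the exact shift formula are not stated in this card, and the function names are hypothetical.

```python
import math
import random


def logit_normal_timestep(rng: random.Random, mean: float = 0.0, std: float = 1.0) -> float:
    """Sample t in (0, 1) as the sigmoid of a normal draw."""
    return 1.0 / (1.0 + math.exp(-rng.gauss(mean, std)))


def shift_timestep(t: float, shift: float = 3.0) -> float:
    """Warp t toward 1 using the common shift * t / (1 + (shift - 1) * t) form."""
    return shift * t / (1.0 + (shift - 1.0) * t)


rng = random.Random(0)
samples = [shift_timestep(logit_normal_timestep(rng)) for _ in range(5)]
print(samples)
```

With a shift above 1, mid-range timesteps move toward the noisier end (0.5 maps to 0.75 for a shift of 3.0), while the endpoints 0 and 1 stay fixed.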
