Linarix-v1 — Text-to-Image

Linarix-v1 is a prototype text-to-image diffusion model with ~712M parameters that generates 1024×1024px images from text prompts.

Instead of standard quadratic self-attention, it uses GatedDeltaNet-2 (GDN-2) — a bidirectional Flash Linear Attention mixer with decoupled channel-wise erase/write gates — as the backbone of its transformer blocks. This keeps memory roughly flat with sequence length. Every 4th block adds a full SDPA layer with 2D RoPE for global spatial coherence, and both the image and cross-attention paths use Q/K RMSNorm to cap attention-logit growth.
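To illustrate why Q/K RMSNorm caps attention-logit growth, here is a minimal pure-Python sketch. It assumes a plain RMSNorm without the learned gain that real layers usually carry, and a head dimension of 64 (896 hidden dim split evenly across 14 heads). The function names are illustrative and not taken from the Linarix code.

```python
import math

def rms_norm(vec, eps=1e-6):
    """Scale vec so its root-mean-square is ~1 (learned gain omitted in this sketch)."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]

def attention_logit(q, k, normalize=True):
    """Scaled dot-product logit for a single query/key pair."""
    if normalize:
        q, k = rms_norm(q), rms_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

# Assumed head dim: 896 hidden / 14 heads = 64.
head_dim = 64
q = [1000.0] * head_dim  # deliberately huge activations
k = [1000.0] * head_dim

raw = attention_logit(q, k, normalize=False)    # grows with activation scale
capped = attention_logit(q, k, normalize=True)  # bounded by sqrt(head_dim)
```

After normalization each vector has norm of about sqrt(head_dim), so the scaled logit cannot exceed sqrt(head_dim) in magnitude no matter how large the raw activations become.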

Text conditioning uses Qwen3.5-2B (2048-dim embeddings, up to 320 tokens) read through an instruction prompt and chat template. Images are encoded and decoded with the DC-AE f32c32 VAE, whose 32× spatial compression maps a 1024px image to a 32×32 latent. Inference uses STORK-2 flow-matching sampling (Tan et al., 2025).

The model was pre-trained on a ~5M-image mix of 512px latents, then fine-tuned on ~1024px FineT2I latents.

Sample Outputs

Generated at 1024×1024px, STORK-2, 32 steps, CFG 4.0. Prompts (left→right, top→bottom):

a misty pine forest at dawn, sunbeams through the trees
a calm lake reflecting autumn mountains, golden hour
a lavender field under a dramatic stormy sky
a lighthouse on a rocky cliff above crashing waves
a close-up portrait of a young woman, natural window light
an old craftsman working at a wooden workbench, warm lamplight
a red fox standing in fresh snow, looking at the camera
a pair of swans on a still pond at sunrise
a steam locomotive crossing a stone bridge
a quiet cobblestone street in an old European town, evening
a rustic bowl of strawberries on a linen cloth

Architecture

Property	Value
Parameters	~712M
Backbone	Bidirectional GatedDeltaNet-2 (Flash Linear Attention)
Depth	24 layers (first 2 dual-stream)
Hidden dim	896
Heads	14
Image attention	Every 4th layer (full SDPA + 2D RoPE), Q/K RMSNorm
Cross attention	Sana-style, Q/K RMSNorm
Patch size	1 — one token per latent pixel (256 tokens @ 512px, 1024 tokens @ 1024px)
Text encoder	Qwen3.5-2B (`Qwen/Qwen3.5-2B`), 2048-dim, up to 320 tokens
VAE	DC-AE f32c32 (`mit-han-lab/dc-ae-f32c32-sana-1.1-diffusers`)
Sampler	STORK-2, 32 steps
Dtype	bfloat16
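
The token counts in the table follow from the VAE's 32× compression and the patch size of 1. A small sketch of that arithmetic (the helper names are illustrative, not part of the pipeline):

```python
def latent_side(image_px: int, vae_downsample: int = 32) -> int:
    """Side length of the latent grid for a square image."""
    if image_px % vae_downsample:
        raise ValueError("image size must be a multiple of the VAE downsample factor")
    return image_px // vae_downsample

def token_count(image_px: int, vae_downsample: int = 32, patch_size: int = 1) -> int:
    """Number of image tokens the transformer sees for a square image."""
    side = latent_side(image_px, vae_downsample) // patch_size
    return side * side

tokens_512 = token_count(512)    # 16 x 16 latent grid
tokens_1024 = token_count(1024)  # 32 x 32 latent grid
```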

Performance

The memory footprint is dominated by the text encoder; the denoiser itself is small, and its linear-attention backbone keeps memory near-flat as batch size grows. Approximate resident VRAM:

Component	VRAM
DiT weights (EMA, bf16)	~1.4 GB
Qwen3.5-2B text encoder	~4.6 GB
DC-AE VAE	~0.6 GB
All loaded (resident)	~7 GB

For best speed, install the fused linear-attention kernels (see below) — without them the GDN mixer falls back to a slower torch reference path (same outputs, lower throughput). Pre-encoding prompts and keeping the text encoder off the GPU drops resident VRAM to roughly the DiT + VAE (~2 GB).
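
As a rough cross-check of the table, weight memory can be estimated as parameter count times bytes per parameter. The sketch below assumes decimal gigabytes and counts weights only for the DiT estimate; the per-component figures are copied from the table above.

```python
def weight_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes); bf16 uses 2 bytes per parameter."""
    return num_params * bytes_per_param / 1e9

# ~712M parameters in bf16, matching the ~1.4 GB DiT row.
dit_gb = weight_gb(712e6)

# Per-component figures from the table (approximate, in GB).
resident = {"dit": 1.4, "text_encoder": 4.6, "vae": 0.6}
all_loaded_gb = sum(resident.values())                          # listed as ~7 GB
without_text_encoder_gb = resident["dit"] + resident["vae"]     # the ~2 GB case
```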

Usage

Requires diffusers >= 0.38.0; earlier versions are affected by a trust_remote_code remote code execution (RCE) vulnerability (advisory). For production, pin a commit hash with revision= so the remote code can't change under you.
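
One way to make the pinning advice hard to skip is to build the loading arguments through a small helper that rejects anything other than a full commit hash. This is a sketch, not part of the pipeline: `pinned_load_kwargs` is a hypothetical name, and the 40-character hexadecimal check assumes a standard full Git commit SHA.

```python
import re

def pinned_load_kwargs(revision: str) -> dict:
    """Build from_pretrained kwargs that pin the repo to one commit (hypothetical helper)."""
    # A branch name such as "main" can move; a full commit SHA cannot.
    if not re.fullmatch(r"[0-9a-f]{40}", revision):
        raise ValueError("revision must be a full 40-character commit hash")
    return {
        "custom_pipeline": "pipeline_boomer",
        "trust_remote_code": True,
        "revision": revision,
    }

# Usage (requires network access and a real commit hash from the repo history):
# pipe = DiffusionPipeline.from_pretrained(
#     "Akrao9/Linarix-v1",
#     torch_dtype=torch.bfloat16,
#     **pinned_load_kwargs("<full commit hash>"),
# )
```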

Install

pip install -U "diffusers>=0.38.0" transformers accelerate safetensors torchvision scipy
pip install git+https://github.com/fla-org/flash-linear-attention.git
pip install causal-conv1d   # optional, enables the fused GDN fast path

Generate

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Akrao9/Linarix-v1",
    custom_pipeline="pipeline_boomer",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe("a lighthouse on a rocky cliff above crashing waves at golden hour")[0]
image.save("output.png")

Optional generation parameters:

image = pipe(
    "a quiet cobblestone street in an old European town, evening",
    steps=32,        # STORK-2 denoising steps
    cfg_scale=4.0,   # classifier-free guidance (4.0 recommended)
    cfg_rescale=0.5, # reduces over-saturation / dark crush at higher CFG
    seed=42,
)[0]
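
How the pipeline applies cfg_scale and cfg_rescale internally is not documented here. The sketch below shows the widely used formulation (classifier-free guidance followed by a standard-deviation rescale) on plain Python lists, purely to illustrate what the two parameters typically control. Treat it as an assumption about the behavior, not as the Linarix implementation; real implementations compute the statistics per sample on tensors.

```python
from statistics import pstdev

def guided_prediction(cond, uncond, cfg_scale=4.0, cfg_rescale=0.5):
    """Combine conditional and unconditional predictions (flat lists of floats)."""
    # Classifier-free guidance: push the prediction away from the unconditional one.
    guided = [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]

    # Guidance inflates the spread of the prediction; rescale it back toward
    # the spread of the conditional prediction to limit over-saturation.
    std_cond, std_guided = pstdev(cond), pstdev(guided)
    if std_guided == 0.0:
        return guided
    rescaled = [g * std_cond / std_guided for g in guided]

    # cfg_rescale blends between the raw guided (0.0) and fully rescaled (1.0) outputs.
    return [cfg_rescale * r + (1.0 - cfg_rescale) * g for r, g in zip(rescaled, guided)]
```

With cfg_rescale=0.0 this reduces to plain guidance, and with cfg_rescale=1.0 the output has the same standard deviation as the conditional prediction.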

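The transformer weights download from this repo. The DC-AE VAE and Qwen3.5-2B text encoder are fetched from their upstream Hugging Face repos on first use. Because access to this repository requires accepting its conditions, accept them on the model page and run hf auth login before first use.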

Batched inference

Pass a list of prompts to generate a batch in one call:

images = pipe(["a lighthouse above crashing waves",
               "a red fox in fresh snow",
               "a steam locomotive on a stone bridge"], cfg_scale=4.0)
for name, img in zip(("a", "b", "c"), images):  # save all three results
    img.save(f"{name}.png")

For throughput, three things matter:

Install the fused kernels — without flash-linear-attention (and optionally flash-attn) the GDN mixer and the full-attention layers run slow, memory-heavy PyTorch fallbacks, so both speed and peak VRAM scale far worse than they should.
VAE slicing is on by default (decodes one image at a time) so batched-decode memory stays flat. Toggle the standard diffusers methods if you need to: pipe.enable_vae_slicing() / pipe.disable_vae_slicing(), and pipe.enable_vae_tiling() for very large images on low VRAM. These apply even before the VAE is lazily loaded.
Keep components resident (.to("cuda")) for benchmarking; CPU offload adds per-call module-shuffle overhead that dominates small batches.
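
For a like-for-like throughput comparison, a small timing helper along these lines can be used. It is a sketch with hypothetical names: `generate` is any callable that takes a list of prompts and returns a list of images, and `sync` is an optional callable such as `torch.cuda.synchronize` that waits for queued GPU work before the clock is read.

```python
import time

def images_per_second(generate, prompts, warmup=1, runs=3, sync=None):
    """Return mean images/s for generate(prompts), excluding warmup calls."""
    # Warmup absorbs one-off costs such as lazy model loading and kernel compilation.
    for _ in range(warmup):
        generate(prompts)
    if sync is not None:
        sync()

    start = time.perf_counter()
    produced = 0
    for _ in range(runs):
        produced += len(generate(prompts))
    if sync is not None:
        sync()
    elapsed = time.perf_counter() - start
    return produced / elapsed
```

With the pipeline loaded as above, a call might look like `images_per_second(lambda p: pipe(p, cfg_scale=4.0), prompts, sync=torch.cuda.synchronize)`.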

Prompting tips

Use composed, descriptive prompts (scene + subject). Bare one-word prompts can produce duplicated subjects; adding composition details usually resolves this.
CFG 4.0 is the sweet spot. Higher values (5+) over-saturate and harden fine detail; lower values weaken prompt adherence.
For exact object counts (e.g. "a pair"), do a quick 4–6 seed sweep and pick the one that lands the count — count is set early by the seed, not by guidance.
For faces, prefer plain descriptors; words like "freckles" can render as skin blemishes. Three-quarter / profile poses are cleaner than direct frontal close-ups.
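
The seed sweep suggested above can be sketched as a short helper. `seed_sweep` is a hypothetical name; it only assumes that the pipeline accepts a seed keyword and returns a list of images, as in the examples in the Usage section.

```python
def seed_sweep(generate, prompt, seeds=range(6), **kwargs):
    """Render the same prompt once per seed and return {seed: first image}."""
    return {seed: generate(prompt, seed=seed, **kwargs)[0] for seed in seeds}

# Usage with the loaded pipeline (not run here):
# candidates = seed_sweep(pipe, "a pair of swans on a still pond at sunrise", cfg_scale=4.0)
# for seed, image in candidates.items():
#     image.save(f"swans_seed{seed}.png")
```

Saving every candidate with its seed in the filename makes it easy to pick the image that lands the count and reproduce it later.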

Capabilities and limitations

Strong:

Landscapes, natural environments, atmospheric and scenic scenes
Rigid man-made structures — lighthouses, locomotives, stone bridges, architecture render coherently
Portraits with composed prompts — clean faces and skin
Still life and subject-in-scene composition

Works, with care:

Frontal close-up faces: generally good, but eyes and irises remain the weakest area; keep CFG ≤ 4 and avoid blemish-trigger words
Exact object counts — "a pair" may render three; resolve with a seed sweep
Dense, multi-attribute prompts — spatially-localized attributes render well; occasional minor blemishes on small repeated elements

Less reliable:

Very fine detail (iris, small faces), legible text in images
Precise counting of repeated subjects

Other notes:

Not safety filtered — outputs may reflect biases in the training data
Maximum tested resolution: 1024×1024px

Acknowledgements

GatedDeltaNet-2 — linear-attention backbone, via flash-linear-attention
STORK-2 — inference sampling (Tan et al., 2025)
DC-AE — latent autoencoder (mit-han-lab/dc-ae-f32c32-sana-1.1-diffusers)
Qwen3.5-2B — text encoder (Qwen/Qwen3.5-2B)
Flow-matching with logit-normal timestep sampling and flow shift 3.0 (SANA-style)
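
For readers unfamiliar with the last item, the sketch below shows one common formulation of logit-normal timestep sampling and flow shift, as used in several flow-matching image models. The exact training recipe is not given in this card, so the logit-normal mean and standard deviation (0 and 1), the shift formula, and the convention that t = 1 is pure noise are all assumptions.

```python
import math
import random

def logit_normal_timestep(rng: random.Random, mean: float = 0.0, std: float = 1.0) -> float:
    """Sample t in (0, 1) by passing a Gaussian draw through a sigmoid."""
    return 1.0 / (1.0 + math.exp(-rng.gauss(mean, std)))

def shift_timestep(t: float, shift: float = 3.0) -> float:
    """Warp t toward the high-noise end; shift=1.0 leaves t unchanged."""
    return shift * t / (1.0 + (shift - 1.0) * t)

rng = random.Random(0)
samples = [shift_timestep(logit_normal_timestep(rng)) for _ in range(1000)]
```

With shift 3.0 the midpoint t = 0.5 maps to 0.75, so more of the sampled timesteps fall in the noisier part of the schedule.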

Reference

Tan et al., 2025. STORK: Faster Diffusion And Flow Matching Sampling By Resolving Both Stiffness And Structure-Dependence. Paper 2505.24210, published Oct 1, 2025.