
Libraries

How to use breitburg/april-70m-280526 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="breitburg/april-70m-280526", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
# torch_dtype is the argument name in transformers 4.51.0; newer releases also accept dtype
model = AutoModelForCausalLM.from_pretrained("breitburg/april-70m-280526", trust_remote_code=True, torch_dtype="auto")

Local Apps

vLLM

How to use breitburg/april-70m-280526 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "breitburg/april-70m-280526"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "breitburg/april-70m-280526",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
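
The same request can be sent from Python. This is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are illustrative, and the response is assumed to follow the usual OpenAI-compatible shape (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str) -> dict:
    """Build the JSON body for a /v1/chat/completions call."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def chat(base_url: str, model: str, prompt: str) -> str:
    """POST a single-turn chat request and return the assistant's reply text."""
    body = json.dumps(build_chat_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        payload = json.load(resp)
    # Assumed OpenAI-compatible response layout.
    return payload["choices"][0]["message"]["content"]
```

With the server from the previous step running, `chat("http://localhost:8000", "breitburg/april-70m-280526", "What is the capital of France?")` should return the model's answer as a string.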

Use Docker

docker model run hf.co/breitburg/april-70m-280526

SGLang

How to use breitburg/april-70m-280526 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "breitburg/april-70m-280526" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "breitburg/april-70m-280526",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "breitburg/april-70m-280526" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "breitburg/april-70m-280526",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use breitburg/april-70m-280526 with Docker Model Runner:
```
docker model run hf.co/breitburg/april-70m-280526
```

April-70M

Released 28 May 2026.

A ~69M-parameter depth-recurrent chat and tool-use model. It was retrofitted from a vanilla Llama base into a Huginn/Raven-style looped transformer and post-trained at 16K context on a mix of web text, chat, and tool-calling data.

The recurrent core block can be iterated a variable number of times at inference (num_steps), so you can spend more compute on harder inputs without changing the parameter count.

It's a small research model. It handles conversational format, simple facts, and tool-call structure well, but it's not a reliable knowledge base and it can derail on open-ended creative prompts. Treat its factual claims with suspicion.

Architecture

Converted from SupraLabs/Supra-50M-Base with mcleish7/retrofitting-recurrence, which reorganizes the layers into prelude → looped core → coda (4 layers each). The prelude and coda run once; the core runs num_steps times, so effective depth at num_steps=k is roughly 4 + 4k + 4 layers.
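
The depth formula above can be tabulated with a few lines of Python. This is only arithmetic on the 4-4-4 layout described here; the constant and function names are illustrative and do not come from the model code.

```python
PRELUDE_LAYERS = 4  # run once
CORE_LAYERS = 4     # run num_steps times
CODA_LAYERS = 4     # run once


def effective_depth(num_steps: int) -> int:
    """Number of layer applications in one forward pass at a given num_steps."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    return PRELUDE_LAYERS + CORE_LAYERS * num_steps + CODA_LAYERS


for k in (1, 2, 4, 8):
    print(f"num_steps={k}: {effective_depth(k)} layers")
# num_steps=1: 12 layers
# num_steps=2: 16 layers
# num_steps=4: 24 layers
# num_steps=8: 40 layers
```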

The conversion is weight-faithful: at num_steps=1 the converted model reproduces the original Llama's logits exactly. Hidden size is 512, the vocabulary has 32003 entries, and the input and output embeddings are untied.

Context length

Trained at 16K tokens (max_position_embeddings=16384, rope_theta=500000). It will accept somewhat longer inputs via RoPE extrapolation but degrades past about 1.5–2x the trained length.
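
As a rough guard before sending a long prompt, the thresholds above can be turned into a small check. This is a sketch only: the 1.5x cutoff is the lower end of the approximate range quoted here, not a measured boundary, and `context_regime` is a hypothetical helper.

```python
TRAINED_CONTEXT = 16_384  # max_position_embeddings


def context_regime(n_tokens: int) -> str:
    """Classify a prompt length relative to the trained context length."""
    ratio = n_tokens / TRAINED_CONTEXT
    if ratio <= 1.0:
        return "trained"
    if ratio <= 1.5:
        return "extrapolating"
    # Past roughly 1.5x, quality is expected to degrade.
    return "likely degraded"


print(context_regime(12_000))  # trained
print(context_regime(20_000))  # extrapolating
print(context_regime(40_000))  # likely degraded
```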

Tokenizer and chat template

The base tokenizer was extended with the Granite-4 chat-template special tokens (<|start_of_role|>, <|end_of_role|>, <|end_of_text|>). It uses the ibm-granite/granite-4.1-30b chat template, lightly patched so that assistant-token masking includes the trailing EOS; this is what teaches the model to stop. The template supports system/user/assistant/tool roles, tool schemas, and <tool_call> / <tool_response> blocks.

Training data

Roughly 182M tokens from four sources, interleaved and packed into 16K-token rows:

30% openbmb/Ultra-FineWeb (raw text, loss on all tokens)
23.3% Roman1111111/claude-opus-4.6-10000x (chat)
23.3% HuggingFaceTB/everyday-conversations-llama3.1-2k (chat)
23.3% Nanbeige/ToolMind (tool-calling, with tool schemas)

On chat and tool data, loss is computed on assistant tokens only (including EOS); on raw text, loss is on all non-pad tokens. The only text modification was stripping <think>...</think> reasoning blocks and surrounding whitespace. Tool calls and tool responses are kept intact.
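
The preprocessing and masking described above might look roughly like the following. This is a minimal sketch, not the training code: the helper names are hypothetical, the assistant mask is assumed to be supplied by the chat template, and -100 is the ignore index used by PyTorch's cross-entropy loss by default.

```python
import re

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss

# A <think>...</think> block plus any whitespace around it.
THINK_RE = re.compile(r"\s*<think>.*?</think>\s*", flags=re.DOTALL)


def strip_think(text: str) -> str:
    """Remove reasoning blocks and the whitespace surrounding them."""
    return THINK_RE.sub("", text)


def chat_labels(token_ids: list[int], assistant_mask: list[bool]) -> list[int]:
    """Keep labels only where the mask marks assistant tokens (EOS included)."""
    return [t if m else IGNORE_INDEX for t, m in zip(token_ids, assistant_mask)]


def raw_text_labels(token_ids: list[int], pad_id: int) -> list[int]:
    """Keep labels on every token except padding."""
    return [IGNORE_INDEX if t == pad_id else t for t in token_ids]


print(strip_think("<think>plan the answer</think>\n\nParis."))  # Paris.
print(chat_labels([5, 6, 7, 2], [False, False, True, True]))    # [-100, -100, 7, 2]
```

Here the last token stands in for EOS, so it stays in the loss along with the assistant's reply.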

A note on the mix: these percentages are per-example sampling probabilities, so the tool slice is somewhat larger in token terms because ToolMind sessions are longer. The smaller chat sets repeat several times over the run, so some memorization is expected.
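
To see why per-example probabilities and token shares differ, weight each source's sampling probability by its mean example length. The sketch below uses made-up mean lengths purely for illustration; the real per-source lengths are not stated here.

```python
def token_shares(
    sample_probs: dict[str, float], mean_tokens: dict[str, float]
) -> dict[str, float]:
    """Convert per-example sampling probabilities into expected token shares."""
    weights = {name: sample_probs[name] * mean_tokens[name] for name in sample_probs}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Sampling probabilities from the list above; mean lengths are hypothetical.
probs = {"web": 0.30, "opus_chat": 0.233, "everyday_chat": 0.233, "tools": 0.233}
lengths = {"web": 1000, "opus_chat": 1000, "everyday_chat": 1000, "tools": 2000}

for name, share in token_shares(probs, lengths).items():
    print(f"{name}: {share:.1%}")
```

With these invented lengths, the tool source ends up with a larger token share than its 23.3% sampling probability, which is the effect described above.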

Training recipe

Single L4 GPU, AdamW at lr 5e-5, WSD schedule, 16K context. One full epoch (5,550 optimizer updates) at the stable learning rate with a recurrence curriculum ramping mean loops 1 → 4 over the first quarter, then a 1,000-update cooldown warm-started from the update-5,400 checkpoint with the learning rate decayed to near zero. About 215M tokens seen total. Final training loss settled around 2.14, near this model's floor at 69M parameters on a partly repeated corpus.
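
The numbers in the recipe are consistent with each other, which the sketch below checks. Several things in it are assumptions rather than stated facts: two packed 16K rows per optimizer update (inferred from 182M tokens over 5,550 updates), a linear shape for the recurrence ramp, and a linear cooldown that reaches exactly zero.

```python
ROW_TOKENS = 16_384
EPOCH_TOKENS = 182_000_000  # "roughly 182M"
STABLE_UPDATES = 5_550
COOLDOWN_UPDATES = 1_000
PEAK_LR = 5e-5

# About 32.8K tokens per update, i.e. close to two 16K rows (an inference).
tokens_per_update = EPOCH_TOKENS / STABLE_UPDATES
rows_per_update = round(tokens_per_update / ROW_TOKENS)
total_tokens = (STABLE_UPDATES + COOLDOWN_UPDATES) * rows_per_update * ROW_TOKENS


def mean_loops(update: int) -> float:
    """Mean recurrence count, ramping 1 -> 4 over the first quarter (linear shape assumed)."""
    ramp_updates = STABLE_UPDATES / 4
    frac = min(max(update / ramp_updates, 0.0), 1.0)
    return 1.0 + 3.0 * frac


def cooldown_lr(step: int) -> float:
    """Learning rate during the cooldown (linear decay to zero assumed)."""
    frac = min(max(step / COOLDOWN_UPDATES, 0.0), 1.0)
    return PEAK_LR * (1.0 - frac)


print(rows_per_update)             # 2
print(f"{total_tokens / 1e6:.1f}M")  # 214.6M
```

The total of about 214.6M tokens matches the "about 215M tokens seen" figure.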

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# needs transformers==4.51.0 (the version the model was built against)
repo = "breitburg/april-70m-280526"
tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype=torch.float32
).eval()

msgs = [{"role": "user", "content": "What is the capital of France?"}]
# Render the chat template and append the assistant generation prompt
ids = tok.apply_chat_template(msgs, add_generation_prompt=True, return_tensors="pt")
# num_steps sets how many times the recurrent core block is iterated
out = model.generate(ids, max_new_tokens=64, num_steps=4, eos_token_id=tok.eos_token_id)
# Decode only the newly generated tokens, keeping special tokens visible
print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=False))

For tool use, pass tools=[...] (OpenAI-style JSON schemas) to apply_chat_template; the model emits a <tool_call>{...}</tool_call> when it decides to call one. Try num_steps of 1, 2, 4, or 8 to trade speed for depth.
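
A tool schema and a parser for the emitted blocks could look like this. It is a sketch under assumptions: `get_weather` is a made-up tool, and the JSON inside `<tool_call>` is assumed to carry `name` and `arguments` fields, which is not confirmed here. The parser itself only extracts and decodes whatever JSON it finds.

```python
import json
import re

# Hypothetical OpenAI-style tool schema, to be passed as tools=[weather_tool].
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", flags=re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Extract every well-formed JSON payload from <tool_call> blocks."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            calls.append(json.loads(match.group(1).strip()))
        except json.JSONDecodeError:
            continue  # skip malformed payloads rather than fail
    return calls


sample = '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
print(parse_tool_calls(sample))
```

Skipping malformed payloads is a deliberate choice for a model this small, since a truncated or invalid call is a plausible failure mode.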

Limitations

Weak factual recall and reasoning; English only. Long-context performance is limited by the training data (most documents are far shorter than 16K), not just by RoPE. Tool-call triggering can over- or under-fire on out-of-distribution prompts.

Lineage

SupraLabs/Supra-50M-Base → retrofit to depth-recurrent (4-4-4) → extend tokenizer and resize embeddings → post-train at 16K on the mix above.

Safetensors: 68.7M params, tensor type F32
