April-70M

Released 28 May 2026.

A ~69M-parameter depth-recurrent chat and tool-use model. It was retrofitted from a vanilla Llama base into a Huginn/Raven-style looped transformer and post-trained at 16K context on a mix of web text, chat, and tool-calling data.

The recurrent core block can be iterated a variable number of times at inference (num_steps), so you can spend more compute on harder inputs without changing the parameter count.

It's a small research model. It handles conversational format, simple facts, and tool-call structure well, but it's not a reliable knowledge base and it can derail on open-ended creative prompts. Treat its factual claims with suspicion.

Architecture

Converted from SupraLabs/Supra-50M-Base with mcleish7/retrofitting-recurrence, which reorganizes the layers into prelude → looped core → coda (4 layers each). The prelude and coda run once; the core runs num_steps times, so effective depth at num_steps=k is roughly 4 + 4k + 4 layers.
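The depth arithmetic above can be written out as a small helper. This is a minimal sketch: the name `effective_depth` and the constants are illustrative, taken from the 4-4-4 layout described here rather than from the model's own code.

```python
# Layer counts from the 4-4-4 layout described above.
PRELUDE_LAYERS = 4
CORE_LAYERS = 4
CODA_LAYERS = 4


def effective_depth(num_steps: int) -> int:
    """Count the transformer layers applied in one forward pass (hypothetical helper)."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    # Prelude and coda run once; the core block is repeated num_steps times.
    return PRELUDE_LAYERS + CORE_LAYERS * num_steps + CODA_LAYERS


for k in (1, 2, 4, 8):
    print(f"num_steps={k}: {effective_depth(k)} layers")
```

This gives 12 layers at num_steps=1 and 40 at num_steps=8, with the parameter count unchanged.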

The conversion is weight-faithful: at num_steps=1 it reproduces the original Llama's logits exactly. Hidden size 512, 32003 vocab, untied embeddings.

Context length

Trained at 16K tokens (max_position_embeddings=16384, rope_theta=500000). It will accept somewhat longer inputs via RoPE extrapolation, but quality degrades past about 1.5–2x the trained length (roughly 24K–32K tokens).
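One way to guard against over-long prompts is to classify the token count before generating. The sketch below is an assumption-laden illustration: `context_status` is a hypothetical helper, and it uses the conservative 1.5x end of the range quoted above as its cutoff.

```python
TRAINED_CONTEXT = 16384                    # max_position_embeddings from this card
SOFT_LIMIT = int(1.5 * TRAINED_CONTEXT)    # 24576; conservative end of the 1.5-2x range


def context_status(n_tokens: int) -> str:
    """Classify a prompt length against the trained context (hypothetical helper)."""
    if n_tokens <= TRAINED_CONTEXT:
        return "within trained length"
    if n_tokens <= SOFT_LIMIT:
        return "extrapolating"
    return "likely degraded"


for n in (1_000, 20_000, 40_000):
    print(n, context_status(n))
```

In practice `n_tokens` would be the length of the tokenized prompt plus the planned number of new tokens.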

Tokenizer and chat template

The base tokenizer was extended with the Granite-4 chat-template special tokens (<|start_of_role|>, <|end_of_role|>, <|end_of_text|>). The model uses the ibm-granite/granite-4.1-30b chat template, lightly patched so that assistant-token masking includes the trailing EOS, which is what teaches the model to stop. The template supports system/user/assistant/tool roles, tool schemas, and <tool_call> / <tool_response> blocks.

Training data

Roughly 182M tokens, four sources interleaved and packed into 16K-token rows:

On chat and tool data, loss is computed on assistant tokens only (including EOS); on raw text, loss is on all non-pad tokens. The only text modification was stripping <think>...</think> reasoning blocks and surrounding whitespace. Tool calls and tool responses are kept intact.
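The two preprocessing steps described above can be sketched in plain Python. This is not the training code: `strip_think` and `build_labels` are hypothetical helpers, the token ids are made up, and the use of -100 as the ignored label assumes the usual PyTorch cross-entropy convention.

```python
import re

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

# Matches a <think>...</think> block together with the whitespace around it.
THINK_RE = re.compile(r"\s*<think>.*?</think>\s*", re.DOTALL)


def strip_think(text: str) -> str:
    """Remove reasoning blocks and their surrounding whitespace (hypothetical helper)."""
    return THINK_RE.sub("", text)


def build_labels(input_ids, assistant_mask):
    """Keep labels only where assistant_mask is set; ignore every other position."""
    if len(input_ids) != len(assistant_mask):
        raise ValueError("input_ids and assistant_mask must have the same length")
    return [tok if keep else IGNORE_INDEX for tok, keep in zip(input_ids, assistant_mask)]


cleaned = strip_think("<think>\nplan\n</think>\n\nParis.")

# Made-up ids: three prompt tokens, two assistant tokens, then EOS (id 2 here).
ids = [11, 12, 13, 21, 22, 2]
mask = [0, 0, 0, 1, 1, 1]  # the trailing EOS is inside the assistant span
labels = build_labels(ids, mask)
print(cleaned, labels)
```

Keeping the EOS position unmasked is the detail that matters: if it were ignored, the model would never be trained to emit it.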

A note on the mix: the proportions are per-example sampling probabilities, so the tool slice is somewhat larger in token terms because ToolMind sessions are longer. The smaller chat sets repeat several times over the run, so some memorization is expected.
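The gap between example share and token share is easy to see with a small calculation. The numbers below are invented purely for illustration and are not this model's actual mix; `token_shares` is a hypothetical helper.

```python
def token_shares(sampling_probs, mean_tokens):
    """Convert per-example sampling probabilities into expected token shares."""
    weights = {name: sampling_probs[name] * mean_tokens[name] for name in sampling_probs}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# ASSUMPTION: illustrative values only, not the real sampling mix or session lengths.
probs = {"tool": 0.25, "chat": 0.75}
mean_len = {"tool": 3000, "chat": 1000}

shares = token_shares(probs, mean_len)
print(shares)
```

With these made-up inputs, a source sampled a quarter of the time contributes half of the tokens because its examples are three times longer.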

Training recipe

Single L4 GPU, AdamW at lr 5e-5, WSD (warmup-stable-decay) schedule, 16K context. One full epoch (5,550 optimizer updates) ran at the stable learning rate, with a recurrence curriculum ramping the mean number of loops from 1 to 4 over the first quarter. A 1,000-update cooldown followed, warm-started from the update-5,400 checkpoint, with the learning rate decayed to near zero. About 215M tokens seen in total. Final training loss settled around 2.14, near this model's floor at 69M parameters on a partly repeated corpus.
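The recipe's numbers can be sanity-checked with a short sketch. Several things here are assumptions rather than facts from the training run: the batch of two packed 16K rows per update, the linear shape of the loop ramp, and the linear shape of the cooldown. All function names are hypothetical.

```python
STABLE_UPDATES = 5550
COOLDOWN_UPDATES = 1000
PEAK_LR = 5e-5
TOKENS_PER_UPDATE = 2 * 16384  # ASSUMPTION: two packed 16K-token rows per optimizer update


def mean_loops(update: int) -> float:
    """Mean recurrence count, ramping 1 -> 4 over the first quarter (linear ramp assumed)."""
    ramp_end = STABLE_UPDATES / 4
    frac = min(max(update / ramp_end, 0.0), 1.0)
    return 1.0 + 3.0 * frac


def cooldown_lr(cooldown_update: int) -> float:
    """Learning rate during cooldown, decaying from PEAK_LR to zero (linear decay assumed)."""
    frac = min(max(cooldown_update / COOLDOWN_UPDATES, 0.0), 1.0)
    return PEAK_LR * (1.0 - frac)


epoch_tokens = STABLE_UPDATES * TOKENS_PER_UPDATE
total_tokens = (STABLE_UPDATES + COOLDOWN_UPDATES) * TOKENS_PER_UPDATE
print(f"epoch: {epoch_tokens / 1e6:.1f}M tokens, total: {total_tokens / 1e6:.1f}M tokens")
```

Under the assumed batch size, the epoch comes to about 181.9M tokens and all 6,550 updates to about 214.6M, which lines up with the roughly 182M and 215M figures quoted in this card.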

Usage

# Requires transformers==4.51.0 (the version the model was built against).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True lets transformers load the custom code shipped in the repo.
tok = AutoTokenizer.from_pretrained("breitburg/april-70m", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "breitburg/april-70m", trust_remote_code=True, torch_dtype=torch.float32
).eval()

# Render the conversation with the chat template and open an assistant turn.
msgs = [{"role": "user", "content": "What is the capital of France?"}]
ids = tok.apply_chat_template(msgs, add_generation_prompt=True, return_tensors="pt")

# num_steps sets how many times the recurrent core is iterated.
out = model.generate(ids, max_new_tokens=64, num_steps=4, eos_token_id=tok.eos_token_id)

# Decode only the newly generated tokens, keeping special tokens visible.
print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=False))

For tool use, pass tools=[...] (OpenAI-style JSON schemas) to apply_chat_template; the model emits a <tool_call>{...}</tool_call> when it decides to call one. Try num_steps of 1, 2, 4, or 8 to trade speed for depth.
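On the receiving side, the emitted block has to be pulled out of the generated text and parsed. The following is a minimal sketch under stated assumptions: the `get_weather` schema is a made-up example, the `name`/`arguments` payload shape is assumed rather than confirmed by this card, and `sample` is a hand-written string, not real model output.

```python
import json
import re

# Hypothetical OpenAI-style tool schema, of the kind passed as tools=[...].
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> list:
    """Return every JSON object found inside <tool_call>...</tool_call> blocks."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            continue  # a small model can emit malformed JSON; skip it rather than crash
    return calls


# Hand-written stand-in for decoded model output (ASSUMPTION: payload shape).
sample = '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
calls = extract_tool_calls(sample)
print(calls)
```

Validating the parsed name and arguments against the schema before executing anything is worthwhile, since tool-call triggering from a model this size is not fully reliable.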

Limitations

Weak factual recall and reasoning; English-only. Long-context performance is limited by the training data (most documents are far shorter than 16K), not just by RoPE. Tool-call triggering can over- or under-fire on out-of-distribution prompts.

Lineage

SupraLabs/Supra-50M-Base → retrofit to depth-recurrent (4-4-4) → extend tokenizer and resize embeddings → post-train at 16K on the mix above.
