AETHER-Micro 0.5B (Phase 1 Checkpoint)

AETHER-Micro is an experimental Mixture-of-Experts (MoE) language model with 2.08B total parameters, of which about 0.5B are active per token.

Model Details

Item	Value
Architecture	MoE big.LITTLE + LTL + MTP
Total Parameters	2.08B
Active Parameters	~0.5B per token
Hidden Size	1024
Layers	24
Attention	GQA (16 query heads, 4 KV heads)
Experts	5 Big + 15 Small + 2 Shared
Vocab Size	64,000 (Korean + English + Code)
Context Length	8,192 (RoPE)
Training Step	57,000 / 100,000
Training Loss	~3.54
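
A few quantities follow from the table by simple arithmetic. The sketch below assumes the per-head dimension equals hidden size divided by the number of query heads, which the card does not state, and it counts FP32 weights only (no activations, optimizer state, or KV cache). The function name `derived_stats` is illustrative.

```python
# Figures copied from the Model Details table.
HIDDEN_SIZE = 1024
NUM_QUERY_HEADS = 16
NUM_KV_HEADS = 4
TOTAL_PARAMS = 2.08e9
ACTIVE_PARAMS = 0.5e9   # approximate, per token
BYTES_PER_FP32 = 4


def derived_stats():
    """Derive rough architecture figures from the published numbers."""
    return {
        # Assumption: head_dim = hidden_size / num_query_heads.
        "head_dim": HIDDEN_SIZE // NUM_QUERY_HEADS,
        # In GQA, each KV head is shared by this many query heads.
        "queries_per_kv_head": NUM_QUERY_HEADS // NUM_KV_HEADS,
        # Share of all parameters used for a single token.
        "active_fraction": ACTIVE_PARAMS / TOTAL_PARAMS,
        # Weights only, stored in FP32, in decimal gigabytes.
        "fp32_weight_gb": TOTAL_PARAMS * BYTES_PER_FP32 / 1e9,
    }


if __name__ == "__main__":
    for name, value in derived_stats().items():
        print(f"{name}: {value}")
```

Under these assumptions, roughly a quarter of the parameters are active per token, and the FP32 weights alone need a little over 8 GB of memory.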

Architecture Features

big.LITTLE MoE: 5 large experts (intermediate size 2048), 15 small experts (intermediate size 1024), and 2 shared experts that are always active
Latent Thought Layer (LTL): K-step latent reasoning (K = 0, 1, or 2) via Gumbel-Softmax selection
Multi-Token Prediction (MTP): predicts 4 tokens ahead, replacing the standard next-token prediction (NTP) loss
Wu-Xing Router: expert routing inspired by the five elements (Wu Xing)
Quality Head: 4-dimensional quality assessment
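
To make the Gumbel-Softmax selection concrete, here is a minimal sketch of how a number of latent steps K in {0, 1, 2} could be sampled from three logits. It is a generic illustration of the technique, not the model's actual code: the function names, the temperature `tau`, and the assumption that the selection is over K are all hypothetical. In training, a hard choice like this is usually paired with a straight-through gradient estimator, which is omitted here.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return a relaxed one-hot sample over len(logits) categories."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def select_latent_steps(logits, tau=1.0, rng=random):
    """Pick K as the index of the largest relaxed probability."""
    probs = gumbel_softmax(logits, tau, rng)
    k = max(range(len(probs)), key=probs.__getitem__)
    return k, probs


if __name__ == "__main__":
    rng = random.Random(0)  # seeded for repeatable sampling
    k, probs = select_latent_steps([0.1, 0.5, -0.2], tau=0.5, rng=rng)
    print("K =", k, "probs =", [round(p, 3) for p in probs])
```

Lower values of `tau` push the sample closer to a one-hot vector, while higher values make it closer to uniform.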

Training

Phase: 1 of 3 (57% complete, step 57,000 of 100,000)
Data: 13.1B tokens (Korean 22%, English 25%, Code 21%, Math 24%, Dialogue 8%)
Optimizer: AdamW (lr=1e-4, cosine decay)
Precision: FP32
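
As a rough illustration of the learning-rate schedule, the sketch below implements a plain cosine decay from the stated peak of 1e-4. The card does not give a warmup length, a minimum learning rate, or the step count the decay spans, so the values here (no warmup, decay to zero over 100,000 steps) are assumptions, and `cosine_lr` is a hypothetical helper.

```python
import math

PEAK_LR = 1e-4          # from the optimizer settings
TOTAL_STEPS = 100_000   # assumption: decay spans the 100,000 listed steps
MIN_LR = 0.0            # assumption: not stated in the card
WARMUP_STEPS = 0        # assumption: not stated in the card


def cosine_lr(step, peak_lr=PEAK_LR, total_steps=TOTAL_STEPS,
              min_lr=MIN_LR, warmup_steps=WARMUP_STEPS):
    """Learning rate at `step` under linear warmup plus cosine decay."""
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(max(progress, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


if __name__ == "__main__":
    for step in (0, 25_000, 57_000, 100_000):
        print(step, f"{cosine_lr(step):.3e}")
```

With these assumed settings, the learning rate at the checkpoint's step (57,000) has already fallen below half of its peak value.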

Usage

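from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True lets transformers run the custom model code
# shipped in the repository; review that code before enabling it.
model = AutoModelForCausalLM.from_pretrained(
    "Be2Jay/AETHER-Micro-0.5B",
    trust_remote_code=True
)
# Load the matching tokenizer from the same repository.
tokenizer = AutoTokenizer.from_pretrained("Be2Jay/AETHER-Micro-0.5B")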

Note: This is a Phase 1 training checkpoint. The model is still in early training and not yet suitable for production use.

License

Apache 2.0
