eGPT-100M-qwen3-bytes-untrained

A randomly initialized, decoder-only Qwen3ForCausalLM model paired with the google/byt5-small tokenizer. The weights have not been trained.

All four eGPT-100M-qwen3 variants share the same 113.27M non-embedding parameters. Only the embedding and lm_head matrices differ between variants, because their size depends on each tokenizer's vocab size.

Architecture

Field	Value
Total Parameters	113.67M
Non-Embedding Params	113.27M
Layers	16
Hidden Size	768
Attention Heads (Q)	12
Attention Heads (KV)	12
Head Dim	64
Intermediate Size (FFN)	2048
Max Seq Len	1024
Vocab Size	256
Tokenizer	`google/byt5-small`
Activation	SwiGLU (silu)
Positional Encoding	RoPE (θ=10000.0)
QK-Norm	✅ per-head RMSNorm
Sliding Window	False
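
The parameter counts in the table can be reproduced from the other fields. The following is a minimal sketch, not the authors' own accounting: it assumes bias-free linear layers, two RMSNorm weight vectors per block plus one final norm, per-head QK-Norm weights of size Head Dim, and untied input embedding and lm_head matrices. Under those assumptions the arithmetic matches the table.

```python
# Values copied from the Architecture table.
LAYERS = 16
HIDDEN = 768
Q_HEADS = 12
KV_HEADS = 12
HEAD_DIM = 64
FFN = 2048
VOCAB = 256


def non_embedding_params() -> int:
    # Attention projections (assumed bias-free): Q, K, V, and output.
    q_proj = HIDDEN * Q_HEADS * HEAD_DIM
    kv_proj = 2 * HIDDEN * KV_HEADS * HEAD_DIM
    o_proj = Q_HEADS * HEAD_DIM * HIDDEN
    # SwiGLU feed-forward: gate, up, and down projections.
    mlp = 3 * HIDDEN * FFN
    # Two RMSNorm weight vectors per block, plus q/k QK-Norm weights.
    norms = 2 * HIDDEN + 2 * HEAD_DIM
    per_layer = q_proj + kv_proj + o_proj + mlp + norms
    # One final RMSNorm after the last block (assumption).
    return LAYERS * per_layer + HIDDEN


def embedding_params(tied: bool = False) -> int:
    # Input embedding table, plus a separate lm_head when weights are untied.
    table = VOCAB * HIDDEN
    return table if tied else 2 * table


non_emb = non_embedding_params()
total = non_emb + embedding_params(tied=False)
print(f"non-embedding: {non_emb / 1e6:.2f}M")  # 113.27M
print(f"total:         {total / 1e6:.2f}M")    # 113.67M
```

With tied embeddings the total would come to about 113.47M instead, so the 113.67M figure suggests the embedding and lm_head matrices are stored separately.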

Key Qwen3 vs LLaMA difference

Unlike LLaMA, Qwen3 applies QK-Norm: a per-head RMSNorm on the query and key vectors before the attention dot product. Normalizing queries and keys keeps the magnitude of the attention logits in check, which helps guard against attention entropy collapse early in training and tends to give more stable loss curves. This is especially useful when training from scratch.
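
To make the effect concrete, here is a small pure-Python sketch of QK-Norm for a single head and a single query/key pair. It is an illustration only, not the Qwen3 implementation: the function names, the epsilon value, the all-ones gain, and the exaggerated input scale are assumptions chosen for the example, and RoPE is left out.

```python
import math
import random

HEAD_DIM = 64  # from the Architecture table


def rms_norm(x, weight, eps=1e-6):
    """Scale x to unit root-mean-square, then apply a per-channel gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def attention_logit(q, k, q_weight, k_weight):
    """Normalize query and key per head, then take the scaled dot product."""
    q_hat = rms_norm(q, q_weight)
    k_hat = rms_norm(k, k_weight)
    return sum(a * b for a, b in zip(q_hat, k_hat)) / math.sqrt(len(q))


random.seed(0)
ones = [1.0] * HEAD_DIM  # assumed gain at initialization
# Deliberately large-magnitude vectors to mimic a badly scaled activation.
q = [random.gauss(0.0, 50.0) for _ in range(HEAD_DIM)]
k = [random.gauss(0.0, 50.0) for _ in range(HEAD_DIM)]

raw = sum(a * b for a, b in zip(q, k)) / math.sqrt(HEAD_DIM)
normed = attention_logit(q, k, ones, ones)
bound = math.sqrt(HEAD_DIM)  # Cauchy-Schwarz bound when the gains are all ones
print(f"raw logit: {raw:.1f}  QK-Norm logit: {normed:.3f}  bound: {bound:.1f}")
```

With unit gains, each normalized vector has length at most sqrt(Head Dim), so the scaled logit cannot exceed sqrt(64) = 8 in absolute value regardless of how large the raw projections are. Learned gains can loosen this bound during training.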

Loading

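from transformers import AutoModelForCausalLM, AutoTokenizer

# Byte-level tokenizer files hosted alongside the model.
tok   = AutoTokenizer.from_pretrained("LLMsHub/eGPT-100M-qwen3-bytes-untrained")
# Weights are the random initialization; outputs are not meaningful until trained.
model = AutoModelForCausalLM.from_pretrained("LLMsHub/eGPT-100M-qwen3-bytes-untrained")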

eGPT-100M-qwen3 Family

Variant	Tokenizer	Vocab Size
llama3	meta-llama/Llama-3.1-8B-Instruct	128,256
gpt2	openai-community/gpt2-large	50,257
llemma	EleutherAI/llemma_7b	32,000
bytes	google/byt5-small	256
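
As a rough illustration of how vocab size drives the size difference between variants, the sketch below estimates embedding and total parameters for each row. Only the bytes total (113.67M) is stated in this card; the other totals are estimates that assume a hidden size of 768 and untied embedding and lm_head matrices for every variant. `NON_EMBEDDING` is an assumed exact count that rounds to the card's 113.27M.

```python
HIDDEN = 768
# Assumption: exact non-embedding count that rounds to the card's 113.27M.
NON_EMBEDDING = 113_273_600

# Vocab sizes copied from the family table.
VOCAB_SIZES = {
    "llama3": 128_256,
    "gpt2": 50_257,
    "llemma": 32_000,
    "bytes": 256,
}


def embedding_params(vocab: int, tied: bool = False) -> int:
    # One vocab x hidden table, doubled when lm_head is a separate matrix.
    table = vocab * HIDDEN
    return table if tied else 2 * table


estimates = {}
for name, vocab in VOCAB_SIZES.items():
    emb = embedding_params(vocab)
    estimates[name] = (emb, NON_EMBEDDING + emb)
    print(f"{name:<7} embeddings {emb / 1e6:7.2f}M   total {estimates[name][1] / 1e6:7.2f}M")
```

Under these assumptions the embedding matrices dominate the larger-vocab variants, while for the bytes variant they contribute well under 1M parameters.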

Safetensors

Field	Value
Model size	0.1B params
Tensor type	F32
