eGPT-100M-qwen3-bytes-untrained

A randomly initialized, decoder-only Qwen3ForCausalLM model paired with the google/byt5-small tokenizer. The weights have not been trained.

All four eGPT-100M-qwen3 variants share the same 113.27M non-embedding parameters. Only the embedding and lm_head matrices differ between variants, because their size depends on each tokenizer's vocabulary size.

Architecture

| Field | Value |
| --- | --- |
| Total Parameters | 113.67M |
| Non-Embedding Params | 113.27M |
| Layers | 16 |
| Hidden Size | 768 |
| Attention Heads (Q) | 12 |
| Attention Heads (KV) | 12 |
| Head Dim | 64 |
| Intermediate Size (FFN) | 2048 |
| Max Seq Len | 1024 |
| Vocab Size | 256 |
| Tokenizer | google/byt5-small |
| Activation | SwiGLU (silu) |
| Positional Encoding | RoPE (θ=10000.0) |
| QK-Norm | ✅ per-head RMSNorm |
| Sliding Window | False |
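
The parameter totals can be checked by hand from the architecture values listed here. The following is a minimal sketch, assuming no bias terms in the linear layers and separate (untied) embedding and lm_head matrices; neither assumption is stated in this card, but together they reproduce the 113.27M and 113.67M figures.

```python
# Architecture values copied from the Architecture section of this card.
HIDDEN = 768
LAYERS = 16
Q_HEADS = 12
KV_HEADS = 12
HEAD_DIM = 64
FFN = 2048
VOCAB = 256

# Assumptions (not stated in this card): linear layers have no bias terms,
# and the input embedding and lm_head are separate (untied) matrices.
attn = (
    HIDDEN * Q_HEADS * HEAD_DIM           # q_proj
    + 2 * HIDDEN * KV_HEADS * HEAD_DIM    # k_proj and v_proj
    + Q_HEADS * HEAD_DIM * HIDDEN         # o_proj
)
mlp = 3 * HIDDEN * FFN                    # gate_proj, up_proj, down_proj
norms = 2 * HIDDEN + 2 * HEAD_DIM         # two block norms + q_norm and k_norm gains

non_embedding = LAYERS * (attn + mlp + norms) + HIDDEN  # + final norm gain
embedding = 2 * VOCAB * HIDDEN                          # embed_tokens + lm_head
total = non_embedding + embedding

print(f"non-embedding: {non_embedding / 1e6:.2f}M")
print(f"total:         {total / 1e6:.2f}M")
```

With a 256-entry vocabulary the two embedding matrices contribute only about 0.39M parameters, which is why the total and non-embedding counts are so close for this variant.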

Key Qwen3 vs LLaMA difference

Qwen3 applies QK-Norm (a per-head RMSNorm on the query and key vectors) before the attention dot-product, which LLaMA does not. Normalizing the queries and keys limits how large the attention logits can grow, which helps avoid attention entropy collapse early in training and tends to give more stable loss curves. This is especially useful when training from scratch.
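
To make the effect concrete, here is a dependency-free sketch of per-head RMSNorm applied to a single query/key pair. It is an illustration, not the Qwen3 implementation: the helper names `rms_norm` and `attention_logit` are hypothetical, the gain vector is assumed to be all ones (as it would be at initialization), and RoPE is left out.

```python
import math
import random

HEAD_DIM = 64  # head dimension from the Architecture section


def rms_norm(vec, gain, eps=1e-6):
    """RMSNorm over one head's vector: x / sqrt(mean(x^2) + eps) * gain."""
    mean_sq = sum(x * x for x in vec) / len(vec)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [x * inv_rms * g for x, g in zip(vec, gain)]


def attention_logit(q, k):
    """Scaled dot-product logit for a single query/key pair."""
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


rng = random.Random(0)
# Deliberately large-magnitude vectors, standing in for badly scaled activations.
q = [rng.gauss(0.0, 50.0) for _ in range(HEAD_DIM)]
k = [rng.gauss(0.0, 50.0) for _ in range(HEAD_DIM)]
gain = [1.0] * HEAD_DIM  # assumption: learnable gain still at its initial value

raw_logit = attention_logit(q, k)
normed_logit = attention_logit(rms_norm(q, gain), rms_norm(k, gain))

# With unit gain each normalized vector has norm at most sqrt(HEAD_DIM), so by
# Cauchy-Schwarz the scaled logit cannot exceed sqrt(HEAD_DIM) in magnitude.
bound = math.sqrt(HEAD_DIM)
print(f"raw logit:    {raw_logit:.2f}")
print(f"normed logit: {normed_logit:.2f} (|logit| <= {bound:.1f})")
```

Under these assumptions the normalized logit is bounded by sqrt(64) = 8 regardless of how large the raw activations are. Once the gains are trained the bound changes, but the logit scale is then set by a small set of learned gain parameters rather than by the magnitude of the activations.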

Loading

from transformers import AutoModelForCausalLM, AutoTokenizer

tok   = AutoTokenizer.from_pretrained("LLMsHub/eGPT-100M-qwen3-bytes-untrained")
model = AutoModelForCausalLM.from_pretrained("LLMsHub/eGPT-100M-qwen3-bytes-untrained")
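
Because the tokenizer is byte-level, the IDs it is expected to produce can be previewed without downloading anything. The sketch below assumes ByT5's convention of reserving IDs 0, 1 and 2 for pad, eos and unk, mapping each UTF-8 byte `b` to ID `b + 3`, and appending eos. `byt5_style_ids` is a hypothetical helper written for illustration, not part of transformers.

```python
# Assumed ByT5 convention: IDs 0..2 are pad/eos/unk, bytes start at ID 3.
BYTE_OFFSET = 3
EOS_ID = 1


def byt5_style_ids(text, add_eos=True):
    """Map text to byte-level IDs under the assumed ByT5 convention."""
    ids = [b + BYTE_OFFSET for b in text.encode("utf-8")]
    if add_eos:
        ids.append(EOS_ID)
    return ids


ids = byt5_style_ids("héllo")
print(ids)
print(len("héllo"), "characters ->", len(ids), "IDs (including eos)")
```

Non-ASCII characters expand to several IDs each, so sequences are longer than with a subword tokenizer, which matters with a 1024-token context. Valid UTF-8 never contains a byte above 0xF4, so under this convention ordinary text stays below ID 248, inside a 256-entry vocabulary.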

eGPT-100M-qwen3 Family

| Variant | Tokenizer | Vocab Size |
| --- | --- | --- |
| llama3 | meta-llama/Llama-3.1-8B-Instruct | 128,256 |
| gpt2 | openai-community/gpt2-large | 50,257 |
| llemma | EleutherAI/llemma_7b | 32,000 |
| bytes | google/byt5-small | 256 |
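
Since only the embedding and lm_head matrices change between variants, the effect of vocabulary size on the total parameter count is simple arithmetic. This is a rough sketch, assuming every variant uses untied embedding and lm_head matrices of shape vocab × 768 and that 113,273,600 is the unrounded form of the 113.27M non-embedding figure; the totals it prints are estimates under those assumptions, not numbers reported by this card.

```python
HIDDEN = 768
NON_EMBEDDING = 113_273_600  # assumption: unrounded form of the 113.27M figure

# Vocabulary sizes from the family listing.
VOCAB_SIZES = {
    "llama3": 128_256,
    "gpt2": 50_257,
    "llemma": 32_000,
    "bytes": 256,
}


def embedding_params(vocab_size, hidden=HIDDEN):
    """Parameters in embed_tokens plus lm_head, assuming they are untied."""
    return 2 * vocab_size * hidden


totals = {name: NON_EMBEDDING + embedding_params(v) for name, v in VOCAB_SIZES.items()}

for name, vocab in VOCAB_SIZES.items():
    emb = embedding_params(vocab)
    print(f"{name:>7}: embeddings {emb / 1e6:7.2f}M, estimated total {totals[name] / 1e6:7.2f}M")
```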

Weights are stored as Safetensors in F32 (model size: 0.1B params).