eGPT-100M-qwen3-bytes-untrained
Randomly initialized Qwen3ForCausalLM decoder-only model with a google/byt5-small tokenizer.
The weights are untrained, so outputs are random until the model is trained.
All four eGPT-100M-qwen3 variants have the same non-embedding parameter count (113.27M). Only the embedding and lm_head matrices differ between variants, because their size depends on each tokenizer's vocabulary size.
Architecture
| Field | Value |
|---|---|
| Total Parameters | 113.67M |
| Non-Embedding Params | 113.27M |
| Layers | 16 |
| Hidden Size | 768 |
| Attention Heads (Q) | 12 |
| Attention Heads (KV) | 12 |
| Head Dim | 64 |
| Intermediate Size (FFN) | 2048 |
| Max Seq Len | 1024 |
| Vocab Size | 256 |
| Tokenizer | google/byt5-small |
| Activation | SwiGLU (silu) |
| Positional Encoding | RoPE (θ=10000.0) |
| QK-Norm | ✅ per-head RMSNorm |
| Sliding Window | False |
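The parameter totals in the table can be reproduced from the other fields. The following is a minimal sketch, not the author's accounting script: it assumes no bias terms, untied embedding and lm_head matrices, and one weight vector per RMSNorm (including the per-head QK-Norm weights). None of these assumptions are stated in the card, but under them the arithmetic matches both reported totals.

```python
# Parameter count sketch for the configuration in the table above.
# Assumptions (not stated in the model card): no bias terms, untied
# embedding and lm_head matrices, one RMSNorm weight vector per norm.

hidden = 768
layers = 16
q_heads = 12
kv_heads = 12
head_dim = 64
ffn = 2048
vocab = 256

# Attention projections: Q and O use q_heads, K and V use kv_heads.
attn = 2 * hidden * q_heads * head_dim + 2 * hidden * kv_heads * head_dim
# SwiGLU feed-forward: gate, up, and down projections.
mlp = 3 * hidden * ffn
# Two block-level RMSNorms plus QK-Norm weights of size head_dim each.
norms = 2 * hidden + 2 * head_dim

per_layer = attn + mlp + norms
non_embedding = layers * per_layer + hidden  # + final RMSNorm
embedding = 2 * vocab * hidden               # embed_tokens + lm_head (assumed untied)
total = non_embedding + embedding

print(f"non-embedding: {non_embedding / 1e6:.2f}M")
print(f"total:         {total / 1e6:.2f}M")
```

Changing `vocab` alone changes only the `embedding` term, which is why the variants differ only in their embedding and lm_head sizes.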
Key Qwen3 vs LLaMA difference
Qwen3 applies QK-Norm (per-head RMSNorm on the query and key vectors) before the attention dot-product. Normalizing queries and keys keeps the scale of the attention logits under control, which helps guard against attention entropy collapse during early training and tends to give more stable loss curves. This is especially useful when training from scratch.
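To illustrate the idea, here is a minimal pure-Python sketch of QK-Norm for a single query/key pair in one head. It is not the Qwen3 implementation: the helper names `rms_norm` and `attention_logit` are hypothetical, the `eps` value of 1e-6 and the all-ones norm weights are assumptions, and RoPE is omitted for brevity.

```python
import math

def rms_norm(vec, weight, eps=1e-6):
    """RMSNorm over one head's vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(x * x for x in vec) / len(vec)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [x * scale * w for x, w in zip(vec, weight)]

def attention_logit(q, k, q_weight, k_weight):
    """Scaled dot product of one query/key pair after QK-Norm."""
    q_n = rms_norm(q, q_weight)
    k_n = rms_norm(k, k_weight)
    return sum(a * b for a, b in zip(q_n, k_n)) / math.sqrt(len(q))

head_dim = 64
ones = [1.0] * head_dim  # norm weights, assumed to start at 1.0

# Arbitrary illustrative vectors (hypothetical values, not model activations).
q = [0.5 * (i % 7 - 3) for i in range(head_dim)]
k = [0.25 * (i % 5 - 2) for i in range(head_dim)]

small = attention_logit(q, k, ones, ones)
# Scaling q and k by 100 leaves the normalized logit almost unchanged.
large = attention_logit([100 * x for x in q], [100 * x for x in k], ones, ones)
print(small, large)
```

Without the normalization step, scaling both vectors by 100 would multiply the logit by 10,000. With it, and with unit norm weights, the logit magnitude stays at or below sqrt(head_dim) regardless of how large the raw projections grow.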
Loading
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("LLMsHub/eGPT-100M-qwen3-bytes-untrained")
model = AutoModelForCausalLM.from_pretrained("LLMsHub/eGPT-100M-qwen3-bytes-untrained")
eGPT-100M-qwen3 Family