This project focused on deepening my theoretical understanding of Transformer-based models such as BERT and RoBERTa, and on applying that knowledge to build a small but complete model: MiniRoBERTa.
The model has 17.7 million parameters and was pretrained for 10 epochs on the WikiText dataset. Because resources were limited, pretraining was relatively lightweight, so the base model is not yet highly capable. Even so, the results indicate that it learned meaningful representations and can correctly predict masked tokens in certain contexts.
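As an illustration of that kind of spot check, the snippet below runs a fill-mask query. It assumes the checkpoint and tokenizer are exported in a Hugging Face-compatible format under a hypothetical `checkpoints/miniroberta` path; the repo's own loading code may differ.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Hypothetical path: wherever the pretrained MiniRoBERTa checkpoint is saved.
ckpt = "checkpoints/miniroberta"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForMaskedLM.from_pretrained(ckpt).eval()

text = f"The capital of France is {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

# Top-5 predictions for the masked position
mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
top5 = logits[0, mask_idx].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top5.tolist()))
```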
Model architecture:
- Attention head dim: 32
- Hidden size: 256
- Max sequence length: 128
- Num heads: 8
- Vocab size: 50265
- Num Transformer layers: 6
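These dimensions are consistent with the reported 17.7M parameters. The estimate below assumes two details not listed above: a RoBERTa-style feed-forward block with intermediate size 4 × hidden size, and an LM head whose decoder weights are tied to the token embeddings.

```python
# Rough parameter count from the architecture listed above.
# Assumptions (not stated in the config): feed-forward intermediate size
# of 4 * hidden_size, and an LM head tied to the token-embedding matrix.
hidden, layers, vocab, max_len = 256, 6, 50265, 128
ffn = 4 * hidden  # assumed intermediate size

embeddings = vocab * hidden + max_len * hidden + 2 * hidden  # tokens + positions + LayerNorm
per_layer = (
    4 * (hidden * hidden + hidden)  # Q, K, V, and output projections (+ biases)
    + (hidden * ffn + ffn)          # feed-forward up-projection
    + (ffn * hidden + hidden)       # feed-forward down-projection
    + 2 * 2 * hidden                # two LayerNorms (weight + bias each)
)
total = embeddings + layers * per_layer
print(f"~{total / 1e6:.1f}M parameters")  # ~17.6M; the small remainder vs. 17.7M is the MLM head
```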
Pretraining configuration:
- Optimizer: Adam with betas = (0.98, 0.999), ε = 1e-6
- Learning Rate: 1e-6
- Scheduler: Cosine decay with 10% warmup
- Weight Decay: 0.01
- Epochs: 10
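A minimal sketch of this setup in PyTorch, assuming the warmup-plus-cosine schedule comes from `transformers.get_cosine_schedule_with_warmup`; the actual training script may wire it differently, and `total_steps` depends on the WikiText batch count.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(256, 256)  # stand-in for MiniRoBERTa
total_steps = 10 * 1_000           # placeholder: 10 epochs * steps per epoch

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-6,
    betas=(0.98, 0.999),
    eps=1e-6,
    weight_decay=0.01,
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.10 * total_steps),  # 10% warmup
    num_training_steps=total_steps,
)

# Per training step: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```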