This project focused on deepening my theoretical understanding of Transformer-based models such as BERT and RoBERTa, and on applying that knowledge to build a complete model: MiniRoBERTa.

The model has 17.7 million parameters and was pre-trained for 10 epochs on the WikiText dataset. Due to limited resources, the pretraining was relatively lightweight, so the base model is not yet highly capable. However, the results indicate that the model learned meaningful representations and can even correctly predict masked tokens in certain contexts.
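Since the model was trained with a masked-language-modeling objective, the simplest way to probe it is the fill-mask pipeline. The sketch below assumes the checkpoint is loaded by its repo id, DornierDo17/RoBERTa_17.7M; the example sentence is purely illustrative, and given the lightweight pretraining the top predictions may or may not be the expected word.

```python
from transformers import pipeline

# Load the checkpoint through the fill-mask pipeline.
fill_mask = pipeline("fill-mask", model="DornierDo17/RoBERTa_17.7M")

# RoBERTa-style tokenizers use "<mask>" as the mask token.
for prediction in fill_mask("The capital of France is <mask>."):
    print(f"{prediction['token_str']!r}  score={prediction['score']:.3f}")
```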

Model Architecture

  • Attention head dim: 32
  • Hidden size: 256
  • Max sequence length: 128
  • Num heads: 8
  • Vocab size: 50265
  • Num Transformer layers: 6
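These hyperparameters map directly onto a Hugging Face RobertaConfig. The sketch below is an assumption-laden reconstruction: the card does not state the feed-forward (intermediate) size or the position-embedding budget, so 4 × hidden size and max sequence length + 2 are used as placeholders; with those values the parameter count comes out close to the 17.7M quoted above.

```python
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=50265,
    hidden_size=256,                  # 8 heads x 32-dim heads
    num_attention_heads=8,
    num_hidden_layers=6,
    intermediate_size=1024,           # assumed: 4 * hidden_size
    max_position_embeddings=128 + 2,  # assumed: max sequence length 128, plus
                                      # the two extra position ids RoBERTa reserves
)

model = RobertaForMaskedLM(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")
```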

Pretraining Configuration

  • Optimizer: Adam with betas = (0.98, 0.999), ε = 1e-6
  • Learning Rate: 1e-6
  • Scheduler: Cosine decay with 10% warmup
  • Weight Decay: 0.01
  • Epochs: 10
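A minimal sketch of this setup in PyTorch, continuing from the `model` built in the architecture sketch above. `steps_per_epoch` is a hypothetical placeholder that would normally be the length of the WikiText train dataloader.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Optimizer as described on the card: Adam, betas=(0.98, 0.999), eps=1e-6,
# learning rate 1e-6, weight decay 0.01.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-6,
    betas=(0.98, 0.999),
    eps=1e-6,
    weight_decay=0.01,
)

# Cosine decay with 10% warmup over 10 epochs.
steps_per_epoch = 1000  # hypothetical; use len(train_dataloader) in practice
num_training_steps = 10 * steps_per_epoch
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),
    num_training_steps=num_training_steps,
)
```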