This project focused on deepening my theoretical understanding of Transformer-based models such as BERT and RoBERTa, and on applying that knowledge to build a small but complete model: MiniRoBERTa.
The model has 17.7 million parameters and was pretrained for 10 epochs on the WikiText dataset. Because resources were limited, pretraining was relatively lightweight, so the base model is not yet highly capable. Even so, the results indicate that it learned meaningful representations and can correctly predict masked tokens in certain contexts.
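As an illustration of that kind of spot check, the snippet below runs a fill-mask query. It assumes the checkpoint and tokenizer are exported in a Hugging Face-compatible format under a hypothetical `checkpoints/miniroberta` path; the repo's own loading code may differ.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Hypothetical path: wherever the pretrained MiniRoBERTa checkpoint is saved.
ckpt = "checkpoints/miniroberta"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForMaskedLM.from_pretrained(ckpt).eval()

text = f"The capital of France is {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

# Top-5 predictions for the masked position
mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
top5 = logits[0, mask_idx].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top5.tolist()))
```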
Model architecture:
- Attention head dim: 32
- Hidden size: 256
- Max sequence length: 128
- Num heads: 8
- Vocab size: 50265
- Num Transformer layers: 6
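These dimensions are consistent with the reported 17.7M parameters. The estimate below assumes two details not listed above: a RoBERTa-style feed-forward block with intermediate size 4 × hidden size, and an LM head whose decoder weights are tied to the token embeddings.

```python
# Rough parameter count from the architecture listed above.
# Assumptions (not stated in the config): feed-forward intermediate size
# of 4 * hidden_size, and an LM head tied to the token-embedding matrix.
hidden, layers, vocab, max_len = 256, 6, 50265, 128
ffn = 4 * hidden  # assumed intermediate size

embeddings = vocab * hidden + max_len * hidden + 2 * hidden  # tokens + positions + LayerNorm
per_layer = (
    4 * (hidden * hidden + hidden)  # Q, K, V, and output projections (+ biases)
    + (hidden * ffn + ffn)          # feed-forward up-projection
    + (ffn * hidden + hidden)       # feed-forward down-projection
    + 2 * 2 * hidden                # two LayerNorms (weight + bias each)
)
total = embeddings + layers * per_layer
print(f"~{total / 1e6:.1f}M parameters")  # ~17.6M; the small remainder vs. 17.7M is the MLM head
```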
Pretraining configuration:
- Optimizer: Adam with betas = (0.98, 0.999), ε = 1e-6
- Learning Rate: 1e-6
- Scheduler: Cosine decay with 10% warmup
- Weight Decay: 0.01
- Epochs: 10
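A minimal sketch of this setup in PyTorch, assuming the warmup-plus-cosine schedule comes from `transformers.get_cosine_schedule_with_warmup`; the actual training script may wire it differently, and `total_steps` depends on the WikiText batch count.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(256, 256)  # stand-in for MiniRoBERTa
total_steps = 10 * 1_000           # placeholder: 10 epochs * steps per epoch

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-6,
    betas=(0.98, 0.999),
    eps=1e-6,
    weight_decay=0.01,
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.10 * total_steps),  # 10% warmup
    num_training_steps=total_steps,
)

# Per training step: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```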