GPU Efficiency in VLAI Model Training

Community article, published December 14, 2025

Experiences and Benchmarks from Months of VLAI Vulnerability Severity Classification Model Training

Preface

This document summarizes the benchmarking, training configuration, and performance results obtained while training the Vulnerability Severity Classification model across different GPU architectures throughout 2025.

The VLAI Vulnerability Severity Classification model developed at CIRCL is regularly updated and shared on Hugging Face. It has been presented in:

Bonhomme, C., & Dulaunoy, A. (2025). VLAI: A RoBERTa-Based Model for Automated Vulnerability Severity Classification (Version 1.4.0) [Computer software].
https://doi.org/10.48550/arXiv.2507.03607

All materials used to produce this technical report—including Matplotlib scripts, datasets, and other resources—are available in the Git repository:
https://github.com/vulnerability-lookup/gpu-vuln-bench

This report is also available as a PDF.

Environments used for benchmarking

GPU Architectures

The performance benchmarks were conducted on the GPU-accelerated systems described in Table 1. Each environment differs in CPU architecture, GPU type, and memory capacity, enabling us to evaluate model training efficiency across different hardware configurations.

Table 1: GPU-accelerated systems used for benchmarking in different environments.

Env | CPU | GPU | RAM | Location
A | 64 (AMD EPYC 9124 16-Core Processor) | 2 × NVIDIA L40S | 251.5 GB | CIRCL Server Lab (Luxembourg City)
B | 224 (Intel Xeon Platinum 8480+) | 2 × NVIDIA H100 NVL | 2,014 GB | LuxConnect Datacenter
C | 224 (Intel Xeon Platinum 8480+) | 4 × NVIDIA L40S | 2,014 GB | LuxConnect Datacenter

Each environment was used to execute a series of experiments designed to measure the throughput, memory utilization, and training time of the VLAI Vulnerability Severity Classification model. The following sections provide a detailed summary and analysis of these experiments.

Framework Versions

Main software and libraries used during the experiments:

  • Python: 3.12.3
  • Transformers: 4.57.1
  • PyTorch: 2.9.1+cu128
  • Datasets: 4.4.1
  • Tokenizers: 0.22.1

Dataset

The dataset used for training and evaluation is available on Hugging Face at the commit 2135755d8f42902de065d1ca30d800820b1e5cf1.

https://huggingface.co/datasets/CIRCL/vulnerability-scores

This is the updated version of the dataset referenced in arXiv.2507.03607.

Dataset statistics:

The test split accounts for 10% of the dataset and can be configured in VulnTrain (https://github.com/vulnerability-lookup/VulnTrain).
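As a hedged illustration (the actual split and filtering logic lives in VulnTrain), the dataset at the pinned commit can be loaded and split with the datasets library as follows:

```python
from datasets import load_dataset

# Load the dataset at the exact commit referenced above.
dataset = load_dataset(
    "CIRCL/vulnerability-scores",
    revision="2135755d8f42902de065d1ca30d800820b1e5cf1",
    split="train",
)

# Reproduce a 90/10 train/test split with a fixed seed.
# This is only a sketch; VulnTrain performs its own split and filtering.
splits = dataset.train_test_split(test_size=0.1, seed=42)
train_ds, test_ds = splits["train"], splits["test"]
print(len(train_ds), len(test_ds))
```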

VulnTrain is developed as part of the AIPITCH project and is integrated with Vulnerability-Lookup via ML-Gateway—a FastAPI-based local server that loads one or more pre-trained NLP models at startup and exposes them through a clean, RESTful API for inference.
For more details, see: https://github.com/vulnerability-lookup/ML-Gateway.

This dataset is periodically updated with data collected by Vulnerability-Lookup.

Model Training

Resulting models

The main model is available on Hugging Face:
https://huggingface.co/CIRCL/vulnerability-severity-classification-roberta-base

It is a fine-tuned version of RoBERTa-base trained on the CIRCL/vulnerability-scores dataset.
Intermediate models are also available on Hugging Face and are versioned for reproducibility:

The code of the trainer is available in the VulnTrain project.

Training Hyperparameters

The following hyperparameters were used during training:

  • Learning rate: 3e-05
  • Per device Batch Size: 8
  • Seed: 42
  • Optimizer: ADAMW_TORCH_FUSED
  • Scheduler: linear
  • Epochs: 5

We chose a per-device batch size of 8, a common default for a RoBERTa model.

RoBERTa-base is a medium-sized Transformer model (approximately 125 million parameters). A batch size of 8 per device is a standard, conservative choice that is unlikely to cause out-of-memory (OOM) errors on most modern GPUs (such as the NVIDIA V100 or A100, or even consumer cards like the RTX 3080/4080) for typical sequence lengths (e.g., 128 or 256 tokens).

3e-05 is a standard and safe learning rate for fine-tuning RoBERTa, with the optimizer using its default settings.
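As a sketch of how these hyperparameters map onto the Hugging Face TrainingArguments API (the exact configuration lives in VulnTrain; the output directory and evaluation/saving options below are illustrative assumptions):

```python
from transformers import TrainingArguments

# Sketch only: mirrors the hyperparameters listed above, not the exact
# VulnTrain configuration. Paths and evaluation options are illustrative.
training_args = TrainingArguments(
    output_dir="vulnerability-severity-classification-roberta-base",
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    num_train_epochs=5,
    seed=42,
    optim="adamw_torch_fused",
    lr_scheduler_type="linear",
    eval_strategy="epoch",
    save_strategy="epoch",
)
```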

A quick note on epochs and batches

A batch is a subset of the training data processed together in one forward and backward pass, producing gradients that update the model weights.
The batch size is the number of samples in that batch.

An epoch is one full pass over the entire training dataset.
Since the dataset is divided into batches, an epoch consists of multiple steps, where each step processes one batch and updates the model weights.

The effective batch size (batch size × number of GPUs) influences training dynamics:

  • Larger effective batches produce more stable gradients, require fewer optimization steps per epoch, and often converge faster.
  • Smaller batches introduce noise in the gradients, which can help escape poor local minima and improve generalization, but each epoch takes longer.
  • The impact on generalization also depends on using an appropriate learning rate.

RoBERTa often benefits from slightly larger batches. For example, using a batch of 32 samples per step can reduce gradient noise and stabilize learning, leading to quicker convergence.

Figure 1: Number of GPUs / Batch Size - Illustration 1

Figure 2: Number of GPUs / Batch Size - Illustration 2

Each colored rectangle represents a single training step, corresponding to one processed batch. An epoch ends once steps_per_epoch steps have been completed.

In our case, the training split contains 577,872 samples. The visualizations use a simplified view for learning purposes and show how batch size, number of GPUs, and dataset size affect the number of training steps per epoch.
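The relationship between dataset size, batch size, and number of GPUs can be expressed in a few lines of Python. This is a simplified sketch (it ignores gradient accumulation and uses the filtered sample count of 471,520 explained in the Environment A section below):

```python
import math

def steps_per_epoch(num_samples: int, per_device_batch: int, num_gpus: int) -> int:
    """Optimizer steps needed for one full pass over the training data.

    Simplified: ignores gradient accumulation and counts the last,
    possibly smaller, batch as one step.
    """
    effective_batch = per_device_batch * num_gpus
    return math.ceil(num_samples / effective_batch)

# 471,520 samples remain after filtering out entries without a usable severity label.
samples = 471_520

print(steps_per_epoch(samples, per_device_batch=8, num_gpus=2))  # 29470 (environments A and B)
print(steps_per_epoch(samples, per_device_batch=8, num_gpus=4))  # 14735 (environment C)
```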

Training results

Environment | Final Loss | Final Accuracy | Epochs to Converge | Effective Batch Size | Steps per Epoch
A | 0.2537 | 0.8232 | 5 | 16 | 29470
B | 0.2801 | 0.8230 | 5 | 16 | 29470
C | 0.3793 | 0.8173 | 5 | 32 | 14735

Table 2: Final training results for the different environments.

Results in terms of loss and accuracy are very similar, regardless of the system used.
Each experiment produced slightly different rankings, but the differences are minimal.

The number of samples per epoch is the same in each environment: 577,872, which corresponds to 90% of the dataset (the remaining 10% forms the test split).

Environment A

Theoretically, samples_per_epoch should match the number of samples in the training split (577,872), but our trainer filters out entries with missing or unknown severity labels. As previously explained, an epoch is one full pass over the entire training dataset.

Training Loss | Epoch | Step | Validation Loss | Accuracy | steps_per_epoch | samples_per_epoch
0.4999 | 1.0 | 29470 | 0.6657 | 0.7290 | 29470 | 471520
0.5279 | 2.0 | 58940 | 0.5911 | 0.7685 | 29470 | 471520
0.4775 | 3.0 | 88410 | 0.5392 | 0.7961 | 29470 | 471520
0.3753 | 4.0 | 117880 | 0.5125 | 0.8122 | 29470 | 471520
0.2537 | 5.0 | 147350 | 0.5169 | 0.8232 | 29470 | 471520

Table 3: Training results for an experiment with environment A

Environment B

Training Loss | Epoch | Step | Validation Loss | Accuracy | steps_per_epoch | samples_per_epoch
0.5379 | 1.0 | 29470 | 0.6573 | 0.7358 | 29470 | 471520
0.5714 | 2.0 | 58940 | 0.5810 | 0.7710 | 29470 | 471520
0.4636 | 3.0 | 88410 | 0.5412 | 0.7918 | 29470 | 471520
0.4738 | 4.0 | 117880 | 0.5098 | 0.8131 | 29470 | 471520
0.2801 | 5.0 | 147350 | 0.5175 | 0.8230 | 29470 | 471520

Table 4: Training results for an experiment with environment B

Environment C

Training Loss | Epoch | Step | Validation Loss | Accuracy | steps_per_epoch | samples_per_epoch
0.6270 | 1.0 | 14735 | 0.6594 | 0.7298 | 14735 | 471520
0.5675 | 2.0 | 29470 | 0.5780 | 0.7693 | 14735 | 471520
0.4690 | 3.0 | 44205 | 0.5363 | 0.7930 | 14735 | 471520
0.4373 | 4.0 | 58940 | 0.5069 | 0.8107 | 14735 | 471520
0.3793 | 5.0 | 73675 | 0.5071 | 0.8173 | 14735 | 471520

Table 5: Training results for an experiment with environment C

Note that $$147350 / 2 = 73675$$: environment C completes half as many total steps because its effective batch size is twice as large.

Comparisons

Cumulative Samples vs Steps

A common rule of thumb is the linear scaling rule: when the effective batch size is doubled, the learning rate is also doubled.
This behavior is confirmed in all of our experiments.
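For reference, the rule can be written as follows, where $$\eta$$ is the learning rate and $$B$$ the effective batch size:

$$\eta_{\text{scaled}} = \eta_{\text{base}} \times \frac{B_{\text{effective}}}{B_{\text{base}}}$$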

Validation Accuracy per Epoch

The chart shows the validation accuracy per epoch for the various experiments with the environments A, B, and C.
All experiments exhibit very similar accuracy trends.
The experiments in environment C reach higher accuracy more quickly in the early epochs, reflecting faster convergence per epoch due to a larger effective batch size (more GPUs × batch per device).
By the final epoch, all experiments achieve comparable accuracy (~0.82), indicating consistent model performance across the different setups.

Key Observations

  • More GPUs → larger effective batch → fewer steps per epoch
    • Example:
      • 4 GPUs × 256 samples → 1024 samples/step → fewer steps to process the full dataset
      • 2 GPUs × 256 samples → 512 samples/step → more steps to process the same dataset
  • Larger batch size per device → fewer steps per epoch, but each step processes more data.
  • Epoch duration is proportional to number of steps × time per step, so increasing GPUs or batch size reduces total training time per epoch.

Figures 1 and 2 make it easier to understand why Exp C (4 GPUs, batch size 8 per device → effective batch 32) completes fewer steps per epoch and thus runs faster per epoch than Exp A/B (2 GPUs, effective batch 16), even though the dataset and model are identical.
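Plugging the actual numbers into the step count confirms this:

$$\frac{471{,}520}{2 \times 8} = 29{,}470 \quad \text{(environments A and B)} \qquad \frac{471{,}520}{4 \times 8} = 14{,}735 \quad \text{(environment C)}$$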

Benchmark Comparisons Across Different Environments

Duration

Figure: Duration comparison.

Energy

Figure: Energy breakdown comparison.
Figure: Energy consumption comparison.
Figure: CPU/GPU/RAM energy breakdown.

Emissions

Figure: Emissions comparison.

GPU Power

Figure: GPU power.

Energy vs. Duration

Figure: Energy vs. duration.

GPU Power vs. Duration

Figure: GPU power vs. duration.

GPU Power vs. Energy

Figure: GPU power vs. energy.

Conclusion

From our perspective, Environment C offers the best balance of performance and energy efficiency. The quality of the resulting model is not significantly affected by these small variations in batch size, and may in fact remain completely unchanged. We plan to explore additional configurations in the future using our new equipment.

Evolution of Experiments in Environment A

We have been collecting data since February 2025 in Environment A, which is equipped with 2 × NVIDIA L40S GPUs. The charts below illustrate the evolution of our experiments over the course of the year. (Environments B and C are too recent to provide meaningful data at this time.)

Figure: Evolution of the duration.

Figure: Evolution of the energy consumption.

Figure: Evolution of the GPU power used.

The workload did not change enough to explain the summer peak:

  • The dataset size shows a nearly linear and steady growth.
  • We did not change the training hyperparameters or the base model (model size) in this configuration.
  • No changes were made to the GPU configuration.

Our hypothesis is thermal throttling and cooling overhead.
CodeCarbon estimates total energy consumption using the PUE (Power Usage Effectiveness) of the environment.
If PUE increases during summer due to higher cooling requirements, the estimated energy usage rises, even if the GPU workload remains identical.

When ambient temperatures increase, hardware may:

  • throttle its operating frequency,
  • reduce performance,
  • complete the same training steps over a longer duration.

As a result, even if instantaneous power consumption remains similar, the overall job duration increases, which leads to a higher total energy consumption (more Joules).
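A purely illustrative calculation (the figures below are assumptions, not measurements from our runs) shows how the two effects compound:

```python
# Purely illustrative numbers, not measurements from our experiments.
gpu_power_w = 300          # average power draw, assumed identical in both cases
duration_h_winter = 10.0   # hypothetical job duration in cooler months
duration_h_summer = 11.0   # same job, 10% longer due to thermal throttling
pue_winter = 1.4           # hypothetical Power Usage Effectiveness
pue_summer = 1.6           # higher PUE because of increased cooling needs

energy_winter_kwh = gpu_power_w * duration_h_winter / 1000 * pue_winter
energy_summer_kwh = gpu_power_w * duration_h_summer / 1000 * pue_summer

print(energy_winter_kwh)  # 4.2 kWh
print(energy_summer_kwh)  # 5.28 kWh, roughly 26% more for the same training job
```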

It must be noted that Environment A is located in the CIRCL Server Lab in Luxembourg City, where temperature is not controlled as strictly as in a datacenter.

We will monitor temperature and environmental metrics in future experiments to quantify these effects more precisely.

Future Works

The acquisition of our new equipment will allow us to conduct more experiments across a variety of configurations, enabling larger and more complex model training, which could have a greater impact (negative or positive) on model accuracy.

As a first demonstration, we recently developed a text generation model designed to assist in writing vulnerability descriptions. This is a fine-tuned version of GPT-2 XL, the 1.5B parameter variant of GPT-2. The model is available here: https://huggingface.co/CIRCL/vulnerability-description-generation-gpt2-xl

The model was trained in Environment C over approximately 34 hours. Training in Environment A was not feasible, even with the standard GPT-2 model, due to GPU memory limitations.

In addition, we plan to improve our CWE classification model using the vulnerability patches we have collected (https://huggingface.co/datasets/CIRCL/vulnerability-cwe-patch).

We also plan to experiment with a RAG (Retrieval-Augmented Generation) system, which combines retrieval from a knowledge base with generative models to produce answers. This approach is particularly suited for domain-specific information, in our case software vulnerabilities. Alternatively, we may explore a Question-Answering (QA) system, focused on providing factual answers directly from our dataset.

Resources

Related to CodeCarbon's RAM Energy Calculation

CodeCarbon primarily calculates the energy used by RAM through a power consumption model based on estimations, rather than direct hardware measurement, unless specific system features are available.

The power estimation for a "large server" is approximately 40 W (using 8 × 128 GB DIMMs with high efficiency scaling).

Reference: https://mlco2.github.io/codecarbon/methodology.html#ram

Estimation Methodology

The default method relies on a fixed power consumption value per installed RAM module (DIMM):

  1. Fixed Power per DIMM: A standardized, average power consumption value is assigned to each RAM module.
    • For x86 Systems (most standard laptops/desktops), this is typically set at 5 Watts per DIMM.
    • For ARM Systems (e.g., Raspberry Pi), a lower base power, like 1.5W per DIMM, or a constant of 3W, is used.
  2. Counting RAM Modules: CodeCarbon attempts to determine the number of installed RAM modules (DIMMs) on the system by querying the operating system.
  3. Total Power Calculation: The estimated total RAM power is calculated by multiplying these two values: $$\text{RAM Power (Watts)} = \text{Fixed Power per DIMM} \times \text{Number of RAM Slots Used}$$
  4. Scaling (for Servers): For systems with many DIMMs (e.g., servers with 8+ slots), a scaling factor is applied to reduce the power assigned to each additional DIMM, acknowledging that power consumption doesn't increase strictly linearly in large configurations.

Energy Calculation

Once the estimated RAM Power (in Watts) is determined, the Energy Consumed (in kilowatt-hours, or kWh) is calculated based on the duration of the code execution:

$$\text{Energy (kWh)} = \frac{\text{Power (Watts)} \times \text{Time (hours)}}{1000}$$
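A minimal sketch of this estimation logic (the DIMM count, per-DIMM power constant, and runtime below are illustrative; CodeCarbon's actual implementation adds platform detection and the large-server scaling described above):

```python
def estimated_ram_power_watts(num_dimms: int, watts_per_dimm: float = 5.0) -> float:
    """Default x86 estimate: a fixed power budget per installed DIMM.

    CodeCarbon additionally scales down the per-DIMM value on large
    servers (8+ DIMMs); that refinement is omitted here.
    """
    return num_dimms * watts_per_dimm

def energy_kwh(power_watts: float, duration_hours: float) -> float:
    """Energy (kWh) = Power (W) x Time (h) / 1000."""
    return power_watts * duration_hours / 1000

# Illustrative example: 4 DIMMs and a 5-hour training run.
ram_power = estimated_ram_power_watts(num_dimms=4)   # 20 W
print(energy_kwh(ram_power, duration_hours=5.0))     # 0.1 kWh
```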

Direct Measurement Alternative

On Linux systems, CodeCarbon offers a more accurate method with the Intel Running Average Power Limit (RAPL) interface.

  • If the rapl_include_dram parameter is set to True, CodeCarbon will attempt to use the direct power measurement for the DRAM (memory subsystem) provided by RAPL, overriding the fixed power estimation model. This method offers the most precise consumption data when available.

Reference: https://mlco2.github.io/codecarbon/parameters.html
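A hedged sketch of what enabling this looks like in code, assuming rapl_include_dram is accepted as a constructor argument as described in the parameters reference above:

```python
from codecarbon import EmissionsTracker

def train():
    """Placeholder for the actual training loop (VulnTrain in our case)."""
    pass

# Assumption: rapl_include_dram is passed to the tracker constructor,
# per the CodeCarbon parameters reference linked above.
tracker = EmissionsTracker(rapl_include_dram=True)
tracker.start()
try:
    train()
finally:
    emissions = tracker.stop()  # estimated emissions (kg CO2eq) for the run
    print(emissions)
```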

Related to CodeCarbon's GPU Energy Calculation

The energy consumption is tracked using the nvidia-ml-py library.

Reference: https://mlco2.github.io/codecarbon/methodology.html#gpu

Environmental Considerations

Our server room is hosted in LuxConnect’s data centers, which are powered entirely by renewable energy (https://www.luxconnect.lu/infrastructure).

Literature

Bonhomme, C., & Dulaunoy, A. (2025). VLAI: A RoBERTa-Based Model for Automated Vulnerability Severity Classification (Version 1.4.0) [Computer software].
https://doi.org/10.48550/arXiv.2507.03607

Feedback

Feel free to share your feedback at [email protected] or publicly:
https://github.com/vulnerability-lookup/gpu-vuln-bench/issues

Funding

EU Funding

AIPITCH aims to create advanced artificial intelligence-based tools supporting key operational services in cyber defense. These include technologies for early threat detection, automatic malware classification, and improvement of analytical processes through the integration of Large Language Models (LLM). The project has the potential to set new standards in the cybersecurity industry.

The project leader is NASK National Research Institute. The international consortium includes:

  • CIRCL (Computer Incident Response Center Luxembourg), Luxembourg
  • The Shadowserver Foundation, Netherlands
  • NCBJ (National Centre for Nuclear Research), Poland
  • ABI LAB (Centre of Research and Innovation for Banks), Italy

Funded by the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Cybersecurity Competence Centre. Neither the European Union nor the European Cybersecurity Competence Centre can be held responsible for them.
