Qwen2.5-1.5B — Slips IDS Security Summarization

Model Description

A fine-tuned version of Qwen2.5-1.5B-Instruct specialized for translating technical network security events from Slips IDS into clear, human-readable incident summaries with severity assessments.

Slips is a network intrusion detection system that generates DAG-structured alert logs — chains of related security events per source IP per time window. Raw Slips output is highly technical and difficult to interpret quickly. This model translates those logs into structured, concise summaries grouped by event type, with per-event severity labels (CRITICAL / HIGH / MEDIUM / LOW / INFO) and an overall severity breakdown.

The model was fine-tuned using SFT (Supervised Fine-Tuning) with best-of-N response selection: for each training incident, four candidate responses (from GPT-4o, GPT-4o-mini, Qwen2.5 3B, and Qwen2.5 1B) were scored by an LLM-as-judge, and the highest-scoring one was used as ground truth.

Quick Start

Ollama (recommended for local deployment)

ollama run stratosphere/qwen2.5-1.5b-slips-immune-summarization
# or a specific quantization:
ollama run stratosphere/qwen2.5-1.5b-slips-immune-summarization:q5_k_m
ollama run stratosphere/qwen2.5-1.5b-slips-immune-summarization:q8_0

Python (Transformers)

This model uses a merged prompt format: instructions and the DAG are combined into a single user message with no system prompt.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "stratosphere/qwen2.5-1.5b-slips-immune-summarization"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

dag_input = """
============================================================
Incident: abc123
Source IP: 192.168.1.100 | Timewindow: 5
Timeline: 2024-01-15 14:00:00 to 2024-01-15 15:00:00
Threat Level: 8.5 | Events: 42
...
"""  # paste your Slips DAG analysis here

user_message = f"""You are a security analyst. Your task is to translate technical security events into clear, concise, human-readable summaries and assess their severity.

INCIDENT METADATA:
- Incident ID: abc123
- Source IP: 192.168.1.100
- Timewindow: 5
- Accumulated Threat Level: 8.5
- Time Range: 2024-01-15 14:00:00 to 2024-01-15 15:00:00
- Total Events: 42

RAW EVENTS (Time | Description):
{dag_input}

YOUR TASK:
1. Transform the technical event descriptions into clear, readable summaries using plain language
2. Group identical or very similar events (e.g., 24 identical connections → one summary line)
3. Assess the severity of each event/group based on security impact:
   - CRITICAL: Active exploitation, data exfiltration, confirmed malware C2
   - HIGH: Scanning, suspicious connections, potential threats
   - MEDIUM: Anomalous but potentially benign behavior
   - LOW: Minor issues, likely false positives
   - INFO: Informational events, normal network behavior
4. Calculate the overall severity breakdown based on your assessments

OUTPUT FORMAT (match this structure exactly):

============================================================
Incident: <incident_id>
Source IP: <source_ip> | Timewindow: <timewindow>
Timeline: <start> to <end>
Threat Level: <threat_level> | Events: <count>

• HH:MM-HH:MM - [Your clear grouped summary] [YOUR_ASSESSED_SEVERITY]
• HH:MM - [Your clear summary] [YOUR_ASSESSED_SEVERITY]

Total Evidence: <count> events
Severity breakdown: [Your calculated breakdown, e.g., "High: 5, Medium: 3, Info: 2"]

RULES:
- Group identical events into ONE line
- Use time ranges (HH:MM-HH:MM) when showing grouped events
- Assess severity based on security impact, not just event type
- Keep descriptions clear and concise
- Just output the structured summary - no explanations or meta-commentary"""

messages = [{"role": "user", "content": user_message}]

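# return_dict=True yields input_ids and attention_mask together, whatever the library default is
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
prompt_len = inputs["input_ids"].shape[1]  # skip the prompt tokens when decoding
print(tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True))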
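The metadata block in the prompt above repeats values that already appear in the DAG header. A minimal sketch of filling it in automatically, assuming the header always follows the four-line layout shown in the example. `parse_dag_header` and its field names are hypothetical and not part of Slips or this repository:

```python
import re

# Hypothetical helper: pull the prompt metadata out of a Slips DAG header.
# Assumes the header layout shown in the example input above.
HEADER_PATTERNS = {
    "incident_id": r"^Incident:\s*(\S+)",
    "source_ip": r"^Source IP:\s*(\S+)",
    "timewindow": r"Timewindow:\s*(\S+)",
    "time_range": r"^Timeline:\s*(.+?)\s*$",
    "threat_level": r"^Threat Level:\s*([\d.]+)",
    "events": r"Events:\s*(\d+)",
}

def parse_dag_header(dag_text: str) -> dict:
    """Return the header fields found in dag_text; missing fields are omitted."""
    meta = {}
    for key, pattern in HEADER_PATTERNS.items():
        match = re.search(pattern, dag_text, flags=re.MULTILINE)
        if match:
            meta[key] = match.group(1)
    return meta

example = """Incident: abc123
Source IP: 192.168.1.100 | Timewindow: 5
Timeline: 2024-01-15 14:00:00 to 2024-01-15 15:00:00
Threat Level: 8.5 | Events: 42
"""
meta = parse_dag_header(example)
print(meta)
```

The returned values are strings, so they can be interpolated directly into the INCIDENT METADATA lines of the prompt.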

Training Details

Dataset

The training data is publicly available at stratosphere/immune-summary-sft-dataset.

  • Source: 532 incidents from real Slips IDS network captures
  • Responses: 4 model responses per incident (GPT-4o, GPT-4o-mini, Qwen2.5 3B, Qwen2.5 1B) used as candidate labels, scored by an LLM-as-judge
  • Selection: Best-of-N — highest-scoring response per incident used as training target
  • Filtering: Responses with judge score < 4 or summary token length outside [50, 400] discarded
  • Split: 90% train / 10% eval (stratified, seed=42)
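The selection and filtering rules above can be sketched as follows. This is an illustration only: the dictionary keys are hypothetical, and applying the filter before picking the best candidate is an assumption, since the order of the two steps is not stated here.

```python
from typing import Optional

MIN_SCORE = 4            # responses scoring below this are discarded
TOKEN_RANGE = (50, 400)  # allowed summary length in tokens, inclusive

def select_target(candidates: list[dict]) -> Optional[dict]:
    """Pick the best surviving candidate for one incident, or None.

    Each candidate is a dict with hypothetical keys:
    'model', 'summary', 'judge_score', 'n_tokens'.
    """
    lo, hi = TOKEN_RANGE
    kept = [
        c for c in candidates
        if c["judge_score"] >= MIN_SCORE and lo <= c["n_tokens"] <= hi
    ]
    # max() returns the first maximal element, so ties go to the earlier candidate.
    return max(kept, key=lambda c: c["judge_score"]) if kept else None

# Made-up scores and lengths, for illustration only.
candidates = [
    {"model": "GPT-4o", "summary": "(text)", "judge_score": 7, "n_tokens": 180},
    {"model": "GPT-4o-mini", "summary": "(text)", "judge_score": 9, "n_tokens": 450},
    {"model": "Qwen2.5 3B", "summary": "(text)", "judge_score": 8, "n_tokens": 120},
    {"model": "Qwen2.5 1B", "summary": "(text)", "judge_score": 3, "n_tokens": 90},
]
best = select_target(candidates)
print(best["model"])
```

In this made-up example the top-scored response is dropped for exceeding 400 tokens, so the next-best surviving candidate becomes the training target.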

Training Procedure

| Parameter | Value |
|---|---|
| Base model | unsloth/Qwen2.5-1.5B-Instruct |
| Training method | SFT (Supervised Fine-Tuning) |
| Framework | Unsloth + TRL SFTTrainer |
| LoRA rank (r) | 64 |
| LoRA alpha | 64 |
| LoRA dropout | 0.0 |
| RSLoRA | enabled (required at r=64) |
| LoRA targets | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Sequence length | 4096 |
| Batch size | 1 (effective: 16 via gradient accumulation) |
| Gradient accumulation steps | 16 |
| Learning rate | 2e-5 |
| LR scheduler | cosine |
| Warmup steps | 20 |
| Weight decay | 0.01 |
| Epochs | 3 |
| Optimizer | adamw_8bit |
| Precision | BF16 |
| Quantization (training) | 4-bit |
| Hardware | A100 80GB, MIG 20GB slice (e-infra.cz cloud) |
| Response masking | train_on_responses_only (loss computed on assistant turns only) |
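Two values in the table follow from simple arithmetic. A small sketch, assuming a single training device and the usual definitions of LoRA scaling (alpha / r) and rank-stabilized LoRA scaling (alpha / sqrt(r)); the variable names are illustrative:

```python
import math

r = 64                 # LoRA rank from the table
lora_alpha = 64        # LoRA alpha from the table
per_device_batch = 1
grad_accum_steps = 16

standard_scale = lora_alpha / r           # classic LoRA scaling: alpha / r
rslora_scale = lora_alpha / math.sqrt(r)  # rank-stabilized scaling: alpha / sqrt(r)
effective_batch = per_device_batch * grad_accum_steps  # assumes one device

print(standard_scale, rslora_scale, effective_batch)
```

With r = alpha = 64, the rank-stabilized variant scales the adapter update by 8 instead of 1, which is the practical difference the RSLoRA row refers to.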

Framework Versions

  • Unsloth: 2026.3.18
  • Transformers: (auto-detected)
  • PyTorch: (auto-detected)

Evaluation

Evaluated on 47 held-out Slips IDS incidents using gpt-oss-120b as an independent LLM-as-judge. For each incident the judge rated all 5 model responses in a single pass on a 1–10 scale, with model labels randomized per incident to reduce position bias. Inference was performed with the merged prompt format (instructions + DAG combined in a single user message, no system prompt) at 4096 max input tokens.
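One way to randomize model labels per incident is sketched below. It is an assumption about how such blinding could be done, not the evaluation code used here; `anonymize` and the "Response A" labels are hypothetical.

```python
import random

def anonymize(responses: dict, rng: random.Random) -> tuple:
    """Shuffle responses and relabel them 'Response A', 'Response B', ...

    Returns (labeled_responses, label_to_model). The second mapping stays
    with the evaluator and is used to map the judge's scores back to models.
    """
    models = list(responses)
    rng.shuffle(models)
    labels = [f"Response {chr(ord('A') + i)}" for i in range(len(models))]
    labeled = {label: responses[m] for label, m in zip(labels, models)}
    label_to_model = dict(zip(labels, models))
    return labeled, label_to_model

# Reusing one generator across incidents gives a different order per incident.
rng = random.Random(0)
responses = {"GPT-4o": "summary 1", "GPT-4o-mini": "summary 2", "finetuned": "summary 3"}
labeled, key = anonymize(responses, rng)
print(list(labeled))
```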

Overall Results

| Rank | Model | Avg Score | Avg Position | Win Rate | Wins |
|---|---|---|---|---|---|
| 1 | GPT-4o-mini | 6.89/10 | 1.81 | 42.6% | 20 |
| 2 | GPT-4o | 5.87/10 | 2.38 | 29.8% | 14 |
| 3 | Qwen2.5-1.5B (finetuned) | 4.70/10 | 3.21 | 19.1% | 9 |
| 4 | Qwen2.5 3B (baseline) | 4.57/10 | 3.40 | 8.5% | 4 |
| 5 | Qwen2.5 1B (baseline) | 3.36/10 | 4.19 | 0.0% | 0 |

The finetuned 1.5B model beats both untuned baselines (+1.34 avg score vs Qwen2.5 1B, +0.13 vs Qwen2.5 3B) and achieves a 19.1% win rate — higher than the 3B baseline (8.5%).
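The win rates and score deltas quoted above can be reproduced from the table. A short check, assuming exactly one winner per incident (the win counts sum to the 47 evaluated incidents):

```python
wins = {
    "GPT-4o-mini": 20,
    "GPT-4o": 14,
    "Qwen2.5-1.5B (finetuned)": 9,
    "Qwen2.5 3B (baseline)": 4,
    "Qwen2.5 1B (baseline)": 0,
}
total = sum(wins.values())  # one winner per incident
win_rate = {model: round(100 * w / total, 1) for model, w in wins.items()}

# Average-score gaps between the finetuned model and the two baselines.
delta_vs_1b = round(4.70 - 3.36, 2)
delta_vs_3b = round(4.70 - 4.57, 2)

print(total, win_rate, delta_vs_1b, delta_vs_3b)
```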

By Complexity

| Complexity | Events | Finetuned Score | GPT-4o-mini Score | GPT-4o Score |
|---|---|---|---|---|
| Simple | < 500 (31 incidents) | 5.45/10 | 6.74/10 | 5.61/10 |
| Medium | 500–1999 (7 incidents) | 3.43/10 | 6.71/10 | 5.71/10 |
| Complex | ≥ 2000 (9 incidents) | 3.11/10 | 7.56/10 | 6.89/10 |

On simple incidents the finetuned model is competitive with GPT-4o (5.45 vs 5.61). Medium and complex incidents are the primary weakness, consistent with context length limitations at 4096 tokens.
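As a consistency check, the per-tier scores of the finetuned model can be combined into an incident-weighted average. The sketch below uses only the incident counts and scores from the complexity table:

```python
# (incident count, finetuned score) per complexity tier, taken from the table
tiers = {
    "simple": (31, 5.45),
    "medium": (7, 3.43),
    "complex": (9, 3.11),
}
n = sum(count for count, _ in tiers.values())
weighted = sum(count * score for count, score in tiers.values()) / n

print(n, round(weighted, 2))
```

The weighted mean rounds to 4.70, matching the overall average score, and shows how much the 31 simple incidents dominate the headline number.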

By Category

| Category | Finetuned Score | Finetuned Win Rate |
|---|---|---|
| Malware (45 incidents) | 4.82/10 | 20.0% |
| Normal (2 incidents) | 2.00/10 | 0.0% |

Readability

An automated readability analysis of the model's outputs on the 47 held-out incidents reports a compression ratio of 0.26, with 373 abstracted bullets, 256 verbatim lines, and 44 markdown fences. This suggests the model largely learned to paraphrase and summarize rather than echo the input DAG, although verbatim lines are still common.
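The exact metric definitions are not given here. The sketch below shows one plausible reading, under loudly stated assumptions: compression ratio as output characters divided by input characters, a verbatim line as an output line that also appears in the input, and an abstracted bullet as a bullet line that does not. `readability_stats` is a hypothetical helper.

```python
FENCE = "`" * 3  # a markdown code-fence marker

def readability_stats(dag_text: str, summary: str) -> dict:
    """Rough readability counters for one (input DAG, model summary) pair."""
    input_lines = {line.strip() for line in dag_text.splitlines() if line.strip()}
    output_lines = [line.strip() for line in summary.splitlines() if line.strip()]
    verbatim = [line for line in output_lines if line in input_lines]
    bullets = [
        line for line in output_lines
        if line.startswith("•") and line not in input_lines
    ]
    return {
        # Assumed definition: output length relative to input length, in characters.
        "compression_ratio": len(summary) / len(dag_text) if dag_text else 0.0,
        "verbatim_lines": len(verbatim),
        "abstracted_bullets": len(bullets),
        "markdown_fences": sum(1 for line in output_lines if line.startswith(FENCE)),
    }

dag = (
    "Incident: x\n"
    "14:00 | connection to 1.2.3.4 port 443\n"
    "14:01 | connection to 1.2.3.4 port 443\n"
)
summary = "Incident: x\n• 14:00-14:01 - Repeated connections to 1.2.3.4 [MEDIUM]\n"
stats = readability_stats(dag, summary)
print(stats)
```

Under this reading, header lines that the output format asks the model to repeat count as verbatim, which may explain part of the verbatim total.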

Known Limitation: Complex Incident Performance

The model struggles on medium and complex incidents (≥ 500 events), scoring 3.43/10 (medium) and 3.11/10 (complex) with 0 wins in both tiers. Large DAGs exceed the effective 4096-token context window, resulting in inference errors on the largest inputs. Reducing the input token limit to match the training sequence length mitigates but does not fully resolve this.
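A minimal sketch of keeping a prompt inside a token budget by dropping trailing events. This is one possible mitigation and not the method used in the evaluation; `fit_to_budget` is hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
from typing import Callable

def fit_to_budget(header: str, events: list, budget: int,
                  count_tokens: Callable[[str], int]) -> list:
    """Keep as many leading events as fit in `budget` tokens alongside the header."""
    used = count_tokens(header)
    kept = []
    for event in events:
        cost = count_tokens(event)
        if used + cost > budget:
            break
        kept.append(event)
        used += cost
    return kept

# Stand-in token counter (assumption). With Transformers, something like
# `lambda s: len(tokenizer(s)["input_ids"])` could be passed instead.
count = lambda s: len(s.split())
events = [f"14:{i:02d} | connection to 10.0.0.{i}" for i in range(10)]
kept = fit_to_budget("Incident: abc123", events, budget=22, count_tokens=count)
print(len(kept), "of", len(events), "events kept")
```

Truncation discards the later events entirely, so the resulting summary cannot cover them, and the budget also has to leave room for the instruction text and the generated output.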

Intended Use

  • Automated triage of Slips IDS alerts for security analysts
  • First-pass summarization of network incident logs
  • Input to downstream reporting or ticketing workflows
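For downstream reporting or ticketing, the bracketed severity labels can be pulled out of a generated summary. A sketch, assuming the model follows the bullet format requested in the prompt; `severity_counts` is a hypothetical helper and the sample summary is made up:

```python
import re
from collections import Counter

SEVERITIES = ("CRITICAL", "HIGH", "MEDIUM", "LOW", "INFO")
# A bullet line ending in a bracketed severity label, e.g. "• 14:05 - ... [HIGH]"
BULLET = re.compile(r"^\s*•\s.*\[(%s)\]\s*$" % "|".join(SEVERITIES), re.IGNORECASE)

def severity_counts(summary: str) -> Counter:
    """Count summary bullet lines per severity label."""
    counts = Counter()
    for line in summary.splitlines():
        match = BULLET.match(line)
        if match:
            counts[match.group(1).upper()] += 1
    return counts

summary = """• 14:00-14:20 - Repeated connections to an unknown server [HIGH]
• 14:05 - Horizontal port scan on port 443 [HIGH]
• 14:30 - DNS query without a follow-up connection [INFO]
Total Evidence: 42 events
Severity breakdown: High: 2, Info: 1"""
counts = severity_counts(summary)
print(dict(counts))
```

The counts are per summary line, so a grouped line covering many raw events counts once. Comparing them with the model's own "Severity breakdown" line is a cheap sanity check before opening a ticket.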

Out-of-Scope Use

  • General-purpose chat or instruction following
  • Security domains outside network IDS (malware analysis, vulnerability scanning, etc.)
  • Non-English inputs

Citation

@misc{qwen2.5-1.5b-slips-immune,
  title        = {Qwen2.5-1.5B fine-tuned for Slips IDS security summarization},
  author       = {Stratosphere Laboratory, CTU Prague},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/stratosphere/qwen2.5-1.5b-slips-immune-summarization}}
}

Model Details

  • Model size: 1.5B params
  • Tensor type: FP16
  • License: Apache-2.0
  • Tags: Text Generation, Transformers, Safetensors, Network Security, IDS, SLIPS, Summarization, Cybersecurity, LoRA, SFT, TRL, Unsloth