Benchmarking Small Language Models and Small Reasoning Language Models on System Log Severity Classification
Abstract
Severity classification in system logs serves as a benchmark for evaluating model comprehension and deployability: retrieval-augmented generation improves performance for most small language models, while inference efficiency varies widely across them.
System logs are crucial for monitoring and diagnosing modern computing infrastructure, but their scale and complexity require reliable and efficient automated interpretation. Since severity levels are predefined metadata in system log messages, having a model merely classify them offers limited standalone practical value, revealing little about its underlying ability to interpret system logs. We argue that severity classification is more informative when treated as a benchmark for probing runtime log comprehension rather than as an end task. Using real-world journalctl data from Linux production servers, we evaluate nine small language models (SLMs) and small reasoning language models (SRLMs) under zero-shot, few-shot, and retrieval-augmented generation (RAG) prompting. The results reveal strong stratification. Qwen3-4B achieves the highest accuracy at 95.64% with RAG, while Gemma3-1B improves from 20.25% under few-shot prompting to 85.28% with RAG. Notably, the tiny Qwen3-0.6B reaches 88.12% accuracy despite weak performance without retrieval. In contrast, several SRLMs, including Qwen3-1.7B and DeepSeek-R1-Distill-Qwen-1.5B, degrade substantially when paired with RAG. Efficiency measurements further separate models: most Gemma and Llama variants complete inference in under 1.2 seconds per log, whereas Phi-4-Mini-Reasoning exceeds 228 seconds per log while achieving <10% accuracy. These findings suggest that (1) architectural design, (2) training objectives, and (3) the ability to integrate retrieved context under strict output constraints jointly determine performance. By emphasizing small, deployable models, this benchmark aligns with real-time requirements of digital twin (DT) systems and shows that severity classification serves as a lens for evaluating model competence and real-time deployability, with implications for root cause analysis (RCA) and broader DT integration.
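To make the evaluated setup concrete, below is a minimal sketch of FAISS-backed retrieval-augmented severity classification. The embedding model, the example journalctl lines, the prompt template, and the syslog label set are illustrative assumptions rather than the paper's exact configuration, and Qwen3-4B stands in for any of the nine evaluated models.

```python
# Illustrative RAG severity-classification sketch (not the paper's exact pipeline).
# Assumes sentence-transformers for embeddings, FAISS for retrieval, and an
# instruction-tuned small LM served via Hugging Face transformers.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]  # syslog levels

embedder = SentenceTransformer("all-MiniLM-L6-v2")               # embedding model: an assumption
generator = pipeline("text-generation", model="Qwen/Qwen3-4B")   # one of the evaluated SLMs

# Index a small pool of labeled journalctl lines (placeholder examples).
labeled_logs = [
    ("kernel: Out of memory: Killed process 1234 (java)", "err"),
    ("systemd[1]: Started Daily apt upgrade and clean activities.", "info"),
    ("sshd[999]: Failed password for invalid user admin from 10.0.0.5", "warning"),
]
vecs = embedder.encode([msg for msg, _ in labeled_logs], normalize_embeddings=True)
index = faiss.IndexFlatIP(vecs.shape[1])
index.add(np.asarray(vecs, dtype="float32"))

def classify(log_line: str, k: int = 2) -> str:
    """Retrieve k similar labeled logs and ask the model for a single severity label."""
    q = embedder.encode([log_line], normalize_embeddings=True).astype("float32")
    _, idx = index.search(q, k)
    examples = "\n".join(
        f"Log: {labeled_logs[i][0]}\nSeverity: {labeled_logs[i][1]}" for i in idx[0]
    )
    prompt = (
        f"Classify the severity of a Linux journalctl log line.\n"
        f"Allowed labels: {', '.join(SEVERITIES)}.\n\n"
        f"{examples}\n\nLog: {log_line}\nSeverity:"
    )
    out = generator(prompt, max_new_tokens=5, do_sample=False)[0]["generated_text"]
    tail = out[len(prompt):].strip()
    answer = tail.split()[0].lower() if tail else ""
    return answer if answer in SEVERITIES else "unknown"

print(classify("kernel: EXT4-fs error (device sda1): ext4_find_entry: reading directory lblock 0"))
```

Constraining the model to emit a single label from a fixed set mirrors the strict output constraints discussed in the abstract; under that constraint, retrieval quality and prompt formatting become the main levers behind the accuracy differences reported across models.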
Community
We evaluate 9 open-source models under zero-shot, few-shot, and RAG (FAISS-based retrieval) prompting, and measure both accuracy and per-log latency (measurement sketch below). Main takeaway: RAG can substantially help small models (Qwen3-4B: 95.64%, Gemma3-1B: 85.28%), but some reasoning-focused models degrade with retrieval, showing that retrieval integration is not uniform across architectures.
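A hedged sketch of the accuracy and per-log latency measurement follows; `classify()` and the (log line, gold label) pairs are placeholders for whatever model wrapper and journalctl test split are actually used.

```python
# Sketch of the accuracy / per-log latency evaluation loop; classify() and
# eval_set are assumptions standing in for the paper's actual harness.
import time

def evaluate(classify, eval_set):
    """eval_set: list of (log_line, gold_severity) pairs."""
    correct, latencies = 0, []
    for log_line, gold in eval_set:
        start = time.perf_counter()
        pred = classify(log_line)
        latencies.append(time.perf_counter() - start)
        correct += int(pred == gold)
    accuracy = correct / len(eval_set)
    sec_per_log = sum(latencies) / len(latencies)
    return accuracy, sec_per_log

# Usage: acc, sec_per_log = evaluate(classify, eval_set)
```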
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- LogICL: Distilling LLM Reasoning to Bridge the Semantic Gap in Cross-Domain Log Anomaly Detection (2025)
- LLM-SrcLog: Towards Proactive and Unified Log Template Extraction via Large Language Models (2025)
- TAAF: A Trace Abstraction and Analysis Framework Synergizing Knowledge Graphs and LLMs (2026)
- Towards Small Language Models for Security Query Generation in SOC Workflows (2025)
- A Systematic Analysis of Large Language Models with RAG-enabled Dynamic Prompting for Medical Error Detection and Correction (2025)
- Retrieval-Augmented Few-Shot Prompting Versus Fine-Tuning for Code Vulnerability Detection (2025)
- The Instruction Gap: LLMs get lost in Following Instruction (2025)