๐Ÿ›ก๏ธ Guard Safety Classifier

A multi-task safety classifier built on DeBERTa-v3-small and trained on 3.9M+ samples for content moderation and safety detection.

## 🎯 Model Tasks

This model performs three simultaneous predictions:

1. **Binary Safety Classification** (`is_safe`)
   - ✅ Safe content
   - ⚠️ Unsafe content
2. **Single-Label Category Classification** (`category`)
   - Identifies the primary safety concern category
3. **Multi-Label Categories** (`categories`)
   - Can detect multiple safety issues simultaneously
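All three heads return raw logits, but each is decoded differently: softmax over two classes for `is_safe`, argmax for `category`, and a per-class sigmoid threshold for `categories`. A minimal, dependency-free sketch of that decoding (the label names and the 0.5 threshold are illustrative, not taken from the card):

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode(is_safe_logits, category_logits, categories_logits,
           category_labels, multi_labels, threshold=0.5):
    """Decode the three heads: binary softmax, single-label argmax,
    per-class sigmoid threshold for the multi-label head."""
    is_safe = softmax(is_safe_logits)[1] > threshold
    category = category_labels[
        max(range(len(category_logits)), key=category_logits.__getitem__)
    ]
    multi = [multi_labels[i]
             for i, z in enumerate(categories_logits)
             if sigmoid(z) > threshold]
    return is_safe, category, multi
```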

## 📊 Performance Metrics

| Metric | Score |
|---|---|
| `is_safe` accuracy | 92.76% |
| `category` F1 | 0.5037 |
| `categories` F1 | 0.9068 |
| Test loss | 1.0233 |
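For reference, these are standard accuracy and F1 computations; a pure-Python sketch on toy data (the card does not state which F1 averaging, e.g. macro or micro, the reported scores use):

```python
def accuracy(y_true, y_pred):
    """Fraction of exactly matching predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_binary(y_true, y_pred):
    """F1 for a single positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```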

## 🚀 Quick Start

```python
import pickle

import torch
from transformers import AutoTokenizer

# Load the tokenizer
model_name = "YOUR_USERNAME/guard-safety-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the label encoders first so the head sizes can be derived from them
with open("label_encoders.pkl", "rb") as f:
    encoders = pickle.load(f)
le_category = encoders["le_category"]
mlb = encoders["mlb"]

# Load the model architecture (MultiTaskSafetyClassifier is not part of
# transformers; it ships alongside this repository)
from your_model_file import MultiTaskSafetyClassifier

model = MultiTaskSafetyClassifier(
    model_name="microsoft/deberta-v3-small",
    num_categories=len(le_category.classes_),
    num_multi_labels=len(mlb.classes_),
)

# Load weights
model.load_state_dict(torch.load("model_weights.pt", map_location="cpu"))
model.eval()

# Inference
text = "Your text here"
inputs = tokenizer(text, return_tensors="pt", max_length=128,
                   truncation=True, padding=True)

with torch.no_grad():
    outputs = model(**inputs)

is_safe = torch.softmax(outputs["is_safe"], dim=1)[0][1].item() > 0.5
category = le_category.inverse_transform([outputs["category"].argmax(1).item()])[0]
categories = mlb.inverse_transform(
    (torch.sigmoid(outputs["categories"]) > 0.5).cpu().numpy()
)[0]

print(f"Is Safe: {is_safe}")
print(f"Category: {category}")
print(f"Categories: {list(categories)}")
```

๐Ÿ—๏ธ Model Architecture

  • Base Model: microsoft/deberta-v3-small (141M parameters)
  • Hidden Size: 768
  • Max Sequence Length: 128 tokens
  • Training Framework: PyTorch + Transformers
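The Quick Start imports `MultiTaskSafetyClassifier` from `your_model_file` without showing it. A hedged sketch of what such a class plausibly looks like, given the description above: a shared encoder with three linear heads over a [CLS]-style pooled embedding. The encoder is injected here so the sketch runs without downloading weights; the real class presumably builds it internally with `transformers.AutoModel.from_pretrained(model_name)`.

```python
import torch
import torch.nn as nn

class MultiTaskSafetyClassifier(nn.Module):
    """Hypothetical reconstruction: one shared encoder, three output heads."""

    def __init__(self, encoder, hidden_size=768,
                 num_categories=26, num_multi_labels=28):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(0.1)  # dropout rate is an assumption
        self.is_safe_head = nn.Linear(hidden_size, 2)                    # safe / unsafe
        self.category_head = nn.Linear(hidden_size, num_categories)      # single-label
        self.categories_head = nn.Linear(hidden_size, num_multi_labels)  # multi-label

    def forward(self, input_ids, attention_mask=None, **kwargs):
        # Encoder output: (batch, seq_len, hidden); pool the first token
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = self.dropout(hidden[:, 0])
        return {
            "is_safe": self.is_safe_head(pooled),
            "category": self.category_head(pooled),
            "categories": self.categories_head(pooled),
        }
```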

## 📚 Training Details

- **Dataset:** budecosystem/guardrail-training-data
- **Training Samples:** 3,182,844
- **Validation Samples:** 397,855
- **Test Samples:** 397,856
- **Batch Size:** 64
- **Learning Rate:** 2e-5
- **Epochs:** 1
- **Optimizer:** AdamW with linear warmup
- **Hardware:** NVIDIA Tesla T4 (16GB)
- **Training Time:** ~8 hours
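A multi-task model like this is usually trained by summing one loss per head. A hedged sketch of the combined objective, assuming cross-entropy for the two single-label heads, binary cross-entropy with logits for the multi-label head, and equal task weights (the card does not state the actual loss formulation or weighting):

```python
import torch
import torch.nn.functional as F

def multitask_loss(outputs, is_safe_target, category_target, categories_target):
    """Sum of per-head losses with equal weights (weighting is an assumption)."""
    loss_safe = F.cross_entropy(outputs["is_safe"], is_safe_target)
    loss_category = F.cross_entropy(outputs["category"], category_target)
    loss_multi = F.binary_cross_entropy_with_logits(
        outputs["categories"], categories_target
    )
    return loss_safe + loss_category + loss_multi
```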

๐Ÿท๏ธ Categories

The model can identify the following safety categories:

[
  "animal_abuse",
  "benign",
  "child_abuse",
  "code_vulnerabilities",
  "controversial_topics_politics",
  "cwe_compliance",
  "dangerous_expert_advice",
  "discrimination_stereotype_injustice",
  "drug_abuse_weapons_banned_substance",
  "financial_crime_property_crime_theft",
  "fraud_deception_misinformation",
  "gender_bias",
  "hate_speech_offensive_language",
  "jailbreak_prompt_injection",
  "malware_hacking_cyberattack",
  "misinformation_regarding_ethics_laws_and_safety",
  "mitre_compliance",
  "non_violent_unethical_behavior",
  "orientation_bias",
  "privacy_violation",
  "race_bias",
  "religious_bias",
  "self_harm",
  "sexually_explicit_adult_content",
  "terrorism_organized_crime",
  "violence_aiding_and_abetting_incitement"
]

## 🔢 Multi-Label Classes

```json
[
  " ",
  ",",
  "_",
  "a",
  "b",
  "c",
  "d",
  "e",
  "f",
  "g",
  "h",
  "i",
  "j",
  "k",
  "l",
  "m",
  "n",
  "o",
  "p",
  "r",
  "s",
  "t",
  "u",
  "v",
  "w",
  "x",
  "y",
  "z"
]
```

## ⚙️ Configuration

The full model configuration is available in `config.json`.

## 📄 License

Apache 2.0

## 📮 Contact

For questions or issues, please open an issue on the model repository.
