---
library_name: peft
pipeline_tag: text-generation
license: bigcode-openrail-m
language:
  - code
base_model:
  - bigcode/starcoder2-15b-instruct-v0.1
tags:
  - securecode
  - security
  - owasp
  - code-generation
  - secure-coding
  - lora
  - qlora
  - vulnerability-detection
  - cybersecurity
datasets:
  - scthornton/securecode
model-index:
  - name: starcoder2-15b-securecode
    results: []
---

# StarCoder2 15B SecureCode

[![Parameters](https://img.shields.io/badge/parameters-15B-blue.svg)](#model-details) [![Dataset](https://img.shields.io/badge/dataset-2,185_examples-green.svg)](https://huggingface.co/datasets/scthornton/securecode) [![OWASP](https://img.shields.io/badge/OWASP-Top_10_2021_+_LLM_Top_10-red.svg)](#security-coverage) [![Method](https://img.shields.io/badge/method-QLoRA-purple.svg)](#training-details) [![License](https://img.shields.io/badge/license-BigCode_OpenRAIL--M-orange.svg)](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement)

**The flagship open-source model in the SecureCode collection: security-aware code generation, fine-tuned on 2,185 real-world vulnerability examples covering the OWASP Top 10 (2021) and the OWASP LLM Top 10 (2025).**

[Dataset](https://huggingface.co/datasets/scthornton/securecode) | [Paper](https://huggingface.co/papers/2512.18542) | [Model Collection](https://huggingface.co/collections/scthornton/securecode) | [perfecXion.ai](https://perfecxion.ai) | [Blog Post](https://huggingface.co/blog/scthornton/securecode-models)

---

## What This Model Does

StarCoder2 15B SecureCode generates security-aware code: fine-tuning taught the model to recognize vulnerability patterns and produce secure implementations. Every training example includes:

- **Real-world incident grounding** — Tied to documented CVEs and breach reports
- **Vulnerable + secure implementations** — Side-by-side comparison
- **Attack demonstrations** — Concrete exploit code
- **Defense-in-depth guidance** — SIEM rules, logging, monitoring, infrastructure hardening
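
To illustrate the vulnerable-plus-secure pairing described above (this snippet is illustrative, not an actual example from the dataset), here is a classic SQL injection pair in Python:

```python
import sqlite3

def get_user_vulnerable(conn, username):
    # VULNERABLE: string interpolation lets "x' OR '1'='1" rewrite the query
    query = f"SELECT id, name FROM users WHERE name = '{username}'"
    return conn.execute(query).fetchall()

def get_user_secure(conn, username):
    # SECURE: parameterized query -- the driver treats input as data, not SQL
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (username,)
    ).fetchall()

# Demonstration with an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice'), (2, 'bob')")

payload = "x' OR '1'='1"
print(len(get_user_vulnerable(conn, payload)))  # 2 -- the attack returns every row
print(len(get_user_secure(conn, payload)))      # 0 -- no user is literally named the payload
```

Dataset examples extend this pattern with the CVE grounding, exploit demonstration, and defense-in-depth guidance listed above.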

---

## Model Details

| Property | Value |
|----------|-------|
| **Base Model** | [bigcode/starcoder2-15b-instruct-v0.1](https://huggingface.co/bigcode/starcoder2-15b-instruct-v0.1) |
| **Parameters** | 15B |
| **Architecture** | StarCoder2 (decoder-only transformer) |
| **Method** | QLoRA (4-bit quantization + LoRA) |
| **LoRA Rank** | 16 |
| **LoRA Alpha** | 32 |
| **Training Data** | [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode) (2,185 examples) |
| **Training Time** | ~1h 40min |
| **Hardware** | 2x NVIDIA A100 40GB (GCP) |
| **Framework** | PEFT 0.18.1, Transformers 5.1.0, PyTorch 2.7.1 |
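
The adapter settings in the table correspond to a `peft` `LoraConfig`. A minimal sketch — the rank and alpha come from the table, but `target_modules` and dropout are assumptions typical for attention-block LoRA, not confirmed by this card:

```python
from peft import LoraConfig

# Sketch of the adapter config implied by the table above.
# target_modules and lora_dropout are assumptions, not stated on the card.
lora_config = LoraConfig(
    r=16,             # LoRA rank (from the table)
    lora_alpha=32,    # LoRA alpha (from the table)
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```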

---

## Quick Start

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

# Load the base model in 4-bit (matches the QLoRA training setup)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder2-15b-instruct-v0.1",
    device_map="auto",
    quantization_config=bnb_config,
)

# Attach the SecureCode LoRA adapter
model = PeftModel.from_pretrained(base_model, "scthornton/starcoder2-15b-securecode")
tokenizer = AutoTokenizer.from_pretrained("scthornton/starcoder2-15b-securecode")

# Generate secure code (do_sample=True so temperature takes effect)
prompt = "Write a secure JWT authentication handler in Python with proper token validation"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

---

## Training Details

| Hyperparameter | Value |
|----------------|-------|
| Learning Rate | 2e-4 |
| Batch Size | 1 |
| Gradient Accumulation | 16 |
| Epochs | 3 |
| Scheduler | Cosine |
| Warmup Steps | 100 |
| Optimizer | paged_adamw_8bit |
| Max Sequence Length | 2048 |
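
These hyperparameters map onto a standard `transformers` `TrainingArguments` setup. A hedged sketch — `output_dir`, logging cadence, and `bf16` are illustrative assumptions, not from the card; note that batch size 1 × gradient accumulation 16 × 2 GPUs gives an effective batch size of 32:

```python
from transformers import TrainingArguments

# Sketch of the training setup implied by the table; output_dir,
# logging_steps, and bf16 are assumptions, not stated on the card.
training_args = TrainingArguments(
    output_dir="starcoder2-15b-securecode",  # illustrative path
    learning_rate=2e-4,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,          # effective batch 32 across 2 GPUs
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    optim="paged_adamw_8bit",
    bf16=True,                               # assumed for A100 hardware
    logging_steps=10,
)
```

The max sequence length of 2048 is applied at tokenization/trainer level rather than through `TrainingArguments`.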

### Dataset Breakdown

| Component | Examples | Coverage |
|-----------|----------|----------|
| Web Security (OWASP Top 10:2021) | 1,378 | 12 languages, 9 frameworks |
| AI/ML Security (OWASP LLM Top 10:2025) | 750 | Prompt injection, RAG poisoning, model theft |
| Framework-Specific Additions | 219 | Django, Flask, Express, Spring Boot, etc. |
| **Total** | **2,185** | **OWASP Top 10:2021 + LLM Top 10:2025** |

---

## SecureCode Model Collection

| Model | Parameters | Base | Training Time | Link |
|-------|------------|------|---------------|------|
| Llama 3.2 3B | 3B | Meta Llama 3.2 | 1h 5min | [scthornton/llama-3.2-3b-securecode](https://huggingface.co/scthornton/llama-3.2-3b-securecode) |
| Qwen Coder 7B | 7B | Qwen 2.5 Coder | 1h 24min | [scthornton/qwen-coder-7b-securecode](https://huggingface.co/scthornton/qwen-coder-7b-securecode) |
| CodeGemma 7B | 7B | Google CodeGemma | 1h 27min | [scthornton/codegemma-7b-securecode](https://huggingface.co/scthornton/codegemma-7b-securecode) |
| DeepSeek Coder 6.7B | 6.7B | DeepSeek Coder | 1h 15min | [scthornton/deepseek-coder-6.7b-securecode](https://huggingface.co/scthornton/deepseek-coder-6.7b-securecode) |
| CodeLlama 13B | 13B | Meta CodeLlama | 1h 32min | [scthornton/codellama-13b-securecode](https://huggingface.co/scthornton/codellama-13b-securecode) |
| Qwen Coder 14B | 14B | Qwen 2.5 Coder | 1h 19min | [scthornton/qwen2.5-coder-14b-securecode](https://huggingface.co/scthornton/qwen2.5-coder-14b-securecode) |
| **StarCoder2 15B** | **15B** | **BigCode StarCoder2** | **1h 40min** | **This model** |
| Granite 20B | 20B | IBM Granite Code | 1h 19min | [scthornton/granite-20b-code-securecode](https://huggingface.co/scthornton/granite-20b-code-securecode) |

---

## Citation

```bibtex
@misc{thornton2025securecode,
  title={SecureCode v2.0: A Production-Grade Dataset for Training Security-Aware Code Generation Models},
  author={Thornton, Scott},
  year={2025},
  publisher={perfecXion.ai},
  url={https://perfecxion.ai/articles/securecode-v2-dataset-paper.html},
  note={Model: https://huggingface.co/scthornton/starcoder2-15b-securecode}
}
```

---

## Links

- **Dataset**: [scthornton/securecode](https://huggingface.co/datasets/scthornton/securecode) (2,185 examples)
- **Paper**: [SecureCode v2.0](https://huggingface.co/papers/2512.18542)
- **Model Collection**: [SecureCode Models](https://huggingface.co/collections/scthornton/securecode) (8 models)
- **Blog Post**: [Training Security-Aware Code Models](https://huggingface.co/blog/scthornton/securecode-models)
- **Publisher**: [perfecXion.ai](https://perfecxion.ai)

---

## License

BigCode OpenRAIL-M