# octodex-v1
A fine-tuned code retrieval embedding model optimized for locating files that need to be modified to resolve a software engineering task. Built for AI coding agents targeting SWE-bench.
## What This Model Does
Given a natural-language description of a bug or feature request, octodex-v1 retrieves the source code files most likely to require modification. Unlike general-purpose code search models that optimize for "relevance," octodex-v1 is trained on helpfulness — which files a developer actually needed to edit to resolve the issue.
This distinction matters. A generic code search model might surface the file that mentions the error, but octodex-v1 surfaces the file that fixes it.
## Performance
Evaluated on SWE-bench Verified (500 instances, held out during training):
| Metric | Baseline (unfinetuned bge-code-v1) | octodex-v1 | Improvement |
|---|---|---|---|
| MRR | 0.1002 | 0.6377 | +537% |
| MAP | 0.0848 | 0.5631 | +564% |
| Recall@1 | 0.031 | 0.368 | +12x |
| Recall@5 | 0.094 | 0.690 | +7.3x |
| Recall@10 | 0.152 | 0.789 | +5.2x |
| Recall@20 | 0.251 | 0.865 | +3.4x |
| Accuracy@1 | 0.040 | 0.496 | +12.4x |
| Accuracy@10 | 0.218 | 0.888 | +4.1x |
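For reference, the ranking metrics above can be computed per instance as in this illustrative sketch (not the actual evaluation harness), where each instance has a ranked list of retrieved files and a gold set of files edited in the patch:

```python
def mrr(ranked_lists, gold_sets):
    """Mean Reciprocal Rank: average of 1/rank of the first gold file."""
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_sets):
        for rank, path in enumerate(ranked, start=1):
            if path in gold:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def recall_at_k(ranked_lists, gold_sets, k):
    """Fraction of gold files found in the top-k, averaged over instances."""
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_sets):
        total += len(gold & set(ranked[:k])) / len(gold)
    return total / len(ranked_lists)

# Toy example with two instances (hypothetical file names)
ranked = [["a.py", "b.py", "c.py"], ["x.py", "y.py", "z.py"]]
gold = [{"b.py"}, {"x.py", "z.py"}]
print(mrr(ranked, gold))             # (1/2 + 1/1) / 2 = 0.75
print(recall_at_k(ranked, gold, 2))  # (1/1 + 1/2) / 2 = 0.75
```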
## Model Details
- Base model: BAAI/bge-code-v1 (2B params, Qwen2 backbone, 1536-dim embeddings)
- Fine-tuning: LoRA (rank=16, alpha=16, targets=q_proj+v_proj, dropout=0.1)
- Trainable params: 2.18M (0.14% of total)
- Loss: MultipleNegativesRankingLoss (InfoNCE) with in-batch negatives + explicit hard negatives
- Training data: 75,193 triples from 21,302 SWE-bench instances (train+test splits)
- Eval data: 948 triples from 500 SWE-bench Verified instances
- Max sequence length: 512 tokens
- Training hardware: Apple M4 Pro (64GB unified, 20-core GPU, MPS)
- Training time: 229.4 hours (9.6 days), 7,050 steps, 3 epochs
- Best checkpoint: Step 6,000
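The loss can be sketched in NumPy, assuming the standard InfoNCE formulation that MultipleNegativesRankingLoss implements (the temperature value here is illustrative; the scale actually used in training is not stated in this card):

```python
import numpy as np

def info_nce_loss(q, p, temperature=0.05):
    """InfoNCE with in-batch negatives.

    q: (batch, dim) L2-normalized query embeddings.
    p: (batch, dim) L2-normalized positive embeddings; row i is the
       positive for query i, and every other row acts as a negative.
       (Explicit hard negatives would appear as extra rows of p.)
    """
    logits = (q @ p.T) / temperature                 # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_probs)))       # gold label is the diagonal
```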
## Training Data Construction
Each training triple is:
- Query: Issue title + body (the bug report / feature request)
- Positive: Source code chunk from a file that was actually modified in the gold patch
- Negative: Source code chunk from the same repo that was NOT modified (hard negative, selected by file proximity)
Source code is chunked at function/class boundaries using tree-sitter (Python, Go, JavaScript, TypeScript).
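The chunking step can be approximated with a stdlib-only sketch. This uses Python's `ast` module as a simplified stand-in for tree-sitter (which additionally covers Go, JavaScript, and TypeScript), splitting at top-level function and class boundaries:

```python
import ast

def chunk_python_source(source: str) -> list[str]:
    """Split Python source into chunks at top-level function/class
    boundaries. Simplified stand-in for the tree-sitter chunker."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and inclusive
            chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks

src = "def f():\n    return 1\n\nclass C:\n    pass\n"
for chunk in chunk_python_source(src):
    print(chunk)
    print("---")
```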
### What's NOT in the training data
SWE-bench Pro (731 instances) was never used for training or evaluation. It is the held-out test set for downstream agent evaluation. There is zero repository overlap between training and Pro instances.
## Provided Artifacts
| File | Size | Description |
|---|---|---|
| `adapter_model.safetensors` | ~8.7 MB | LoRA adapter weights (apply to the bge-code-v1 base) |
| `adapter_config.json` | ~1 KB | PEFT/LoRA configuration |
| `onnx/model-int8.onnx` | ~1.4 GB | INT8-quantized ONNX model (merged base + LoRA, self-contained) |
| `tokenizer.json` | ~11 MB | HuggingFace fast tokenizer |
| `config.json` | ~1.3 KB | Model configuration |
## Usage
### Python (sentence-transformers + PEFT)
```python
from sentence_transformers import SentenceTransformer
from peft import PeftModel

# Load base model
model = SentenceTransformer("BAAI/bge-code-v1")
model.max_seq_length = 512

# Apply LoRA adapter
model[0].auto_model = PeftModel.from_pretrained(
    model[0].auto_model, "Travis1282/octodex-v1"
)
model[0].auto_model.eval()

# Encode
query = "Fix authentication failure when session token expires during long API calls"
code = "def validate_session(token, max_age=3600):\n    payload = jwt.decode(token)\n    return time.time() - payload['iat'] < max_age"
embeddings = model.encode([query, code])
similarity = embeddings[0] @ embeddings[1]
print(f"Similarity: {similarity:.4f}")
```
### Python (ONNX Runtime, no GPU required)
```python
import onnxruntime as ort
import numpy as np
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
tokenizer.enable_truncation(max_length=512)
tokenizer.enable_padding(length=512, pad_id=0)

session = ort.InferenceSession("onnx/model-int8.onnx")

encoding = tokenizer.encode("your query or code here")
outputs = session.run(None, {
    "input_ids": np.array([encoding.ids], dtype=np.int64),
    "attention_mask": np.array([encoding.attention_mask], dtype=np.int64),
    "position_ids": np.arange(512, dtype=np.int64).reshape(1, -1),
})

# Mean pool over valid tokens + L2 normalize
hidden = outputs[0]
mask = np.array(encoding.attention_mask, dtype=np.float32)
pooled = (hidden[0] * mask[:, None]).sum(0) / mask.sum()
embedding = pooled / np.linalg.norm(pooled)
```
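Whichever path produces the embeddings, retrieval itself reduces to a dot product over L2-normalized vectors. A toy sketch with made-up 2-D vectors and hypothetical file paths (real embeddings are 1536-dimensional):

```python
import numpy as np

# Illustrative unit-norm embeddings; paths are hypothetical
query_emb = np.array([0.6, 0.8])
chunk_embs = np.array([[0.8, 0.6],
                       [0.6, 0.8],
                       [1.0, 0.0]])
paths = ["auth/session.py", "utils/io.py", "cli/main.py"]

# Dot product == cosine similarity for unit-norm vectors
scores = chunk_embs @ query_emb
for i in np.argsort(-scores):
    print(f"{scores[i]:.3f}  {paths[i]}")
```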
### Rust (via ort crate)
The `octosearch` crate in the octoblack workspace provides a Rust inference implementation using `ort` (ONNX Runtime bindings) and `tokenizers`. See `crates/octosearch/` for the full implementation.
## Limitations
- Trained on SWE-bench data which skews toward popular open-source Python/Go/JS/TS repositories
- Max sequence length of 512 tokens — longer code chunks are truncated
- INT8 quantization introduces ~1% cosine similarity variation between different batch sizes (dynamic quantization artifact, does not affect retrieval ranking)
- The model retrieves files likely to need modification, not the exact edit locations within those files
## Citation
If you use this model, please cite:
```bibtex
@misc{octodex-v1,
  title={octodex-v1: Fine-tuned Code Retrieval for AI Coding Agents},
  author={Travis Clark},
  year={2026},
  url={https://huggingface.co/Travis1282/octodex-v1}
}
```