# octodex-v1
A fine-tuned code retrieval embedding model optimized for locating files that need to be modified to resolve a software engineering task. Built for AI coding agents targeting SWE-bench.
## What This Model Does
Given a natural-language description of a bug or feature request, octodex-v1 retrieves the source code files most likely to require modification. Unlike general-purpose code search models that optimize for "relevance," octodex-v1 is trained on helpfulness — which files a developer actually needed to edit to resolve the issue.
This distinction matters. A generic code search model might surface the file that mentions the error, but octodex-v1 surfaces the file that fixes it.
## Performance
Evaluated on SWE-bench Verified (500 instances, held out during training):
| Metric | Baseline (unfinetuned bge-code-v1) | octodex-v1 | Improvement |
|---|---|---|---|
| MRR | 0.1002 | 0.6377 | +537% |
| MAP | 0.0848 | 0.5631 | +564% |
| Recall@1 | 0.031 | 0.368 | +12x |
| Recall@5 | 0.094 | 0.690 | +7.3x |
| Recall@10 | 0.152 | 0.789 | +5.2x |
| Recall@20 | 0.251 | 0.865 | +3.4x |
| Accuracy@1 | 0.040 | 0.496 | +12.4x |
| Accuracy@10 | 0.218 | 0.888 | +4.1x |
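For reference, the ranking metrics above can be computed per instance as in this illustrative sketch (not the actual evaluation harness), where each instance has a ranked list of retrieved files and a gold set of files edited in the patch:

```python
def mrr(ranked_lists, gold_sets):
    """Mean Reciprocal Rank: average of 1/rank of the first gold file."""
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_sets):
        for rank, path in enumerate(ranked, start=1):
            if path in gold:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def recall_at_k(ranked_lists, gold_sets, k):
    """Fraction of gold files found in the top-k, averaged over instances."""
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_sets):
        total += len(gold & set(ranked[:k])) / len(gold)
    return total / len(ranked_lists)

# Toy example with two instances (hypothetical file names)
ranked = [["a.py", "b.py", "c.py"], ["x.py", "y.py", "z.py"]]
gold = [{"b.py"}, {"x.py", "z.py"}]
print(mrr(ranked, gold))             # (1/2 + 1/1) / 2 = 0.75
print(recall_at_k(ranked, gold, 2))  # (1/1 + 1/2) / 2 = 0.75
```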
## Model Details
- Base model: BAAI/bge-code-v1 (2B params, Qwen2 backbone, 1536-dim embeddings)
- Fine-tuning: LoRA (rank=16, alpha=16, targets=q_proj+v_proj, dropout=0.1)
- Trainable params: 2.18M (0.14% of total)
- Loss: MultipleNegativesRankingLoss (InfoNCE) with in-batch negatives + explicit hard negatives
- Training data: 75,193 triples from 21,302 SWE-bench instances (train+test splits)
- Eval data: 948 triples from 500 SWE-bench Verified instances
- Max sequence length: 512 tokens
- Training hardware: Apple M4 Pro (64GB unified, 20-core GPU, MPS)
- Training time: 229.4 hours (9.6 days), 7,050 steps, 3 epochs
- Best checkpoint: Step 6,000
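The loss can be sketched in NumPy, assuming the standard InfoNCE formulation that MultipleNegativesRankingLoss implements (the temperature value here is illustrative; the scale actually used in training is not stated in this card):

```python
import numpy as np

def info_nce_loss(q, p, temperature=0.05):
    """InfoNCE with in-batch negatives.

    q: (batch, dim) L2-normalized query embeddings.
    p: (batch, dim) L2-normalized positive embeddings; row i is the
       positive for query i, and every other row acts as a negative.
       (Explicit hard negatives would appear as extra rows of p.)
    """
    logits = (q @ p.T) / temperature                 # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_probs)))       # gold label is the diagonal
```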
## Training Data Construction
Each training triple is:
- Query: Issue title + body (the bug report / feature request)
- Positive: Source code chunk from a file that was actually modified in the gold patch
- Negative: Source code chunk from the same repo that was NOT modified (hard negative, selected by file proximity)
Source code is chunked at function/class boundaries using tree-sitter (Python, Go, JavaScript, TypeScript).
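The chunking step can be approximated with a stdlib-only sketch. This uses Python's `ast` module as a simplified stand-in for tree-sitter (which additionally covers Go, JavaScript, and TypeScript), splitting at top-level function and class boundaries:

```python
import ast

def chunk_python_source(source: str) -> list[str]:
    """Split Python source into chunks at top-level function/class
    boundaries. Simplified stand-in for the tree-sitter chunker."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and inclusive
            chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks

src = "def f():\n    return 1\n\nclass C:\n    pass\n"
for chunk in chunk_python_source(src):
    print(chunk)
    print("---")
```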
### What's NOT in the training data
SWE-bench Pro (731 instances) was never used for training or evaluation. It is the held-out test set for downstream agent evaluation. There is zero repository overlap between training and Pro instances.
## Provided Artifacts
| File | Size | Description |
|---|---|---|
| `adapter_model.safetensors` | ~8.7 MB | LoRA adapter weights (apply to the bge-code-v1 base) |
| `adapter_config.json` | ~1 KB | PEFT/LoRA configuration |
| `onnx/model-int8.onnx` | ~1.4 GB | INT8-quantized ONNX model (merged base + LoRA, self-contained) |
| `tokenizer.json` | ~11 MB | HuggingFace fast tokenizer |
| `config.json` | ~1.3 KB | Model configuration |
## Usage
### Python (sentence-transformers + PEFT)
```python
from sentence_transformers import SentenceTransformer
from peft import PeftModel

# Load base model
model = SentenceTransformer("BAAI/bge-code-v1")
model.max_seq_length = 512

# Apply LoRA adapter
model[0].auto_model = PeftModel.from_pretrained(
    model[0].auto_model, "Travis1282/octodex-v1"
)
model[0].auto_model.eval()

# Encode
query = "Fix authentication failure when session token expires during long API calls"
code = "def validate_session(token, max_age=3600):\n    payload = jwt.decode(token)\n    return time.time() - payload['iat'] < max_age"
embeddings = model.encode([query, code])
similarity = embeddings[0] @ embeddings[1]
print(f"Similarity: {similarity:.4f}")
```
### Python (ONNX Runtime, no GPU required)
```python
import onnxruntime as ort
import numpy as np
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
tokenizer.enable_truncation(max_length=512)
tokenizer.enable_padding(length=512, pad_id=0)

session = ort.InferenceSession("onnx/model-int8.onnx")

encoding = tokenizer.encode("your query or code here")
outputs = session.run(None, {
    "input_ids": np.array([encoding.ids], dtype=np.int64),
    "attention_mask": np.array([encoding.attention_mask], dtype=np.int64),
    "position_ids": np.arange(512, dtype=np.int64).reshape(1, -1),
})

# Mean pool over valid tokens + L2 normalize
hidden = outputs[0]
mask = np.array(encoding.attention_mask, dtype=np.float32)
pooled = (hidden[0] * mask[:, None]).sum(0) / mask.sum()
embedding = pooled / np.linalg.norm(pooled)
```
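Whichever path produces the embeddings, retrieval itself reduces to a dot product over L2-normalized vectors. A toy sketch with made-up 2-D vectors and hypothetical file paths (real embeddings are 1536-dimensional):

```python
import numpy as np

# Illustrative unit-norm embeddings; paths are hypothetical
query_emb = np.array([0.6, 0.8])
chunk_embs = np.array([[0.8, 0.6],
                       [0.6, 0.8],
                       [1.0, 0.0]])
paths = ["auth/session.py", "utils/io.py", "cli/main.py"]

# Dot product == cosine similarity for unit-norm vectors
scores = chunk_embs @ query_emb
for i in np.argsort(-scores):
    print(f"{scores[i]:.3f}  {paths[i]}")
```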
### Rust (via ort crate)
The `octosearch` crate in the octoblack workspace provides a Rust inference implementation using `ort` (ONNX Runtime bindings) and `tokenizers`. See `crates/octosearch/` for the full implementation.
## Limitations
- Trained on SWE-bench data which skews toward popular open-source Python/Go/JS/TS repositories
- Max sequence length of 512 tokens — longer code chunks are truncated
- INT8 quantization introduces ~1% cosine similarity variation between different batch sizes (dynamic quantization artifact, does not affect retrieval ranking)
- The model retrieves files likely to need modification, not the exact edit locations within those files
## Citation
If you use this model, please cite:
```bibtex
@misc{octodex-v1,
  title={octodex-v1: Fine-tuned Code Retrieval for AI Coding Agents},
  author={Travis Clark},
  year={2026},
  url={https://huggingface.co/Travis1282/octodex-v1}
}
```