# octodex-v1

A fine-tuned code retrieval embedding model optimized for locating files that need to be modified to resolve a software engineering task. Built for AI coding agents targeting SWE-bench.

## What This Model Does

Given a natural-language description of a bug or feature request, octodex-v1 retrieves the source code files most likely to require modification. Unlike general-purpose code search models that optimize for "relevance," octodex-v1 is trained on helpfulness — which files a developer actually needed to edit to resolve the issue.

This distinction matters. A generic code search model might surface the file that mentions the error, but octodex-v1 surfaces the file that fixes it.

## Performance

Evaluated on SWE-bench Verified (500 instances, held out during training):

| Metric | Baseline (unfinetuned bge-code-v1) | octodex-v1 | Improvement |
|---|---|---|---|
| MRR | 0.1002 | 0.6377 | +537% |
| MAP | 0.0848 | 0.5631 | +564% |
| Recall@1 | 0.031 | 0.368 | +12x |
| Recall@5 | 0.094 | 0.690 | +7.3x |
| Recall@10 | 0.152 | 0.789 | +5.2x |
| Recall@20 | 0.251 | 0.865 | +3.4x |
| Accuracy@1 | 0.040 | 0.496 | +12.4x |
| Accuracy@10 | 0.218 | 0.888 | +4.1x |
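For readers unfamiliar with these ranking metrics, here is a sketch of one common way to compute MRR and Recall@k for file retrieval. The file names and gold sets are hypothetical examples, not drawn from the actual evaluation:

```python
def mrr(ranked_lists, gold_sets):
    """Mean reciprocal rank of the first gold file in each ranking."""
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_sets):
        for rank, f in enumerate(ranked, start=1):
            if f in gold:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def recall_at_k(ranked_lists, gold_sets, k):
    """Fraction of gold files found in the top k, averaged per instance."""
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_sets):
        total += len(gold & set(ranked[:k])) / len(gold)
    return total / len(ranked_lists)

# Two hypothetical instances: gold file at rank 2, then at rank 1
ranked_lists = [["a.py", "b.py", "c.py"], ["x.py", "y.py", "z.py"]]
gold_sets = [{"b.py"}, {"x.py", "z.py"}]
print(mrr(ranked_lists, gold_sets))             # (1/2 + 1/1) / 2 = 0.75
print(recall_at_k(ranked_lists, gold_sets, 2))  # (1.0 + 0.5) / 2 = 0.75
```

Accuracy@k (any gold file in the top k) and MAP follow the same per-instance averaging pattern.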

## Model Details

- Base model: BAAI/bge-code-v1 (2B params, Qwen2 backbone, 1536-dim embeddings)
- Fine-tuning: LoRA (rank=16, alpha=16, targets=q_proj+v_proj, dropout=0.1)
- Trainable params: 2.18M (0.14% of total)
- Loss: MultipleNegativesRankingLoss (InfoNCE) with in-batch negatives + explicit hard negatives
- Training data: 75,193 triples from 21,302 SWE-bench instances (train+test splits)
- Eval data: 948 triples from 500 SWE-bench Verified instances
- Max sequence length: 512 tokens
- Training hardware: Apple M4 Pro (64 GB unified memory, 20-core GPU, MPS)
- Training time: 229.4 hours (9.6 days), 7,050 steps, 3 epochs
- Best checkpoint: step 6,000
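The LoRA hyperparameters above map directly onto a PEFT configuration. A sketch assuming the standard `peft` library API; this is not the actual training script:

```python
from peft import LoraConfig

# Config fragment mirroring the hyperparameters listed above
# (rank=16, alpha=16, targets=q_proj+v_proj, dropout=0.1).
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
    task_type="FEATURE_EXTRACTION",
)
```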

## Training Data Construction

Each training triple consists of:

- Query: issue title + body (the bug report / feature request)
- Positive: a source code chunk from a file that was actually modified in the gold patch
- Negative: a source code chunk from the same repo that was NOT modified (a hard negative, selected by file proximity)

Source code is chunked at function/class boundaries using tree-sitter (Python, Go, JavaScript, TypeScript).
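As a simplified illustration of boundary-aware chunking: the pipeline uses tree-sitter across four languages, but the same idea can be sketched for Python files alone with the standard-library `ast` module (a stand-in, not the actual implementation):

```python
import ast

def chunk_python_source(source: str) -> list[str]:
    """Split a Python file into one chunk per top-level function/class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and inclusive
            chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks

src = "import os\n\ndef f():\n    return 1\n\nclass C:\n    def m(self):\n        return 2\n"
chunks = chunk_python_source(src)
print(len(chunks))  # 2: one chunk for f(), one for class C
```

Chunking at definition boundaries keeps each embedded unit semantically coherent, rather than slicing mid-function at a fixed token count.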

### What's NOT in the training data

SWE-bench Pro (731 instances) was never used for training or evaluation. It is the held-out test set for downstream agent evaluation. There is zero repository overlap between training and Pro instances.

## Provided Artifacts

| File | Size | Description |
|---|---|---|
| `adapter_model.safetensors` | ~8.7 MB | LoRA adapter weights (apply to bge-code-v1 base) |
| `adapter_config.json` | ~1 KB | PEFT/LoRA configuration |
| `onnx/model-int8.onnx` | ~1.4 GB | INT8-quantized ONNX model (merged base + LoRA, self-contained) |
| `tokenizer.json` | ~11 MB | HuggingFace fast tokenizer |
| `config.json` | ~1.3 KB | Model configuration |

## Usage

### Python (sentence-transformers + PEFT)

```python
from sentence_transformers import SentenceTransformer
from peft import PeftModel

# Load the base embedding model
model = SentenceTransformer("BAAI/bge-code-v1")
model.max_seq_length = 512

# Apply the LoRA adapter
model[0].auto_model = PeftModel.from_pretrained(
    model[0].auto_model, "Travis1282/octodex-v1"
)
model[0].auto_model.eval()

# Encode a query and a code chunk
query = "Fix authentication failure when session token expires during long API calls"
code = "def validate_session(token, max_age=3600):\n    payload = jwt.decode(token)\n    return time.time() - payload['iat'] < max_age"

embeddings = model.encode([query, code])
similarity = embeddings[0] @ embeddings[1]
print(f"Similarity: {similarity:.4f}")
```
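In an agent loop, the encoding step above would be run over every chunk in the repository and the results ranked by similarity. A minimal stdlib sketch of that ranking step, with hypothetical file names and placeholder vectors standing in for real `model.encode` output (the real embeddings are L2-normalized, so a dot product equals cosine similarity):

```python
def dot(a, b):
    """Dot product; equals cosine similarity for unit-norm vectors."""
    return sum(x * y for x, y in zip(a, b))

# Placeholder 2-d vectors standing in for 1536-dim embeddings
query_emb = [0.6, 0.8]
candidates = {
    "auth/session.py": [0.6, 0.8],  # hypothetical chunk embeddings
    "auth/tokens.py":  [0.0, 1.0],
    "docs/conf.py":    [1.0, 0.0],
}

# Rank candidate files, best match first
ranking = sorted(candidates, key=lambda f: dot(query_emb, candidates[f]), reverse=True)
print(ranking[0])  # auth/session.py
```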

### Python (ONNX Runtime, no GPU required)

```python
import onnxruntime as ort
import numpy as np
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
tokenizer.enable_truncation(max_length=512)
tokenizer.enable_padding(length=512, pad_id=0)

session = ort.InferenceSession("onnx/model-int8.onnx")

encoding = tokenizer.encode("your query or code here")
outputs = session.run(None, {
    "input_ids": np.array([encoding.ids], dtype=np.int64),
    "attention_mask": np.array([encoding.attention_mask], dtype=np.int64),
    "position_ids": np.arange(512, dtype=np.int64).reshape(1, -1),
})

# Mean pool over non-padding tokens, then L2-normalize
hidden = outputs[0]
mask = np.array(encoding.attention_mask, dtype=np.float32)
pooled = (hidden[0] * mask[:, None]).sum(0) / mask.sum()
embedding = pooled / np.linalg.norm(pooled)
```

### Rust (via ort crate)

The `octosearch` crate in the octoblack workspace provides a Rust inference implementation using `ort` (ONNX Runtime bindings) and `tokenizers`. See `crates/octosearch/` for the full implementation.

## Limitations

- Trained on SWE-bench data, which skews toward popular open-source Python/Go/JS/TS repositories
- Max sequence length of 512 tokens; longer code chunks are truncated
- INT8 quantization introduces ~1% cosine-similarity variation across batch sizes (a dynamic-quantization artifact; it does not affect retrieval ranking)
- The model retrieves files likely to need modification, not the exact edit locations within those files
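One common mitigation for the 512-token limit, assumed here for illustration and not part of this model's pipeline, is to split oversized chunks into overlapping token windows and keep the best-scoring window per file:

```python
def window_ids(token_ids, max_len=512, stride=256):
    """Split a token-id sequence into overlapping fixed-size windows."""
    if len(token_ids) <= max_len:
        return [token_ids]
    windows = []
    for start in range(0, len(token_ids) - stride, stride):
        windows.append(token_ids[start : start + max_len])
    return windows

ids = list(range(1000))          # a chunk twice the model's limit
windows = window_ids(ids)
print(len(windows))              # 3 overlapping windows cover all 1000 ids
```

The stride controls the overlap; a stride of half the window length ensures no function body is entirely cut at a boundary.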

## Citation

If you use this model, please cite:

```bibtex
@misc{octodex-v1,
  title={octodex-v1: Fine-tuned Code Retrieval for AI Coding Agents},
  author={Travis Clark},
  year={2026},
  url={https://huggingface.co/Travis1282/octodex-v1}
}
```