
PathoPreter: A Parameter-Efficient Clinical Support Model for SNV Risk Flagging


โš ๏ธ CRITICAL DISCLAIMER

PathoPreter is a clinical research tool for risk prioritization, NOT a diagnostic device.

  • DO NOT use this model to confirm or rule out a medical diagnosis.
  • DO NOT use this model to determine medical treatment.
  • DO NOT use this model as a replacement for ACMG guidelines or expert review.

The model outputs "High/Low Pathogenic Indication" signals intended solely to help clinicians prioritize variants for further manual investigation.


🔬 Model Overview

PathoPreter-4B-SNV is a specialized Large Language Model (LLM) fine-tuned to screen Single Nucleotide Variants (SNVs) for pathogenicity. Unlike generalist biomedical models, PathoPreter focuses on domain saturation: learning a dense representation of variant risk factors from over 1.1 million ClinVar records.

  • Developer: Rohit Yadav (NIT Jalandhar)
  • Base Architecture: Qwen-3 Instruct 4B
  • Training Method: Low-Rank Adaptation (LoRA) via Unsloth on NVIDIA A100
  • Input Data: HGVS variant strings + Gene context + Associated Conditions.
  • Output: Semantic Risk Flag (High/Low Indication).

Key Features

  • Variant-Level Isolation: Trained with strict HGVS separation so the model learns biological patterns rather than memorizing training data.
  • Deployable: Runs offline on consumer hardware (8-12GB VRAM) via GGUF.
  • Safety-Aligned: Outputs are structurally constrained to prevent diagnostic overreach.

📊 Performance & Benchmarking

The model was evaluated on a strictly isolated test set of 55,376 unseen variants. To prevent data leakage, we enforced HGVS-level isolation, ensuring no variant string from the training set appeared in the evaluation set.
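The HGVS-level isolation described above can be sketched as a split over unique variant strings rather than over rows, so the same variant can never land in both partitions. This is an illustrative sketch, not the released pipeline; the `hgvs`/`label` field names are assumptions.

```python
# Illustrative sketch (assumed field names, not the actual pipeline):
# split on unique HGVS strings so no variant appears in both partitions.
import random

def hgvs_isolated_split(records, test_frac=0.1, seed=42):
    """Split records so that no HGVS string is shared between train and test."""
    variants = sorted({r["hgvs"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(variants)
    n_test = int(len(variants) * test_frac)
    test_variants = set(variants[:n_test])
    train = [r for r in records if r["hgvs"] not in test_variants]
    test = [r for r in records if r["hgvs"] in test_variants]
    return train, test

# Toy data: 100 unique variants
records = [{"hgvs": f"NM_0000{i}.1:c.{i}A>G", "label": i % 2} for i in range(100)]
train, test = hgvs_isolated_split(records)
assert not ({r["hgvs"] for r in train} & {r["hgvs"] for r in test})
```

Splitting on the variant string (rather than on rows) is what prevents near-duplicate records of the same variant from leaking label information into the test set.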

Core Metrics (vs. Base Model)

| Metric | PathoPreter-4B | Raw Base Model (Qwen-2.5-7B) | Clinical Implication |
| --- | --- | --- | --- |
| Pathogenic Recall | 94.0% | 0% | Flags 94% of high-risk variants; the raw model labeled every variant Benign and caught none. |
| Benign Specificity | 99.2% | 100% | Rarely flags risk on safe variants; the raw model's 100% is an artifact of always predicting Benign. |
| Overall Accuracy | 98.57% | 87%* | High reliability across the full distribution; the raw model's 87% comes from class imbalance (the test set is mostly benign), not from detecting any pathogenic variant. |

*Note: The raw base model achieved 87% accuracy simply by predicting "Benign" for everything, failing to catch a single pathogenic case. PathoPreter's accuracy reflects actual signal detection.
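The accuracy-vs-recall point in the note can be made concrete with the standard confusion-matrix definitions. The counts below are illustrative, chosen only to reproduce the 87% majority-class baseline:

```python
# Standard confusion-matrix metrics; the counts below are illustrative,
# not the actual evaluation counts from the 55,376-variant test set.
def metrics(tp, fn, tn, fp):
    recall = tp / (tp + fn)              # pathogenic recall (sensitivity)
    specificity = tn / (tn + fp)         # benign specificity
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return recall, specificity, accuracy

# An "always Benign" baseline on an 87%-benign sample: tp = 0, fp = 0.
r, s, a = metrics(tp=0, fn=130, tn=870, fp=0)
print(f"baseline: recall={r:.1%} specificity={s:.1%} accuracy={a:.1%}")
# baseline: recall=0.0% specificity=100.0% accuracy=87.0%
```

This is why recall on the pathogenic class, not overall accuracy, is the headline metric for a risk-flagging tool.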

Industry Comparison (vs. CADD)

On a subset of 1,937 variants, PathoPreter was benchmarked against CADD (PHRED ≥ 20), a standard bioinformatics tool.

  • CADD Recall: ~99.0%
  • PathoPreter Recall: ~94.0%

Result: PathoPreter achieves sensitivity comparable to established algorithmic tools while operating entirely on text-based clinical metadata, without requiring complex evolutionary conservation pipelines.


๐Ÿ› ๏ธ Usage

Method 1: Python (Unsloth / Transformers)

Use this if you are a developer wanting to run the full LoRA adapters.

from unsloth import FastLanguageModel

# 1. Load the model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "YADAV0206/Qwen-3-4B-finetuned-PathoPreter-Rohit",
    max_seq_length = 2048,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)

# 2. Define the input prompt
prompt = """
Variant: NM_000059.4(BRCA2):c.8499G>A (p.Lys2833=)
Associated conditions: Hereditary breast ovarian cancer syndrome
### Response:
"""

# 3. Inference
inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 12)
print(tokenizer.batch_decode(outputs))
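`batch_decode` returns the raw generated text, so a small post-processing step is useful to map it onto the model card's two semantic flags. A minimal sketch, assuming the output contains one of the documented "High/Low Pathogenic Indication" strings:

```python
# Minimal sketch: map raw generated text onto the documented semantic flags.
# Anything else falls through to "Unparseable" for manual review.
def parse_flag(generated: str) -> str:
    # Keep only the text after the response marker used in the prompt.
    text = generated.split("### Response:")[-1]
    if "High Pathogenic Indication" in text:
        return "High Pathogenic Indication"
    if "Low Pathogenic Indication" in text:
        return "Low Pathogenic Indication"
    return "Unparseable"

assert parse_flag("...### Response:\nLow Pathogenic Indication") == "Low Pathogenic Indication"
```

Treating unexpected outputs as "Unparseable" rather than defaulting to a class keeps the safety posture conservative.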

Method 2: Local Inference (LM Studio / Ollama)

This repo includes GGUF files for easy local use.

1. Download qwen3-4b-instruct-2507.Q4_K_M.gguf.
2. Load it into LM Studio or Ollama.
3. Set the system prompt: "You are an expert genetic variant classifier. Classify variants as Pathogenic or Benign based on the input."
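If the GGUF is served through Ollama, requests can be scripted against its local REST API (`POST http://localhost:11434/api/generate`). This is a hypothetical sketch: the local model name "pathopreter" and the prompt layout are assumptions, and the actual HTTP call is left to the reader since it requires a running Ollama server.

```python
# Hypothetical sketch of a request body for Ollama's /api/generate endpoint.
# "pathopreter" is an assumed local model name, not part of this repo.
import json

SYSTEM_PROMPT = ("You are an expert genetic variant classifier. "
                 "Classify variants as Pathogenic or Benign based on the input.")

def build_request(variant: str, conditions: str) -> dict:
    return {
        "model": "pathopreter",          # assumed name given at `ollama create`
        "system": SYSTEM_PROMPT,
        "prompt": (f"Variant: {variant}\n"
                   f"Associated conditions: {conditions}\n"
                   "### Response:\n"),
        "stream": False,                 # return one JSON object, not a stream
    }

payload = build_request(
    "NM_000059.4(BRCA2):c.8499G>A (p.Lys2833=)",
    "Hereditary breast ovarian cancer syndrome",
)
# POST json.dumps(payload) to http://localhost:11434/api/generate when Ollama is running.
print(json.dumps(payload, indent=2))
```

The same payload shape works for any GGUF imported into Ollama; only the `model` field changes.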


📂 Training Details

Dataset Construction

The training data was derived from a snapshot of ClinVar (NCBI) containing 1.1 million SNV records.

  • Filtered For: Single Nucleotide Variants (SNVs) only.
  • Labels: Binary mapping (Pathogenic/Likely Pathogenic → 1, Benign/Likely Benign → 0).
  • Exclusions: VUS (Variants of Uncertain Significance), conflicting interpretations, and incomplete records were removed.
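The filtering and labeling rules above can be sketched as a single record-level function. The field names (`variant_type`, `clinical_significance`, `hgvs`) are assumptions for illustration, not the actual ClinVar column names used in training.

```python
# Illustrative sketch of the filtering/labeling rules (assumed field names,
# not the actual ClinVar columns used to build the dataset).
LABEL_MAP = {
    "Pathogenic": 1, "Likely pathogenic": 1,
    "Benign": 0, "Likely benign": 0,
}

def prepare(record):
    """Keep SNVs with a clean binary label; drop VUS/conflicting/incomplete."""
    if record.get("variant_type") != "single nucleotide variant":
        return None                      # SNVs only
    label = LABEL_MAP.get(record.get("clinical_significance"))
    if label is None or not record.get("hgvs"):
        return None                      # VUS, conflicting, or incomplete
    return {"hgvs": record["hgvs"], "label": label}

assert prepare({"variant_type": "single nucleotide variant",
                "clinical_significance": "Likely benign",
                "hgvs": "NM_000546.6:c.215C>G"}) == {
    "hgvs": "NM_000546.6:c.215C>G", "label": 0}
assert prepare({"variant_type": "Deletion",
                "clinical_significance": "Pathogenic",
                "hgvs": "x"}) is None
```

Note that anything outside the four mapped significance values is dropped rather than guessed, which is what keeps the labels binary and clean.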

Semantic Output Layer

To prevent misuse, the model maps binary internal predictions to safe clinical language:

  • 1 (Internal) → "High Pathogenic Indication"
  • 0 (Internal) → "Low Pathogenic Indication"
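The internal-to-clinical mapping described above, as a minimal sketch (the function name is for illustration only):

```python
# Sketch of the semantic output layer: internal binary label -> safe clinical phrase.
SEMANTIC_LABELS = {
    1: "High Pathogenic Indication",
    0: "Low Pathogenic Indication",
}

def to_clinical(pred: int) -> str:
    """Map an internal 0/1 prediction to the documented clinical phrasing."""
    return SEMANTIC_LABELS[pred]

print(to_clinical(1))  # High Pathogenic Indication
```

Keeping the phrasing out of the model's free-form output and in a fixed lookup is what makes the flags structurally constrained.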


🛑 Limitations

  • Scope: Validated only for SNVs. Performance on Indels, CNVs, or Structural Variants is unknown and likely poor.
  • Binary Output: The model does not currently handle the complexity of VUS (Variants of Uncertain Significance).
  • Hallucination: Like all LLMs, the model can hallucinate. It is not a clinical diagnostic tool.


🔗 Resources

GitHub Repository: YADAV1825/PathoPreter

ClinVar Database: NCBI ClinVar
