
PathoPreter: A Parameter-Efficient Clinical Support Model for SNV Risk Flagging


โš ๏ธ CRITICAL DISCLAIMER

PathoPreter is a clinical research tool for risk prioritization, NOT a diagnostic device.

  • DO NOT use this model to confirm or rule out a medical diagnosis.
  • DO NOT use this model to determine medical treatment.
  • DO NOT use this model as a replacement for ACMG guidelines or expert review.

The model outputs "High/Low Pathogenic Indication" signals intended solely to help clinicians prioritize variants for further manual investigation.


🔬 Model Overview

PathoPreter-4B-SNV is a specialized Large Language Model (LLM) fine-tuned to screen Single Nucleotide Variants (SNVs) for pathogenicity. Unlike generalist biomedical models, PathoPreter focuses on domain saturation: learning a dense representation of variant risk factors from over 1.1 million ClinVar records.

  • Developer: Rohit Yadav (NIT Jalandhar)
  • Base Architecture: Qwen-3 Instruct 4B
  • Training Method: Low-Rank Adaptation (LoRA) via Unsloth on NVIDIA A100
  • Input Data: HGVS variant strings + Gene context + Associated Conditions.
  • Output: Semantic Risk Flag (High/Low Indication).

Key Features

  • Variant-Level Isolation: Trained with strict HGVS separation so the model learns biological patterns rather than memorizing training data.
  • Deployable: Runs offline on consumer hardware (8-12GB VRAM) via GGUF.
  • Safety-Aligned: Outputs are structurally constrained to prevent diagnostic overreach.

📊 Performance & Benchmarking

The model was evaluated on a strictly isolated test set of 55,376 unseen variants. To prevent data leakage, we enforced HGVS-level isolation, ensuring no variant string from the training set appeared in the evaluation set.
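The HGVS-level isolation described above can be sketched as a split over unique variant strings rather than over rows, so the same variant can never land in both partitions. This is an illustrative sketch, not the released pipeline; the `hgvs`/`label` field names are assumptions.

```python
# Illustrative sketch (assumed field names, not the actual pipeline):
# split on unique HGVS strings so no variant appears in both partitions.
import random

def hgvs_isolated_split(records, test_frac=0.1, seed=42):
    """Split records so that no HGVS string is shared between train and test."""
    variants = sorted({r["hgvs"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(variants)
    n_test = int(len(variants) * test_frac)
    test_variants = set(variants[:n_test])
    train = [r for r in records if r["hgvs"] not in test_variants]
    test = [r for r in records if r["hgvs"] in test_variants]
    return train, test

# Toy data: 100 unique variants
records = [{"hgvs": f"NM_0000{i}.1:c.{i}A>G", "label": i % 2} for i in range(100)]
train, test = hgvs_isolated_split(records)
assert not ({r["hgvs"] for r in train} & {r["hgvs"] for r in test})
```

Splitting on the variant string (rather than on rows) is what prevents near-duplicate records of the same variant from leaking label information into the test set.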

Core Metrics (vs. Base Model)

| Metric | PathoPreter-4B | Raw Base Model (Qwen-2.5-7B) | Clinical Implication |
| --- | --- | --- | --- |
| Pathogenic Recall | 94.0% | 0% | Flags 94% of high-risk variants; the raw model labeled every variant Benign and caught none. |
| Benign Specificity | 99.2% | 100% | Rarely flags risk on safe variants; the raw model's 100% is an artifact of always predicting Benign. |
| Overall Accuracy | 98.57% | 87%* | High reliability across the full distribution; the raw model's 87% comes from class imbalance (the test set is mostly benign), not from detecting any pathogenic variant. |

*Note: The raw base model achieved 87% accuracy simply by predicting "Benign" for everything, failing to catch a single pathogenic case. PathoPreter's accuracy reflects actual signal detection.
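The accuracy-vs-recall point in the note can be made concrete with the standard confusion-matrix definitions. The counts below are illustrative, chosen only to reproduce the 87% majority-class baseline:

```python
# Standard confusion-matrix metrics; the counts below are illustrative,
# not the actual evaluation counts from the 55,376-variant test set.
def metrics(tp, fn, tn, fp):
    recall = tp / (tp + fn)              # pathogenic recall (sensitivity)
    specificity = tn / (tn + fp)         # benign specificity
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return recall, specificity, accuracy

# An "always Benign" baseline on an 87%-benign sample: tp = 0, fp = 0.
r, s, a = metrics(tp=0, fn=130, tn=870, fp=0)
print(f"baseline: recall={r:.1%} specificity={s:.1%} accuracy={a:.1%}")
# baseline: recall=0.0% specificity=100.0% accuracy=87.0%
```

This is why recall on the pathogenic class, not overall accuracy, is the headline metric for a risk-flagging tool.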

Industry Comparison (vs. CADD)

On a subset of 1,937 variants, PathoPreter was benchmarked against CADD (PHRED ≥ 20), a standard bioinformatics tool.

  • CADD Recall: ~99.0%
  • PathoPreter Recall: ~94.0%

Result: PathoPreter achieves sensitivity comparable to established algorithmic tools while operating entirely on text-based clinical metadata, without requiring complex evolutionary conservation pipelines.


๐Ÿ› ๏ธ Usage

Method 1: Python (Unsloth / Transformers)

Use this if you are a developer wanting to run the full LoRA adapters.

from unsloth import FastLanguageModel

# 1. Load the model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "YADAV0206/Qwen-3-4B-finetuned-PathoPreter-Rohit",
    max_seq_length = 2048,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)

# 2. Define the input prompt
prompt = """
Variant: NM_000059.4(BRCA2):c.8499G>A (p.Lys2833=)
Associated conditions: Hereditary breast ovarian cancer syndrome
### Response:
"""

# 3. Inference
inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 12)
print(tokenizer.batch_decode(outputs))
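`batch_decode` returns the raw generated text, so a small post-processing step is useful to map it onto the model card's two semantic flags. A minimal sketch, assuming the output contains one of the documented "High/Low Pathogenic Indication" strings:

```python
# Minimal sketch: map raw generated text onto the documented semantic flags.
# Anything else falls through to "Unparseable" for manual review.
def parse_flag(generated: str) -> str:
    # Keep only the text after the response marker used in the prompt.
    text = generated.split("### Response:")[-1]
    if "High Pathogenic Indication" in text:
        return "High Pathogenic Indication"
    if "Low Pathogenic Indication" in text:
        return "Low Pathogenic Indication"
    return "Unparseable"

assert parse_flag("...### Response:\nLow Pathogenic Indication") == "Low Pathogenic Indication"
```

Treating unexpected outputs as "Unparseable" rather than defaulting to a class keeps the safety posture conservative.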

Method 2: Local Inference (LM Studio / Ollama)

This repo includes GGUF files for easy local use.

1. Download qwen3-4b-instruct-2507.Q4_K_M.gguf.
2. Load it into LM Studio or Ollama.
3. Set the system prompt: "You are an expert genetic variant classifier. Classify variants as Pathogenic or Benign based on the input."
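If the GGUF is served through Ollama, requests can be scripted against its local REST API (`POST http://localhost:11434/api/generate`). This is a hypothetical sketch: the local model name "pathopreter" and the prompt layout are assumptions, and the actual HTTP call is left to the reader since it requires a running Ollama server.

```python
# Hypothetical sketch of a request body for Ollama's /api/generate endpoint.
# "pathopreter" is an assumed local model name, not part of this repo.
import json

SYSTEM_PROMPT = ("You are an expert genetic variant classifier. "
                 "Classify variants as Pathogenic or Benign based on the input.")

def build_request(variant: str, conditions: str) -> dict:
    return {
        "model": "pathopreter",          # assumed name given at `ollama create`
        "system": SYSTEM_PROMPT,
        "prompt": (f"Variant: {variant}\n"
                   f"Associated conditions: {conditions}\n"
                   "### Response:\n"),
        "stream": False,                 # return one JSON object, not a stream
    }

payload = build_request(
    "NM_000059.4(BRCA2):c.8499G>A (p.Lys2833=)",
    "Hereditary breast ovarian cancer syndrome",
)
# POST json.dumps(payload) to http://localhost:11434/api/generate when Ollama is running.
print(json.dumps(payload, indent=2))
```

The same payload shape works for any GGUF imported into Ollama; only the `model` field changes.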


📂 Training Details

Dataset Construction

The training data was derived from a snapshot of ClinVar (NCBI) containing 1.1 million SNV records.

  • Filtered For: Single Nucleotide Variants (SNVs) only.
  • Labels: Binary mapping (Pathogenic/Likely Pathogenic → 1, Benign/Likely Benign → 0).
  • Exclusions: VUS (Variants of Uncertain Significance), conflicting interpretations, and incomplete records were removed.
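The filtering and labeling rules above can be sketched as a single record-level function. The field names (`variant_type`, `clinical_significance`, `hgvs`) are assumptions for illustration, not the actual ClinVar column names used in training.

```python
# Illustrative sketch of the filtering/labeling rules (assumed field names,
# not the actual ClinVar columns used to build the dataset).
LABEL_MAP = {
    "Pathogenic": 1, "Likely pathogenic": 1,
    "Benign": 0, "Likely benign": 0,
}

def prepare(record):
    """Keep SNVs with a clean binary label; drop VUS/conflicting/incomplete."""
    if record.get("variant_type") != "single nucleotide variant":
        return None                      # SNVs only
    label = LABEL_MAP.get(record.get("clinical_significance"))
    if label is None or not record.get("hgvs"):
        return None                      # VUS, conflicting, or incomplete
    return {"hgvs": record["hgvs"], "label": label}

assert prepare({"variant_type": "single nucleotide variant",
                "clinical_significance": "Likely benign",
                "hgvs": "NM_000546.6:c.215C>G"}) == {
    "hgvs": "NM_000546.6:c.215C>G", "label": 0}
assert prepare({"variant_type": "Deletion",
                "clinical_significance": "Pathogenic",
                "hgvs": "x"}) is None
```

Note that anything outside the four mapped significance values is dropped rather than guessed, which is what keeps the labels binary and clean.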

Semantic Output Layer

To prevent misuse, the model maps binary internal predictions to safe clinical language:

  • 1 (Internal) → "High Pathogenic Indication"
  • 0 (Internal) → "Low Pathogenic Indication"
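The internal-to-clinical mapping described above, as a minimal sketch (the function name is for illustration only):

```python
# Sketch of the semantic output layer: internal binary label -> safe clinical phrase.
SEMANTIC_LABELS = {
    1: "High Pathogenic Indication",
    0: "Low Pathogenic Indication",
}

def to_clinical(pred: int) -> str:
    """Map an internal 0/1 prediction to the documented clinical phrasing."""
    return SEMANTIC_LABELS[pred]

print(to_clinical(1))  # High Pathogenic Indication
```

Keeping the phrasing out of the model's free-form output and in a fixed lookup is what makes the flags structurally constrained.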


🛑 Limitations

  • Scope: Validated only for SNVs. Performance on Indels, CNVs, or Structural Variants is unknown and likely poor.
  • Binary Output: The model does not currently handle the complexity of VUS (Variants of Uncertain Significance).
  • Hallucination: Like all LLMs, the model can hallucinate. It is not a clinical diagnostic tool.


🔗 Resources

GitHub Repository: YADAV1825/PathoPreter

ClinVar Database: NCBI ClinVar
