Model Card for KAU-BioMedLLM

KAU-BioMedLLM is a research prototype for source-grounded biomedical variant interpretation. The current public release contains the LoRA adapters and documentation for a guarded report-generation system built around a curated biomedical evidence panel, citation enforcement, and abstention when evidence is insufficient.

Scope note (read first). This project contains two deliberately separated parts: (1) an independent label-free score model used for temporal benchmarking, and (2) this LLM report generator. The numeric score field inside a generated report is an evidence-label organizer derived from the supplied source labels; it is NOT the independent score model and NOT a calibrated pathogenicity probability. The biomedical evidence is a fixed, manually curated 20-gene panel, not a genome-wide or retrieval-augmented (RAG) index.

This repository contains two adapter releases:

  • Root adapter: Llama-3.1-8B-Instruct v0.2 LoRA adapter, recommended baseline for report generation.
  • qwen_v0_1/ adapter: Qwen2.5-1.5B-Instruct v0.1 LoRA adapter, lightweight prototype/demo fallback.

This repository does not redistribute full Llama or Qwen base model weights. Users must separately load the relevant base model according to its license terms.

Model Details

  • Model name: KAU-BioMedLLM
  • Repository: Babajaan/KAU-BioMedLLM
  • Developed by: Babajan B, King Abdulaziz University research environment
  • Model type: Biomedical instruction-tuned LoRA adapter for report generation
  • Primary base model: meta-llama/Llama-3.1-8B-Instruct
  • Prototype base model: Qwen/Qwen2.5-1.5B-Instruct
  • Language: English
  • Domain: Biomedical variant interpretation, clinical genomics research, bioinformatics evidence synthesis
  • Release type: Research-use adapter release
  • Clinical status: Not clinically validated; not for diagnosis, treatment, or patient-management decisions

Repository Layout

| Path | Adapter | Base model | Intended role |
|---|---|---|---|
| repository root | Llama v0.2 LoRA | meta-llama/Llama-3.1-8B-Instruct | Recommended report-generation baseline |
| qwen_v0_1/ | Qwen v0.1 LoRA | Qwen/Qwen2.5-1.5B-Instruct | Lightweight prototype and demo fallback |

The root adapter is kept as the primary Hugging Face adapter so standard PEFT loading with Babajaan/KAU-BioMedLLM loads the Llama v0.2 adapter. The Qwen adapter is included in the same repository under qwen_v0_1/ to avoid maintaining a second model repo.

Model Description

KAU-BioMedLLM was developed as part of a broader biomedical AI project to generate structured, source-grounded variant interpretation reports. The system combines two deliberately separated components:

  1. Independent score model: a label-free machine learning model for variant scoring and temporal benchmarking.
  2. LLM report generator: a LoRA-adapted instruction model for structured biomedical explanation.

The central design choice is to avoid treating LLM-generated prose as a calibrated pathogenicity score. The numeric score and the narrative interpretation are separated, and the report-generation layer is guarded by schema validation, citation checks, and abstention rules.

System Overview

The broader KAU-BioMedLLM workflow uses:

  • ClinVar current and archived snapshots for variant labels and temporal validation.
  • PubMed and biomedical literature evidence for gene and phenotype context.
  • UniProt for protein/gene functional evidence.
  • AlphaFold structure information for residue-level structural context where mapping is reliable.
  • Ensembl/VEP-style transcript and protein residue mapping.
  • A guarded report pipeline that enforces JSON structure, citation support, and abstention when evidence is missing or ambiguous.

Implementation scope. The current evidence layer is a fixed, manually curated panel covering 20 genes (UniProt + PubMed context, with AlphaFold structures retrieved for 18/20). There is no dynamic document retrieval, embedding index, or RAG system in the released pipeline; references to "evidence" mean this static curated panel. Genome-wide retrieval is future work.

The uploaded model artifacts are the LoRA adapters used for the report-generation component. The repository also includes data provenance notes, evaluation summaries, manuscript tables, and inference instructions.

Uses

Intended Use

KAU-BioMedLLM is intended for research and development in:

  • biomedical variant interpretation workflows;
  • genomics and bioinformatics report prototyping;
  • source-grounded biomedical evidence synthesis;
  • evaluating guarded LLM reporting pipelines;
  • studying leakage-aware score model design and temporal validation.

Direct Use

The root adapter can be loaded on top of meta-llama/Llama-3.1-8B-Instruct using PEFT. It can generate structured research reports when provided with variant and evidence context.

Downstream Use

The model may be used as a starting point for research on:

  • citation-grounded biomedical report generation;
  • variant prioritization interfaces;
  • guarded medical LLM output;
  • separation of statistical score models from narrative explanation;
  • biomedical uncertainty and abstention behavior.

Out-of-Scope Use

KAU-BioMedLLM must not be used as:

  • a clinical diagnostic system;
  • a substitute for expert variant curation;
  • a medical advice generator;
  • a system for patient-management or treatment decisions;
  • a standalone pathogenicity classifier for clinical reporting;
  • a validated replacement for ACMG/AMP interpretation workflows.

Training and Development Data

The project used biomedical and genomics resources including:

  • ClinVar current and archived variant snapshots;
  • PubMed/literature-derived evidence examples;
  • UniProt protein/gene evidence;
  • AlphaFold structural evidence for selected proteins;
  • Ensembl/VEP-style transcript and residue mapping outputs;
  • internally generated structured instruction examples for report generation.

The score model was intentionally designed around label-free features and excludes direct ClinVar significance text, review-star strength, gene identity, and external predictor scores such as REVEL, AlphaMissense, and CADD as training inputs.
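
The exclusion rule above can be sketched as a simple column filter. This is a minimal illustration under stated assumptions, not the project's actual feature pipeline: the column names in `EXCLUDED_COLUMNS`, the helper names, and the example row are all hypothetical.

```python
# Hypothetical column names: the project's real feature schema is not shown here.
EXCLUDED_COLUMNS = {
    "clinvar_significance",  # direct ClinVar significance text (the label)
    "review_stars",          # review-star strength
    "gene_symbol",           # gene identity
    "revel",                 # external predictor scores
    "alphamissense",
    "cadd",
}


def label_free_features(row):
    """Return a copy of `row` without leakage-prone columns."""
    return {k: v for k, v in row.items() if k not in EXCLUDED_COLUMNS}


def assert_label_free(feature_names):
    """Raise if any excluded column slipped into the training feature list."""
    leaked = sorted(EXCLUDED_COLUMNS.intersection(feature_names))
    if leaked:
        raise ValueError(f"leakage-prone columns present: {leaked}")


# Illustrative row; the feature names and values are invented.
example_row = {
    "consequence": "missense_variant",
    "ref_len": 1,
    "alt_len": 1,
    "gene_symbol": "MC4R",
    "clinvar_significance": "Uncertain significance",
    "revel": 0.41,
}
features = label_free_features(example_row)
assert_label_free(features)  # passes: only label-free columns remain
print(sorted(features))
```

Running a check like `assert_label_free` immediately before training is one way to make the exclusion list an enforced invariant rather than a convention.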

See reports/DATA_PROVENANCE.md for project-level data provenance and known version-recording limitations.

Training Procedure

The report-generation model was trained as a LoRA adapter rather than as a full base-model retraining run.

The project developed two model stages:

  • Qwen2.5-1.5B-Instruct v0.1: lightweight prototype used to validate the basic structured-output and guarded-report pipeline.
  • Llama-3.1-8B-Instruct v0.2: stronger report-generation baseline selected for improved biomedical reasoning capacity and structured report generation.

The LoRA approach keeps the released artifact compact and avoids redistributing the full base model weights.

Evaluation

The project separates score model evaluation from LLM report-generation evaluation.

Broad First-Appearance Temporal Score Model

The broad score model was evaluated using a first-appearance ClinVar temporal split:

  • Training snapshot: ClinVar 2023-12-26
  • Test snapshot: ClinVar 2025-12-21
  • Test set: labeled variants appearing in the 2025 snapshot and absent from the 2023 snapshot
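
The first-appearance rule can be sketched with plain set logic. This is an illustrative sketch only: the function name, the snapshot representation, and the variant IDs are assumptions, not the project's code or data.

```python
def first_appearance_test_set(train_snapshot_ids, test_snapshot_labels):
    """Select labeled variants present in the test snapshot but absent from training.

    train_snapshot_ids: iterable of variant IDs in the older snapshot.
    test_snapshot_labels: dict mapping variant ID -> label (None if unlabeled).
    """
    seen = set(train_snapshot_ids)
    return {
        vid: label
        for vid, label in test_snapshot_labels.items()
        if label is not None and vid not in seen
    }


# Toy snapshots with invented variant IDs.
snapshot_2023 = ["var_a", "var_b", "var_c"]
snapshot_2025 = {
    "var_a": 1,      # already present in 2023: excluded
    "var_c": 0,      # already present in 2023: excluded
    "var_d": 1,      # new and labeled: kept
    "var_e": None,   # new but unlabeled: excluded
    "var_f": 0,      # new and labeled: kept
}
test_set = first_appearance_test_set(snapshot_2023, snapshot_2025)
print(test_set)
```

Selecting by first appearance means no test variant was visible in the training snapshot.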

Reported benchmark:

| Metric | Value |
|---|---|
| Test variants | 509,974 |
| AUROC | 0.99125 |
| AUPRC | 0.96938 |
| Brier score | 0.02903 |
| ECE | 0.03428 |

These metrics apply to the broad score model, not to the LLM text generator alone.
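
For readers unfamiliar with the two calibration metrics, the sketch below shows how a Brier score and an expected calibration error (ECE) are commonly computed for binary labels. The binning scheme (10 equal-width bins) is an assumption, since the card does not state which variant of ECE was used, and the toy labels and probabilities are invented.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted mean |observed positive rate - mean predicted probability| per bin.

    Assumes equal-width probability bins; other binning schemes give other values.
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))
    total = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        ece += (len(members) / total) * abs(observed - predicted)
    return ece


# Invented toy data, not project results.
labels = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.6, 0.9]
toy_brier = brier_score(labels, probs)
toy_ece = expected_calibration_error(labels, probs)
print(toy_brier, toy_ece)
```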

Per-Consequence Breakdown

The broad AUROC is not sufficient by itself because consequence classes differ strongly in class balance and difficulty. The table below reports the first-appearance temporal score model by consequence class. It is included to make clear that the headline broad score is partly consequence-class-driven.

| Consequence | N | Pos | Neg | AUROC | AUPRC | Brier | ECE |
|---|---|---|---|---|---|---|---|
| non-coding_transcript_variant | 34,607 | 257 | 34,350 | 0.91551 | 0.61517 | 0.00398 | 0.00219 |
| 3_prime_UTR_variant | 3,861 | 24 | 3,837 | 0.89588 | 0.13295 | 0.00623 | 0.00464 |
| other low-N | 534 | 336 | 198 | 0.87042 | 0.90320 | 0.13473 | 0.05293 |
| unknown | 1,503 | 542 | 961 | 0.86056 | 0.75580 | 0.14478 | 0.05336 |
| missense_variant | 67,847 | 12,943 | 54,904 | 0.81292 | 0.55438 | 0.18069 | 0.23174 |
| splice_donor_variant | 8,438 | 8,238 | 200 | 0.81240 | 0.99205 | 0.01565 | 0.00245 |
| 5_prime_UTR_variant | 6,609 | 68 | 6,541 | 0.77156 | 0.23015 | 0.00891 | 0.00739 |
| inframe_deletion | 386 | 219 | 167 | 0.76963 | 0.80047 | 0.20247 | 0.08748 |
| initiator_codon_variant | 713 | 615 | 98 | 0.74360 | 0.93131 | 0.09672 | 0.04770 |
| intron_variant | 153,601 | 951 | 152,650 | 0.73241 | 0.03627 | 0.00619 | 0.00863 |
| frameshift_variant | 38,296 | 37,947 | 349 | 0.72617 | 0.99567 | 0.00786 | 0.00094 |
| nonsense | 22,375 | 22,034 | 341 | 0.70115 | 0.99182 | 0.01238 | 0.00779 |
| splice_acceptor_variant | 7,209 | 7,081 | 128 | 0.70025 | 0.98999 | 0.01468 | 0.00421 |
| synonymous_variant | 163,995 | 121 | 163,874 | 0.64395 | 0.00175 | 0.00074 | 0.00078 |

Missense variants are materially harder than the broad all-variant benchmark. This is why the project reports missense-specific limitations separately and does not claim state-of-the-art missense prediction.
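
Because a random ranker's AUPRC is roughly the positive-class prevalence, the per-class AUPRC values are easier to read next to Pos/N. The sketch below computes that prevalence for four rows copied from the table above; the "lift" ratio is an illustrative reading aid introduced here, not a metric reported by the project.

```python
# (N, Pos, AUPRC) copied from the per-consequence table.
rows = {
    "missense_variant": (67847, 12943, 0.55438),
    "synonymous_variant": (163995, 121, 0.00175),
    "frameshift_variant": (38296, 37947, 0.99567),
    "intron_variant": (153601, 951, 0.03627),
}

prevalence = {}
lift = {}
for name, (n, pos, auprc) in rows.items():
    prevalence[name] = pos / n               # approximate AUPRC of a random ranker
    lift[name] = auprc / prevalence[name]    # how far the model sits above that floor
    print(f"{name}: prevalence={prevalence[name]:.5f} "
          f"AUPRC={auprc:.5f} lift={lift[name]:.2f}x")
```

Read this way, a very high AUPRC on a class that is almost entirely positive (such as frameshift variants) says little about discrimination, and a tiny AUPRC on a class with very few positives can still be well above chance.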

Missense-Specific Limitation

Missense-only prediction remained substantially harder, and both project score models trail specialist predictors in the same-row comparison:

| Model / Predictor | Missense AUROC |
|---|---|
| score_model v0 | 0.78497 |
| missense v1 protein-feature score_model | 0.83644 |
| AlphaMissense same-row comparison | 0.97214 |
| REVEL same-row comparison | 0.97832 |

KAU-BioMedLLM should therefore not be described as state-of-the-art for missense pathogenicity prediction.

Report Guard Pilot (n=5, smoke test — not a benchmark)

The guarded report layer was exercised on a 5-variant pilot set. These numbers demonstrate that the guard/abstention machinery runs end-to-end; they are not a statistical benchmark of report quality, and a larger evaluation is future work.

  • Citation guard pass: 5/5 reports
  • Citation guard fail: 0/5 reports
  • Structure mapping available: 1/5 reports
  • Structure mapping abstained: 4/5 reports

The abstention behavior is intentional: when evidence is missing, ambiguous, or not reliably mapped, the system should state uncertainty rather than fabricate confidence. The structure-mapping rate deserves attention: only 1/5 fresh or novel variants mapped (4/5 abstained), compared with 4/5 on a curated positive-control set, so residue-level structural context is frequently unavailable for new inputs.

Example Guarded Report

The example below illustrates the intended report style: the statistical score-model field is separated from the LLM narrative, biomedical claims include source identifiers, and missing evidence triggers abstention rather than unsupported interpretation.

{
  "gene": "MC4R",
  "variant": "chr18:60371172:C>T",
  "final_interpretation": "Uncertain significance",
  "score_model_prediction": {
    "status": "abstain",
    "reason": "variant_not_in_existing_score_prediction_tables",
    "message": "No precomputed score_model prediction was available for this report input. Full arbitrary-variant scoring is deferred to v2.",
    "not_for_clinical_diagnosis": true
  },
  "llm_interpretation": {
    "source": "KAU-BioMedLLM report generator",
    "reasoning_summary": "The variant is represented as a missense variation in MC4R. The available source label is uncertain significance. This does not establish clinical causality without transcript-level validation, allele frequency, segregation, phenotype match, functional data, and ACMG/AMP evidence review.",
    "limitations": "Research record only. Not a clinical diagnosis."
  },
  "evidence_examples": {
    "gene_function": "MC4R functions in the leptin-melanocortin pathway and regulates energy homeostasis.",
    "source_ids": ["UniProt:P32245", "PMID:12646665", "PMID:32327598", "PMID:33858992"],
    "structure_context": "Use only when transcript, protein accession, and residue mapping can be reconciled."
  },
  "guard_status": {
    "citation_check": "pass",
    "abstention_allowed": true,
    "clinical_use": "not permitted"
  }
}

Full guarded report examples and summaries are included in the repository under reports/full_reports/.
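
A guard over a report like the example above could look roughly like the following. This is a hedged sketch, not the released guard: the required-key list, the ID patterns, and the function name `check_report` are assumptions. It only checks that source identifiers are present and well-formed, not that each cited source actually supports the claim.

```python
import re

# Assumed required top-level keys, taken from the example report above.
REQUIRED_KEYS = {
    "gene",
    "variant",
    "final_interpretation",
    "llm_interpretation",
    "evidence_examples",
    "guard_status",
}
# Assumed identifier formats: PubMed IDs and UniProt accessions.
SOURCE_ID_PATTERN = re.compile(r"^(PMID:\d+|UniProt:[A-Z0-9]+)$")


def check_report(report):
    """Return schema and citation-format check results for one report dict."""
    missing = sorted(REQUIRED_KEYS - set(report))
    evidence = report.get("evidence_examples") or {}
    source_ids = evidence.get("source_ids") or []
    malformed = [s for s in source_ids if not SOURCE_ID_PATTERN.match(s)]
    citation_ok = bool(source_ids) and not malformed
    return {
        "schema_check": "fail" if missing else "pass",
        "missing_keys": missing,
        "citation_check": "pass" if citation_ok else "fail",
        "malformed_source_ids": malformed,
    }


# Trimmed version of the example report above.
report = {
    "gene": "MC4R",
    "variant": "chr18:60371172:C>T",
    "final_interpretation": "Uncertain significance",
    "llm_interpretation": {"limitations": "Research record only."},
    "evidence_examples": {"source_ids": ["UniProt:P32245", "PMID:12646665"]},
    "guard_status": {"clinical_use": "not permitted"},
}
result = check_report(report)
print(result)
```

A report that fails either check would be a candidate for abstention or regeneration rather than release.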

Limitations

  • The current evidence panel is a fixed, manually curated 20-gene set; it is not genome-wide and there is no dynamic retrieval / RAG index.
  • The numeric score field in a report is an evidence-label organizer, not a calibrated probability and not the independent score model output.
  • The report-generation guard results are a 5-variant pilot, not a statistical benchmark.
  • Missense prediction remains below specialist predictors such as REVEL and AlphaMissense.
  • Broad score model performance is partly driven by easier consequence classes.
  • Structure evidence is only used when transcript/protein/residue mapping can be reconciled.
  • Exact historical versions for some external resources require final manual verification before manuscript submission.
  • The model has not been prospectively validated in clinical settings.
  • Generated text can still contain errors and must be reviewed by domain experts.
  • The uploaded artifacts are adapters, not standalone full models.

How to Load the Adapters

Llama v0.2 Adapter

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model = "meta-llama/Llama-3.1-8B-Instruct"
adapter_model = "Babajaan/KAU-BioMedLLM"

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_model)
model.eval()

messages = [
    {
        "role": "user",
        "content": (
            "For research use only, generate a structured biomedical variant "
            "interpretation report for BRCA1 c.5096G>A p.Arg1699Gln. "
            "Include uncertainty and do not provide clinical advice."
        ),
    }
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
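# The chat template already inserts the special tokens, so do not add them again.
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)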

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=700,
        temperature=0.2,
        do_sample=True,
    )

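# Decode only the newly generated tokens, not the echoed prompt.
generated = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))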
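
Generated text is not guaranteed to be valid JSON, so downstream code should parse it defensively. The sketch below pulls the first JSON object out of a decoded string and falls back to an abstention record when nothing parses. The helper names and the fallback fields are hypothetical and are not part of the released pipeline.

```python
import json


def extract_first_json_object(text):
    """Return the first parseable JSON object found in `text`, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON from this brace; try the next one
        start = text.find("{", start + 1)
    return None


def parse_or_abstain(text):
    """Return the parsed report, or a hypothetical abstention record."""
    report = extract_first_json_object(text)
    if report is None:
        return {"status": "abstain", "reason": "unparseable_model_output"}
    return report


# Invented model output for illustration.
sample = 'Here is the report: {"gene": "BRCA1", "details": {"n": 1}} End.'
parsed = parse_or_abstain(sample)
print(parsed)
```

Treating unparseable output as an abstention keeps malformed generations out of any downstream schema or citation checks.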

Qwen v0.1 Adapter

The Qwen adapter is stored in the same repository under qwen_v0_1/. Use it for lightweight demos when Llama-3.1-8B hardware is unavailable.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base_model = "Qwen/Qwen2.5-1.5B-Instruct"
adapter_model = "Babajaan/KAU-BioMedLLM"
adapter_subfolder = "qwen_v0_1"

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(
    model,
    adapter_model,
    subfolder=adapter_subfolder,
)
model.eval()

Compute Infrastructure

The project was developed and evaluated using the King Abdulaziz University Aziz High Performance Computing environment. The workflow used the Aziz HPC software environment, PBS job scheduling, and GPU resources including A100 GPU jobs where available.

KAU Aziz HPC support resources:

  • Aziz Supercomputer Support Center: https://hpcc-kau.com/support/
  • PBS and GPU job support documentation is available through the Aziz HPC knowledge base.

Ethical and Safety Considerations

Biomedical LLM systems can produce fluent but incorrect explanations. KAU-BioMedLLM therefore uses a guard-first design:

  • numeric score and LLM narrative are separated;
  • citations are required for biomedical claims;
  • reports can abstain when evidence is insufficient;
  • outputs are research-use only;
  • no clinical decision should be made from model output.

Users should treat all outputs as draft research interpretations requiring expert review.

Citation and Acknowledgement

If you use this model or the associated code/reports, please cite the repository, or the project manuscript once it is available.

Manuscript status: in preparation. A formal citation will be added after manuscript submission or publication.

Suggested acknowledgement:

This work used computational resources from the King Abdulaziz University Aziz High Performance Computing environment.

License and Base Model Notes

This repository contains LoRA adapters and project documentation. It does not redistribute full Llama or Qwen base model weights. Use of the root adapter with meta-llama/Llama-3.1-8B-Instruct is subject to the Meta Llama 3.1 Community License and Acceptable Use Policy. Use of the qwen_v0_1/ adapter with Qwen/Qwen2.5-1.5B-Instruct is subject to Qwen license terms. Users are responsible for verifying license compatibility for their own use case.
