Model Card for KAU-BioMedLLM
KAU-BioMedLLM is a research prototype for source-grounded biomedical variant interpretation. The current public release contains the LoRA adapter and documentation for a guarded report-generation system built around a curated biomedical evidence panel, citation enforcement, and abstention when evidence is insufficient.
Scope note (read first). This project contains two deliberately separated parts: (1) an independent label-free score model used for temporal benchmarking, and (2) this LLM report generator. The numeric `score` field inside a generated report is an evidence-label organizer derived from the supplied source labels; it is NOT the independent score model and NOT a calibrated pathogenicity probability. The biomedical evidence is a fixed, manually curated 20-gene panel, not a genome-wide or retrieval-augmented (RAG) index.
This repository contains two adapter releases:
- Root adapter: Llama-3.1-8B-Instruct v0.2 LoRA adapter, recommended baseline for report generation.
- qwen_v0_1/ adapter: Qwen2.5-1.5B-Instruct v0.1 LoRA adapter, lightweight prototype/demo fallback.
This repository does not redistribute full Llama or Qwen base model weights. Users must separately load the relevant base model according to its license terms.
Table of Contents
- Model Details
- Repository Layout
- Model Description
- System Overview
- Uses
- Out-of-Scope Use
- Training and Development Data
- Training Procedure
- Evaluation
- Example Guarded Report
- Limitations
- How to Load the Adapters
- Compute Infrastructure
- Ethical and Safety Considerations
- Citation and Acknowledgement
- License and Base Model Notes
Model Details
- Model name: KAU-BioMedLLM
- Repository: Babajaan/KAU-BioMedLLM
- Developed by: Babajan B, King Abdulaziz University research environment
- Model type: Biomedical instruction-tuned LoRA adapter for report generation
- Primary base model: meta-llama/Llama-3.1-8B-Instruct
- Prototype baseline: Qwen/Qwen2.5-1.5B-Instruct
- Language: English
- Domain: Biomedical variant interpretation, clinical genomics research, bioinformatics evidence synthesis
- Release type: Research-use adapter release
- Clinical status: Not clinically validated; not for diagnosis, treatment, or patient-management decisions
Repository Layout
| Path | Adapter | Base model | Intended role |
|---|---|---|---|
| repository root | Llama v0.2 LoRA | meta-llama/Llama-3.1-8B-Instruct | Recommended report-generation baseline |
| qwen_v0_1/ | Qwen v0.1 LoRA | Qwen/Qwen2.5-1.5B-Instruct | Lightweight prototype and demo fallback |
The root adapter is kept as the primary Hugging Face adapter so standard PEFT loading with Babajaan/KAU-BioMedLLM loads the Llama v0.2 adapter. The Qwen adapter is included in the same repository under qwen_v0_1/ to avoid maintaining a second model repo.
Model Description
KAU-BioMedLLM was developed as part of a broader biomedical AI project to generate structured, source-grounded variant interpretation reports. The system combines two deliberately separated components:
- Independent score model: a label-free machine learning model for variant scoring and temporal benchmarking.
- LLM report generator: a LoRA-adapted instruction model for structured biomedical explanation.
The central design choice is to avoid treating LLM-generated prose as a calibrated pathogenicity score. The numeric score and the narrative interpretation are separated, and the report-generation layer is guarded by schema validation, citation checks, and abstention rules.
System Overview
The broader KAU-BioMedLLM workflow uses:
- ClinVar current and archived snapshots for variant labels and temporal validation.
- PubMed and biomedical literature evidence for gene and phenotype context.
- UniProt for protein/gene functional evidence.
- AlphaFold structure information for residue-level structural context where mapping is reliable.
- Ensembl/VEP-style transcript and protein residue mapping.
- A guarded report pipeline that enforces JSON structure, citation support, and abstention when evidence is missing or ambiguous.
Implementation scope. The current evidence layer is a fixed, manually curated panel covering 20 genes (UniProt + PubMed context, with AlphaFold structures retrieved for 18/20). There is no dynamic document retrieval, embedding index, or RAG system in the released pipeline; references to "evidence" mean this static curated panel. Genome-wide retrieval is future work.
The uploaded model artifact is the LoRA adapter used for the report-generation component. The repository also includes data provenance notes, evaluation summaries, manuscript tables, and inference instructions.
Uses
Intended Use
KAU-BioMedLLM is intended for research and development in:
- biomedical variant interpretation workflows;
- genomics and bioinformatics report prototyping;
- source-grounded biomedical evidence synthesis;
- evaluating guarded LLM reporting pipelines;
- studying leakage-aware score model design and temporal validation.
Direct Use
The adapter can be loaded with meta-llama/Llama-3.1-8B-Instruct using PEFT. It can generate structured research reports when provided with variant and evidence context.
Downstream Use
The model may be used as a starting point for research on:
- citation-grounded biomedical report generation;
- variant prioritization interfaces;
- guarded medical LLM output;
- separation of statistical score models from narrative explanation;
- biomedical uncertainty and abstention behavior.
Out-of-Scope Use
KAU-BioMedLLM must not be used as:
- a clinical diagnostic system;
- a substitute for expert variant curation;
- a medical advice generator;
- a system for patient-management or treatment decisions;
- a standalone pathogenicity classifier for clinical reporting;
- a validated replacement for ACMG/AMP interpretation workflows.
Training and Development Data
The project used biomedical and genomics resources including:
- ClinVar current and archived variant snapshots;
- PubMed/literature-derived evidence examples;
- UniProt protein/gene evidence;
- AlphaFold structural evidence for selected proteins;
- Ensembl/VEP-style transcript and residue mapping outputs;
- internally generated structured instruction examples for report generation.
The score model was intentionally designed around label-free features and excludes direct ClinVar significance text, review-star strength, gene identity, and external predictor scores such as REVEL, AlphaMissense, and CADD as training inputs.
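The exclusion rule above can be illustrated with a small filter. This is a minimal sketch only: the column names in `EXCLUDED_FEATURES` and in the example row are hypothetical placeholders chosen for illustration, since the project's actual feature schema is not listed here.

```python
# Hypothetical column names standing in for the excluded categories:
# ClinVar significance text, review-star strength, gene identity,
# and external predictor scores (REVEL, AlphaMissense, CADD).
EXCLUDED_FEATURES = {
    "clinvar_significance_text",
    "review_stars",
    "gene_symbol",
    "revel_score",
    "alphamissense_score",
    "cadd_phred",
}


def label_free_features(row: dict) -> dict:
    """Return a copy of `row` without any excluded, leakage-prone columns."""
    return {k: v for k, v in row.items() if k not in EXCLUDED_FEATURES}


# Toy input row; every key and value here is illustrative.
row = {
    "consequence": "missense_variant",
    "allele_frequency": 1.2e-5,
    "gene_symbol": "BRCA1",
    "revel_score": 0.91,
    "review_stars": 2,
}
features = label_free_features(row)
print(sorted(features))
```

Filtering by an explicit deny-list keeps the leakage policy in one reviewable place, though an allow-list would be the stricter choice if the feature set is small.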
See reports/DATA_PROVENANCE.md for project-level data provenance and known version-recording limitations.
Training Procedure
The report-generation model was trained as a LoRA adapter rather than as a full base-model retraining run.
The project developed two model stages:
- Qwen2.5-1.5B-Instruct v0.1: lightweight prototype used to validate the basic structured-output and guarded-report pipeline.
- Llama-3.1-8B-Instruct v0.2: stronger report-generation baseline selected for improved biomedical reasoning capacity and structured report generation.
The LoRA approach keeps the released artifact compact and avoids redistributing the full base model weights.
Evaluation
The project separates score model evaluation from LLM report-generation evaluation.
Broad First-Appearance Temporal Score Model
The broad score model was evaluated using a first-appearance ClinVar temporal split:
- Training snapshot: ClinVar 2023-12-26
- Test snapshot: ClinVar 2025-12-21
- Test set: labeled variants appearing in the 2025 snapshot and absent from the 2023 snapshot
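The first-appearance rule can be sketched as follows. This is an illustration under stated assumptions, not the project's pipeline: snapshots are represented as plain dicts mapping a variant identifier to a binary label (or `None` when unlabeled), and the function name and toy identifiers are hypothetical.

```python
def first_appearance_split(train_snapshot: dict, test_snapshot: dict):
    """Split two snapshots so the test set holds only newly appearing variants.

    Each snapshot maps variant_id -> label (1, 0, or None if unlabeled).
    """
    # Train on labeled variants from the older snapshot.
    train = {v: y for v, y in train_snapshot.items() if y is not None}
    # Test on labeled variants that are entirely absent from the older snapshot.
    test = {
        v: y
        for v, y in test_snapshot.items()
        if y is not None and v not in train_snapshot
    }
    return train, test


# Toy snapshots; identifiers and labels are illustrative only.
snap_2023 = {"v1": 1, "v2": 0, "v3": None}
snap_2025 = {"v1": 1, "v2": 1, "v3": 0, "v4": 1, "v5": None, "v6": 0}

train, test = first_appearance_split(snap_2023, snap_2025)
print(train)
print(test)
```

In this toy case `v3` is excluded from the test set because it was already present in the older snapshot, even though it was unlabeled there, which matches the "absent from the 2023 snapshot" wording.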
Reported benchmark:
| Metric | Value |
|---|---|
| Test variants | 509,974 |
| AUROC | 0.99125 |
| AUPRC | 0.96938 |
| Brier score | 0.02903 |
| ECE | 0.03428 |
These metrics apply to the broad score model, not to the LLM text generator alone.
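For readers less familiar with the calibration metrics in the table, the following is a minimal pure-Python sketch of the Brier score and expected calibration error (ECE). The binning scheme (10 equal-width bins) is an assumption; the scheme used to produce the reported ECE is not stated here.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted average gap between mean confidence and observed positive rate.

    Assumes equal-width probability bins; other binning choices give other values.
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))

    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)
        confidence = sum(p for _, p in members) / len(members)
        ece += len(members) / n * abs(observed - confidence)
    return ece


# Toy predictions, not project data.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.2, 0.8, 0.9]
brier = brier_score(y_true, y_prob)
ece = expected_calibration_error(y_true, y_prob)
print(round(brier, 4), round(ece, 4))
```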
Per-Consequence Breakdown
The broad AUROC is not sufficient by itself because consequence classes differ strongly in class balance and difficulty. The table below reports the first-appearance temporal score model by consequence class. It is included to make clear that the headline broad score is partly consequence-class-driven.
| Consequence | N | Pos | Neg | AUROC | AUPRC | Brier | ECE |
|---|---|---|---|---|---|---|---|
| non-coding_transcript_variant | 34,607 | 257 | 34,350 | 0.91551 | 0.61517 | 0.00398 | 0.00219 |
| 3_prime_UTR_variant | 3,861 | 24 | 3,837 | 0.89588 | 0.13295 | 0.00623 | 0.00464 |
| other low-N | 534 | 336 | 198 | 0.87042 | 0.90320 | 0.13473 | 0.05293 |
| unknown | 1,503 | 542 | 961 | 0.86056 | 0.75580 | 0.14478 | 0.05336 |
| missense_variant | 67,847 | 12,943 | 54,904 | 0.81292 | 0.55438 | 0.18069 | 0.23174 |
| splice_donor_variant | 8,438 | 8,238 | 200 | 0.81240 | 0.99205 | 0.01565 | 0.00245 |
| 5_prime_UTR_variant | 6,609 | 68 | 6,541 | 0.77156 | 0.23015 | 0.00891 | 0.00739 |
| inframe_deletion | 386 | 219 | 167 | 0.76963 | 0.80047 | 0.20247 | 0.08748 |
| initiator_codon_variant | 713 | 615 | 98 | 0.74360 | 0.93131 | 0.09672 | 0.04770 |
| intron_variant | 153,601 | 951 | 152,650 | 0.73241 | 0.03627 | 0.00619 | 0.00863 |
| frameshift_variant | 38,296 | 37,947 | 349 | 0.72617 | 0.99567 | 0.00786 | 0.00094 |
| nonsense | 22,375 | 22,034 | 341 | 0.70115 | 0.99182 | 0.01238 | 0.00779 |
| splice_acceptor_variant | 7,209 | 7,081 | 128 | 0.70025 | 0.98999 | 0.01468 | 0.00421 |
| synonymous_variant | 163,995 | 121 | 163,874 | 0.64395 | 0.00175 | 0.00074 | 0.00078 |
Missense variants are materially harder than the broad all-variant benchmark. This is why the project reports missense-specific limitations separately and does not claim state-of-the-art missense prediction.
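The class-balance differences can be read directly off the Pos and Neg columns. The sketch below recomputes the positive rate per consequence class from the table values and checks that the class sizes add up to the reported test-set size. The positive rate is a useful reference when reading AUPRC, since an uninformative ranker scores roughly that value.

```python
# (Pos, Neg) counts copied from the per-consequence table above.
rows = {
    "non-coding_transcript_variant": (257, 34350),
    "3_prime_UTR_variant": (24, 3837),
    "other low-N": (336, 198),
    "unknown": (542, 961),
    "missense_variant": (12943, 54904),
    "splice_donor_variant": (8238, 200),
    "5_prime_UTR_variant": (68, 6541),
    "inframe_deletion": (219, 167),
    "initiator_codon_variant": (615, 98),
    "intron_variant": (951, 152650),
    "frameshift_variant": (37947, 349),
    "nonsense": (22034, 341),
    "splice_acceptor_variant": (7081, 128),
    "synonymous_variant": (121, 163874),
}

# Total across classes; this should equal the reported number of test variants.
total = sum(pos + neg for pos, neg in rows.values())

# Fraction of positives in each class.
rates = {name: pos / (pos + neg) for name, (pos, neg) in rows.items()}

print(f"total test variants: {total}")
for name, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{name:32s} positive rate = {rate:.5f}")
```

Classes such as frameshift and nonsense are almost entirely positive while synonymous and intronic variants are almost entirely negative, so a model that separates consequence classes already ranks much of the pooled test set correctly.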
Missense-Specific Limitation
Missense-only prediction remained substantially harder:
| Model / Predictor | Missense AUROC |
|---|---|
| score_model v0 | 0.78497 |
| missense v1 protein-feature score_model | 0.83644 |
| AlphaMissense same-row comparison | 0.97214 |
| REVEL same-row comparison | 0.97832 |
This limitation is important: KAU-BioMedLLM should not be claimed as state-of-the-art for missense pathogenicity prediction.
Report Guard Pilot (n=5, smoke test — not a benchmark)
The guarded report layer was exercised on a 5-variant pilot set. These numbers demonstrate that the guard/abstention machinery runs end-to-end; they are not a statistical benchmark of report quality, and a larger evaluation is future work.
- Citation guard pass: 5/5 reports
- Citation guard fail: 0/5 reports
- Structure mapping available: 1/5 reports
- Structure mapping abstained: 4/5 reports
The abstention behavior is intentional: when evidence is missing, ambiguous, or not reliably mapped, the system should state uncertainty rather than fabricate confidence. Note the structure-mapping rate: on fresh/novel variants only 1/5 mapped (4/5 abstained), versus 4/5 on a curated positive-control set — so residue-level structural context is frequently unavailable for new inputs.
Example Guarded Report
The example below illustrates the intended report style: the statistical score-model field is separated from the LLM narrative, biomedical claims include source identifiers, and missing evidence triggers abstention rather than unsupported interpretation.
```json
{
  "gene": "MC4R",
  "variant": "chr18:60371172:C>T",
  "final_interpretation": "Uncertain significance",
  "score_model_prediction": {
    "status": "abstain",
    "reason": "variant_not_in_existing_score_prediction_tables",
    "message": "No precomputed score_model prediction was available for this report input. Full arbitrary-variant scoring is deferred to v2.",
    "not_for_clinical_diagnosis": true
  },
  "llm_interpretation": {
    "source": "KAU-BioMedLLM report generator",
    "reasoning_summary": "The variant is represented as a missense variation in MC4R. The available source label is uncertain significance. This does not establish clinical causality without transcript-level validation, allele frequency, segregation, phenotype match, functional data, and ACMG/AMP evidence review.",
    "limitations": "Research record only. Not a clinical diagnosis."
  },
  "evidence_examples": {
    "gene_function": "MC4R functions in the leptin-melanocortin pathway and regulates energy homeostasis.",
    "source_ids": ["UniProt:P32245", "PMID:12646665", "PMID:32327598", "PMID:33858992"],
    "structure_context": "Use only when transcript, protein accession, and residue mapping can be reconciled."
  },
  "guard_status": {
    "citation_check": "pass",
    "abstention_allowed": true,
    "clinical_use": "not permitted"
  }
}
```
Full guarded report examples and summaries are included in the repository under reports/full_reports/.
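To make the guard idea concrete, here is a minimal sketch of a structural and citation-format check over a report shaped like the example in this section. It is not the project's guard implementation: the required-key set, the identifier pattern, and the function name `check_report` are assumptions inferred from the example, and the check only inspects identifier format. It does not confirm that a cited source exists or supports the claim.

```python
import re

# Assumed identifier formats, inferred from the example report only.
SOURCE_ID = re.compile(r"^(PMID:\d+|UniProt:[A-Z0-9]+)$")

# Assumed top-level schema, taken from the keys of the example report.
REQUIRED_KEYS = {
    "gene",
    "variant",
    "final_interpretation",
    "score_model_prediction",
    "llm_interpretation",
    "evidence_examples",
    "guard_status",
}


def check_report(report: dict) -> dict:
    """Return a small verdict on structure and citation formatting."""
    missing = sorted(REQUIRED_KEYS - report.keys())
    source_ids = report.get("evidence_examples", {}).get("source_ids", [])
    bad_ids = [s for s in source_ids if not SOURCE_ID.match(s)]
    # Fail when no sources are cited or any identifier is malformed.
    citation_ok = bool(source_ids) and not bad_ids
    return {
        "schema_ok": not missing,
        "missing_keys": missing,
        "citation_check": "pass" if citation_ok else "fail",
        "bad_source_ids": bad_ids,
    }


# Abbreviated report using values from the example in this section.
report = {
    "gene": "MC4R",
    "variant": "chr18:60371172:C>T",
    "final_interpretation": "Uncertain significance",
    "score_model_prediction": {"status": "abstain"},
    "llm_interpretation": {"limitations": "Research record only."},
    "evidence_examples": {"source_ids": ["UniProt:P32245", "PMID:12646665"]},
    "guard_status": {"clinical_use": "not permitted"},
}
verdict = check_report(report)
print(verdict)
```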
Limitations
- The current evidence panel is a fixed, manually curated 20-gene set; it is not genome-wide and there is no dynamic retrieval / RAG index.
- The numeric `score` field in a report is an evidence-label organizer, not a calibrated probability and not the independent score model output.
- The report-generation guard results are a 5-variant pilot, not a statistical benchmark.
- Missense prediction remains below specialist predictors such as REVEL and AlphaMissense.
- Broad score model performance is partly driven by easier consequence classes.
- Structure evidence is only used when transcript/protein/residue mapping can be reconciled.
- Exact historical versions for some external resources require final manual verification before manuscript submission.
- The model has not been prospectively validated in clinical settings.
- Generated text can still contain errors and must be reviewed by domain experts.
- The uploaded artifact is an adapter, not a standalone full model.
How to Load the Adapters
Llama v0.2 Adapter
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "meta-llama/Llama-3.1-8B-Instruct"
adapter_model = "Babajaan/KAU-BioMedLLM"

# Load the tokenizer and base weights, then attach the LoRA adapter
# stored at the repository root.
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter_model)
model.eval()

messages = [
    {
        "role": "user",
        "content": (
            "For research use only, generate a structured biomedical variant "
            "interpretation report for BRCA1 c.5096G>A p.Arg1699Gln. "
            "Include uncertainty and do not provide clinical advice."
        ),
    }
]

# Render the chat template to a prompt string ending with the assistant header.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# The rendered Llama template already begins with the BOS token, so special
# tokens are not added a second time here.
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=700,
        temperature=0.2,
        do_sample=True,
    )

# The decoded text contains the prompt followed by the generated report.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Qwen v0.1 Adapter
The Qwen adapter is stored in the same repository under qwen_v0_1/. Use it for lightweight demos when Llama-3.1-8B hardware is unavailable.
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "Qwen/Qwen2.5-1.5B-Instruct"
adapter_model = "Babajaan/KAU-BioMedLLM"
adapter_subfolder = "qwen_v0_1"

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load the adapter from the qwen_v0_1/ subfolder instead of the repository root.
model = PeftModel.from_pretrained(
    model,
    adapter_model,
    subfolder=adapter_subfolder,
)
model.eval()
```
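Text decoded from either adapter may contain the prompt or surrounding prose in addition to the JSON report. The following is a minimal sketch for pulling the first parseable JSON object out of such text, using only the standard library. The helper name `extract_first_json_object` is hypothetical and this is not the project's guard code. Returning `None` gives the caller a clear signal to abstain or retry.

```python
import json


def extract_first_json_object(text: str):
    """Return the first JSON object found in `text`, or None if none parses."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores any trailing text after it.
            obj, _end = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON from this brace; try the next one
        start = text.find("{", start + 1)
    return None


# Toy generated text; not real model output.
generated = 'Report:\n{"gene": "MC4R", "guard_status": {"citation_check": "pass"}}\nEnd.'
report = extract_first_json_object(generated)
print(report)
```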
Compute Infrastructure
The project was developed and evaluated using the King Abdulaziz University Aziz High Performance Computing environment. The workflow used the Aziz HPC software environment, PBS job scheduling, and GPU resources including A100 GPU jobs where available.
KAU Aziz HPC support resources:
- Aziz Supercomputer Support Center: https://hpcc-kau.com/support/
- PBS and GPU job support documentation are available through the Aziz HPC knowledge base.
Ethical and Safety Considerations
Biomedical LLM systems can produce fluent but incorrect explanations. KAU-BioMedLLM therefore uses a guard-first design:
- numeric score and LLM narrative are separated;
- citations are required for biomedical claims;
- reports can abstain when evidence is insufficient;
- outputs are research-use only;
- no clinical decision should be made from model output.
Users should treat all outputs as draft research interpretations requiring expert review.
Citation and Acknowledgement
If you use this model or the associated code/reports, please cite the forthcoming project manuscript or repository when available.
Manuscript status: in preparation. A formal citation will be added after manuscript submission or publication.
Suggested acknowledgement:
This work used computational resources from the King Abdulaziz University Aziz High Performance Computing environment.
License and Base Model Notes
This repository contains LoRA adapters and project documentation. It does not redistribute full Llama or Qwen base model weights. Use of the root adapter with meta-llama/Llama-3.1-8B-Instruct is subject to the Meta Llama 3.1 Community License and Acceptable Use Policy. Use of the qwen_v0_1/ adapter with Qwen/Qwen2.5-1.5B-Instruct is subject to Qwen license terms. Users are responsible for verifying license compatibility for their own use case.