---
license: cc-by-nc-sa-4.0
library_name: pytorch
tags:
- proteomics
- mass-spectrometry
- peptide-sequencing
- de-novo-sequencing
- diffusion
- multinomial-diffusion
- biology
- computational-biology
pipeline_tag: text-generation
datasets:
- InstaDeepAI/ms_ninespecies_benchmark
- InstaDeepAI/ms_proteometools
---
# InstaNovoPlus: Diffusion-Powered De novo Peptide Sequencing Model
## Model Description
InstaNovoPlus is a diffusion-based model for de novo peptide sequencing from mass spectrometry data. It uses multinomial diffusion to identify peptides accurately without a sequence database, at the scale of large proteomics experiments.
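To give a feel for what "multinomial diffusion" means on a token sequence, here is a minimal sketch of a single forward (noising) step in plain Python. It is an illustration only, not the InstaNovoPlus implementation: the function name, the residue indices, the noise levels, and the vocabulary size are all hypothetical. In general, a diffusion model is trained to undo this kind of corruption step by step.

```python
import random


def forward_noise_step(tokens, beta, vocab_size, rng):
    """Apply one multinomial-diffusion noising step to a token sequence.

    Each token is kept with probability (1 - beta); otherwise it is
    resampled uniformly from the vocabulary (it may land on itself again).
    """
    return [
        tok if rng.random() >= beta else rng.randrange(vocab_size)
        for tok in tokens
    ]


rng = random.Random(0)
peptide = [3, 7, 7, 1, 12]  # hypothetical residue indices
noisy = peptide
for beta in (0.1, 0.3, 0.6):  # hypothetical noise schedule
    noisy = forward_noise_step(noisy, beta, vocab_size=20, rng=rng)
print(noisy)
```

With `beta=0` the sequence is returned unchanged, and with `beta=1` every position is drawn uniformly at random, which are the two ends of the noising process.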
## Usage
```python
import torch
import numpy as np
import pandas as pd
from instanovo.diffusion.multinomial_diffusion import InstaNovoPlus
from instanovo.utils import SpectrumDataFrame
from instanovo.transformer.dataset import SpectrumDataset, collate_batch
from torch.utils.data import DataLoader
from instanovo.inference import ScoredSequence
from instanovo.inference.diffusion import DiffusionDecoder
from instanovo.utils.metrics import Metrics
from tqdm.notebook import tqdm
# Load the model from the Hugging Face Hub
model, config = InstaNovoPlus.from_pretrained("InstaDeepAI/instanovoplus-v1.1.0")
# Move the model to the GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()
# Update the residue set with custom modifications
model.residue_set.update_remapping(
    {
        "M(ox)": "M[UNIMOD:35]",  # Oxidation
        "M(+15.99)": "M[UNIMOD:35]",
        "S(p)": "S[UNIMOD:21]",  # Phosphorylation
        "T(p)": "T[UNIMOD:21]",
        "Y(p)": "Y[UNIMOD:21]",
        "S(+79.97)": "S[UNIMOD:21]",
        "T(+79.97)": "T[UNIMOD:21]",
        "Y(+79.97)": "Y[UNIMOD:21]",
        "Q(+0.98)": "Q[UNIMOD:7]",  # Deamidation
        "N(+0.98)": "N[UNIMOD:7]",
        "Q(+.98)": "Q[UNIMOD:7]",
        "N(+.98)": "N[UNIMOD:7]",
        "C(+57.02)": "C[UNIMOD:4]",  # Carboxyamidomethylation
        "(+42.01)": "[UNIMOD:1]",  # Acetylation
        "(+43.01)": "[UNIMOD:5]",  # Carbamylation
        "(-17.03)": "[UNIMOD:385]",  # Ammonia loss
    }
)
# Load the test data
sdf = SpectrumDataFrame.from_huggingface(
    "InstaDeepAI/ms_ninespecies_benchmark",
    is_annotated=True,
    shuffle=False,
    split="test[:10%]",  # Let's only use a subset of the test data for faster inference
)
# Create the dataset
ds = SpectrumDataset(
    sdf,
    model.residue_set,
    config.get("n_peaks", 200),
    return_str=False,
    annotated=True,
    peptide_pad_length=model.config.get("max_length", 30),
    reverse_peptide=False,  # we do not reverse peptide for diffusion
    add_eos=False,
    tokenize_peptide=True,
)
# Create the data loader
dl = DataLoader(
    ds,
    batch_size=64,
    num_workers=0,  # sdf requirement, handled internally
    shuffle=False,  # sdf requirement, handled internally
    collate_fn=collate_batch,
)
# Create the decoder
diffusion_decoder = DiffusionDecoder(model=model)
predictions = []
log_probs = []
targets = []
# Iterate over the data loader
for batch in tqdm(dl, total=len(dl)):
    spectra, precursors, spectra_padding_mask, peptides, _ = batch
    spectra = spectra.to(device)
    precursors = precursors.to(device)
    spectra_padding_mask = spectra_padding_mask.to(device)
    peptides = peptides.to(device)
    # Perform inference
    with torch.no_grad():
        batch_predictions, batch_log_probs = diffusion_decoder.decode(
            spectra=spectra,
            spectra_padding_mask=spectra_padding_mask,
            precursors=precursors,
            initial_sequence=peptides,
        )
    predictions.extend(batch_predictions)
    log_probs.extend(batch_log_probs)
    # Keep the ground-truth peptides of every batch for evaluation.
    # Assumption: `ResidueSet.decode` maps token indices back to residue
    # strings; check this against your installed InstaNovo version.
    targets.extend(model.residue_set.decode(seq.tolist()) for seq in peptides)
# Initialize metrics
metrics = Metrics(model.residue_set, config["isotope_error_range"])
# Compute precision and recall
aa_precision, aa_recall, peptide_recall, peptide_precision = metrics.compute_precision_recall(
    targets, predictions
)
# Compute amino acid error rate and AUC
aa_error_rate = metrics.compute_aa_er(targets, predictions)
auc = metrics.calc_auc(targets, predictions, np.exp(pd.Series(log_probs)))
print(f"amino acid error rate: {aa_error_rate:.5f}")
print(f"amino acid precision: {aa_precision:.5f}")
print(f"amino acid recall: {aa_recall:.5f}")
print(f"peptide precision: {peptide_precision:.5f}")
print(f"peptide recall: {peptide_recall:.5f}")
print(f"area under the PR curve: {auc:.5f}")
```
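The evaluation code exponentiates the decoder's log-probabilities before passing them to `calc_auc`, which turns them into confidences between 0 and 1. The same conversion can be used to keep only confident predictions. The sketch below uses only the standard library; the helper name `filter_by_confidence`, the example sequences, and the threshold are hypothetical and not part of the InstaNovo API.

```python
import math


def filter_by_confidence(predictions, log_probs, min_confidence):
    """Return (prediction, confidence) pairs whose confidence passes the threshold.

    `log_probs` are natural-log probabilities, so exp() maps them to [0, 1].
    """
    kept = []
    for prediction, log_prob in zip(predictions, log_probs):
        confidence = math.exp(log_prob)
        if confidence >= min_confidence:
            kept.append((prediction, confidence))
    return kept


# Hypothetical predictions with their log-probabilities
example_predictions = ["PEPTIDE", "ACDK"]
example_log_probs = [math.log(0.9), math.log(0.2)]
confident = filter_by_confidence(example_predictions, example_log_probs, 0.5)
print(confident)
```

A suitable threshold depends on the data set and on the precision you need, so treat `0.5` here as a placeholder.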
For more explanation, see the [Getting Started notebook](https://github.com/instadeepai/InstaNovo/blob/main/notebooks/getting_started_with_instanovo.ipynb) in the repository.
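The remapping passed to `update_remapping` translates legacy modification notation (such as `M(ox)` or `S(+79.97)`) into UNIMOD notation. The following standalone sketch shows the intent of such a mapping using plain string replacement. It is not how `ResidueSet` implements remapping; the helper `remap_modifications` and the example peptide are hypothetical.

```python
def remap_modifications(sequence, remapping):
    """Rewrite legacy modification notation in a peptide string.

    Longer keys are applied first so that a residue-specific key such as
    "M(+15.99)" wins over any shorter key that it contains.
    """
    for old in sorted(remapping, key=len, reverse=True):
        sequence = sequence.replace(old, remapping[old])
    return sequence


# A subset of the mapping used in the usage example
remapping = {
    "M(ox)": "M[UNIMOD:35]",
    "S(+79.97)": "S[UNIMOD:21]",
    "(+42.01)": "[UNIMOD:1]",
}
print(remap_modifications("(+42.01)M(ox)PEPS(+79.97)K", remapping))
```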
## Citation
If you use InstaNovoPlus in your research, please cite:
```bibtex
@article{eloff_kalogeropoulos_2025_instanovo,
  title   = {InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale
             proteomics experiments},
  author  = {Eloff, Kevin and Kalogeropoulos, Konstantinos and Mabona, Amandla and Morell,
             Oliver and Catzel, Rachel and Rivera-de-Torre, Esperanza and Berg Jespersen,
             Jakob and Williams, Wesley and van Beljouw, Sam P. B. and Skwark, Marcin J.
             and Laustsen, Andreas Hougaard and Brouns, Stan J. J. and Ljungars,
             Anne and Schoof, Erwin M. and Van Goey, Jeroen and auf dem Keller, Ulrich and
             Beguir, Karim and Lopez Carranza, Nicolas and Jenkins, Timothy P.},
  year    = {2025},
  month   = {Mar},
  day     = {31},
  journal = {Nature Machine Intelligence},
  doi     = {10.1038/s42256-025-01019-5},
  issn    = {2522-5839},
  url     = {https://doi.org/10.1038/s42256-025-01019-5}
}
```
## Resources
- **Code Repository**: [https://github.com/instadeepai/InstaNovo](https://github.com/instadeepai/InstaNovo)
- **Documentation**: [https://instadeepai.github.io/InstaNovo/](https://instadeepai.github.io/InstaNovo/)
- **Publication**: [https://www.nature.com/articles/s42256-025-01019-5](https://www.nature.com/articles/s42256-025-01019-5)
## License
- **Code**: Licensed under Apache License 2.0
- **Model Checkpoints**: Licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 (CC BY-NC-SA 4.0)
## Installation
```bash
pip install instanovo
```
For GPU support, install with CUDA dependencies:
```bash
pip install "instanovo[cu126]"
```
## Requirements
- Python >= 3.10, < 3.13
- PyTorch >= 1.13.0
- CUDA (optional, for GPU acceleration)
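A quick way to check an interpreter against the Python bounds listed above is sketched below. The helper name `python_version_supported` is hypothetical; only the version range comes from the requirements list.

```python
import sys


def python_version_supported(version_info=sys.version_info):
    """Return True if the (major, minor) version satisfies >= 3.10 and < 3.13."""
    return (3, 10) <= tuple(version_info[:2]) < (3, 13)


if not python_version_supported():
    print("This Python version is outside the supported range (>= 3.10, < 3.13).")
```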
## Support
For questions, issues, or contributions, please visit the [GitHub repository](https://github.com/instadeepai/InstaNovo) or check the [documentation](https://instadeepai.github.io/InstaNovo/).