
Gengram: Retrieval-Augmented Genomic Foundation Models

1. Introduction

Gengram is a novel conditional memory module designed for genomic foundation models (GFMs) that introduces explicit motif memory retrieval to enhance Transformer-based DNA sequence modeling. Unlike traditional GFMs that rely on dense computation to implicitly infer multi-nucleotide motifs, Gengram provides an efficient lookup mechanism for biological patterns through a genomic-specific hashing scheme.
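
To make the lookup mechanism concrete, the following is a minimal sketch of a hash-based k-mer memory in PyTorch: every k-mer (k = 1–6) ending at a position is hashed into a fixed-size embedding table, and the retrieved embeddings are summed per position. The table size, hash function, and aggregation below are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

BASE_TO_ID = {"A": 0, "C": 1, "G": 2, "T": 3}

class KmerMemory(nn.Module):
    """Illustrative hash-based k-mer memory (a sketch, not the released Gengram code)."""

    def __init__(self, max_k: int = 6, table_size: int = 4096, dim: int = 64):
        # The model card lists an embedding dimension of 1024 per n-gram;
        # smaller defaults are used here so the sketch runs instantly.
        super().__init__()
        self.max_k = max_k
        self.table_size = table_size
        self.tables = nn.ModuleList([nn.Embedding(table_size, dim) for _ in range(max_k)])

    def _hash(self, base_ids):
        # Simple polynomial rolling hash over base ids (assumed hashing scheme).
        h = 0
        for b in base_ids:
            h = (h * 131 + b + 1) % self.table_size
        return h

    def forward(self, seq: str) -> torch.Tensor:
        ids = [BASE_TO_ID.get(b, 0) for b in seq.upper()]
        rows = []
        for i in range(len(ids)):
            vec = torch.zeros(self.tables[0].embedding_dim)
            for k in range(1, self.max_k + 1):
                if i - k + 1 < 0:
                    break  # not enough left context for longer k-mers
                idx = torch.tensor(self._hash(ids[i - k + 1 : i + 1]))
                vec = vec + self.tables[k - 1](idx)
            rows.append(vec)
        return torch.stack(rows)  # (seq_len, dim)

memory = KmerMemory()
print(memory("ACGTACGT").shape)  # torch.Size([8, 64])
```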

Figure 1 illustrates the overall architecture of Gengram, together with the evaluation pipeline used to assess its effectiveness across multiple genomic benchmarks.


✨ Key Features

  • 🎯 Explicit Motif Memory: Stores and retrieves k-mers (k=1-6) via hash-based lookup tables
  • 🧬 Local Window Aggregation: 21bp window mechanism aligned with DNA helical structure (sketched after this list)
  • ⚡ Computational Efficiency: Linear time complexity with minimal overhead
  • 🔧 Architecture Agnostic: Compatible with various attention mechanisms (MHA, GQA, MLA)
  • ⚖️ Stable Training: Improves load balancing in Mixture-of-Experts models
  • 🔍 Biological Interpretability: Learns meaningful motif representations
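
The local window aggregation listed above can be sketched as follows: retrieved per-position motif embeddings are averaged over a centered 21 bp window and injected into the hidden states through a learned, context-dependent gate. The gating form and the use of average pooling are assumptions for illustration, not the released design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowAggregator(nn.Module):
    """Illustrative 21 bp window aggregation with a learned gate (assumed form)."""

    def __init__(self, dim: int = 64, window: int = 21):
        super().__init__()
        self.window = window
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, hidden: torch.Tensor, motif: torch.Tensor) -> torch.Tensor:
        # hidden, motif: (batch, seq_len, dim)
        # Average the retrieved motif embeddings over a centered 21 bp window.
        pooled = F.avg_pool1d(
            motif.transpose(1, 2),        # (batch, dim, seq_len)
            kernel_size=self.window,
            stride=1,
            padding=self.window // 2,
        ).transpose(1, 2)                  # back to (batch, seq_len, dim)
        # Context-dependent gate decides how much retrieved memory to inject.
        g = torch.sigmoid(self.gate(torch.cat([hidden, pooled], dim=-1)))
        return hidden + g * pooled

agg = WindowAggregator()
h = torch.randn(2, 128, 64)
m = torch.randn(2, 128, 64)
print(agg(h, m).shape)  # torch.Size([2, 128, 64])
```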

✨ Biological Interpretability

Gengram exhibits clear biologically grounded behaviors, including:

  • Reverse-complement symmetry in memory embeddings (a simple check is sketched after this list)
  • Context-dependent gating aligned with functional regions
  • Hierarchical representation from shallow to deep layers
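
As an illustration of the first property, reverse-complement symmetry can be probed by comparing the memory embedding of a k-mer with that of its reverse complement; `motif_embedding` below is a hypothetical accessor standing in for a lookup into a trained Gengram checkpoint.

```python
import numpy as np

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer: str) -> str:
    """Reverse complement of a DNA k-mer, e.g. GATA -> TATC."""
    return kmer.upper().translate(_COMPLEMENT)[::-1]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rc_symmetry(kmer: str, motif_embedding) -> float:
    # `motif_embedding(kmer) -> np.ndarray` is hypothetical: it should return the
    # learned memory embedding of a k-mer from a trained checkpoint.
    return cosine(motif_embedding(kmer), motif_embedding(reverse_complement(kmer)))

# Toy usage with random embeddings (a real check would use trained weights):
rng = np.random.default_rng(0)
toy = {k: rng.normal(size=16) for k in ("GATA", "TATC")}
print(rc_symmetry("GATA", toy.__getitem__))
```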

2. Model Information

Model Configuration

The following details the model configuration, including the parameterization of Gengram, MoE routing strategies, and training hyperparameters used across all experiments. A combined example invocation is sketched after these tables.

  • Gengram Parameters

    These parameters control how Gengram operates within the Transformer layers, including which layers to apply it to, the n-gram sizes, and embedding dimensions.

| Parameter | Description | Example |
|---|---|---|
| --gengram-enabled | Enable Gengram | true |
| --gengram-layer-ids | Layers to apply Gengram | 3 6 10 |
| --gengram-ngram-sizes | N-gram sizes for DNA processing | 1 2 3 4 5 6 |
| --gengram-embed-dim-per-ngram | Embedding dimension per n-gram | 1024 |
| --gengram-window-size | Local window size (bp) | 21 |

  • Mixture of Experts (MoE)

    These parameters define the Mixture-of-Experts architecture, including the number of experts, routing top-k, and load balancing strategies during training.

| Parameter | Description | Default |
|---|---|---|
| --num-experts | Number of experts | 8 |
| --moe-router-topk | Top-k experts to route to | 2 |
| --moe-router-load-balancing-type | Load balancing strategy | aux_loss |
| --moe-aux-loss-coeff | Auxiliary loss coefficient | 1e-3 |

  • Training Parameters

    These parameters specify the training setup, including sequence length, batch sizes, precision, and attention optimizations.

| Parameter | Description | Example |
|---|---|---|
| --seq-length | Maximum sequence length | 8192 |
| --micro-batch-size | Micro batch size per GPU | 1 |
| --global-batch-size | Global batch size across all GPUs | 1024 |
| --bf16 | Use BF16 precision | true |
| --use-flash-attn | Enable Flash Attention | true |
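
To show how the flags above fit together, the sketch below assembles them into a single pretraining command line. The launcher and script name (`torchrun`, `pretrain_gengram.py`) are placeholders rather than the released entry point, and whether boolean flags take an explicit value depends on the actual argument parser.

```python
import shlex

# Placeholder launcher and script; the released training script is the one
# referenced in the Quickstart section below.
LAUNCHER = ["torchrun", "--nproc_per_node", "8", "pretrain_gengram.py"]

ARGS = [
    # Gengram memory module
    "--gengram-enabled",
    "--gengram-layer-ids", "3", "6", "10",
    "--gengram-ngram-sizes", "1", "2", "3", "4", "5", "6",
    "--gengram-embed-dim-per-ngram", "1024",
    "--gengram-window-size", "21",
    # Mixture of Experts
    "--num-experts", "8",
    "--moe-router-topk", "2",
    "--moe-router-load-balancing-type", "aux_loss",
    "--moe-aux-loss-coeff", "1e-3",
    # Training setup
    "--seq-length", "8192",
    "--micro-batch-size", "1",
    "--global-batch-size", "1024",
    "--bf16",
    "--use-flash-attn",
]

print(" ".join(shlex.quote(tok) for tok in LAUNCHER + ARGS))
```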

Pre-training Data

  • Human Sequences: HPRC Release 2, GRCh38, CHM13
  • Non-human Primates: NCBI RefSeq database
  • Total: 200B tokens (8k context) + 100B tokens (32k context)

3. Performance Evaluation

Gengram demonstrates strong performance across multiple genomic benchmarks, achieving competitive results despite being trained on significantly fewer tokens and with a smaller model size.

| Metric | Gengram-10B | Genos-10B | Evo2-40B |
|---|---|---|---|
| Trained Tokens | 200B | 2.2T | 9.3T |
| Multi-species Exon Classification | 0.9832 | 0.9755 | 0.9332 |
| Splice Site Identification | 0.9009 | 0.7990 | 0.9138 |
| Human OCR Ensembl | 0.7714 | 0.7623 | 0.7635 |

  • Key Observations

    • Data Efficiency: Achieves comparable performance using ~10×–40× fewer tokens
    • Motif-Dominated Tasks: Up to 14% improvement
    • Long-Context Modeling: Enhanced long-context performance despite training on shorter sequences
    • Training Efficiency: Better parameter utilization and stable MoE training
  • Evaluation Benchmarks

    • Genomic Benchmarks (GB)
    • Nucleotide Transformer Benchmarks (NTB)
    • Long-Range Benchmarks (LRB)
    • Genos Benchmarks (GeB)

4. Quickstart

Model Download

The Gengram model is available for download from Hugging Face. We provide the torch (PyTorch) version. A minimal download sketch follows the table below.

| Model | Activated Params | Hugging Face | Format |
|---|---|---|---|
| Gengram-10B | 2.87 B | 🤗 Hugging Face | torch |
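
A minimal download sketch using the `huggingface_hub` client; the repository id `BGI-HangzhouAI/Gengram` is taken from this page, but verify the exact checkpoint name on the organization page before use.

```python
# Download the full repository snapshot (repo id assumed from this page).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="BGI-HangzhouAI/Gengram")
print(f"Model files downloaded to: {local_dir}")
```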

Pre-training

Run the pre-training script with the following command:

cd Gengram
bash Gengram_layer3-6-10_win21_pp2.sh

5. License

This repository and the Gengram model weights are licensed under the Apache License 2.0.

Please note that the primary use of the Gengram model is to support genomics research, providing researchers with advanced analytical capabilities and long-context modeling tools powered by large-scale foundation models for the human genome. It is not intended for use in any manner that violates applicable laws or regulations, nor for any activities prohibited by the license agreement.

6. Citation and Acknowledgements

We acknowledge the high-quality sequencing data provided by CycloneSEQ, which forms an important foundation for this work. We also appreciate the inspiration from DeepSeek's Engram module and the framework support provided by Megatron-LM. Model training was conducted on the 021 Science Foundation Model and Zero2X open platform.

If you use this work in your research, please cite the following paper:

@article{gengram2026,
  title={Beyond Conditional Computation: Retrieval-Augmented Genomic Foundation Models with Gengram},
  author={Genos Team and Xu, Huinan and Feng, Xuyang and Chen, Junhong and Liu, Junchen and Deng, Kaiwen and Ding, Kai and Long, Shengning and Shuai, Jiaxue and Li, Zhaorong and Liu, Shiping and Xue, Guirong and Xiao, Zhan},
  journal={arXiv preprint arXiv:2601.22203},
  year={2026}
}

7. Contact

For project-related questions, please open an issue. You can also contact the Genos Team at [email protected].
