
Gengram: Retrieval-Augmented Genomic Foundation Models

1. Introduction

Gengram is a novel conditional memory module designed for genomic foundation models (GFMs) that introduces explicit motif memory retrieval to enhance Transformer-based DNA sequence modeling. Unlike traditional GFMs that rely on dense computation to implicitly infer multi-nucleotide motifs, Gengram provides an efficient lookup mechanism for biological patterns through a genomic-specific hashing scheme.
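
To make the lookup mechanism concrete, the following is a minimal sketch of a hash-based k-mer memory in PyTorch: every k-mer (k = 1–6) ending at a position is hashed into a fixed-size embedding table, and the retrieved embeddings are summed per position. The table size, hash function, and aggregation below are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

BASE_TO_ID = {"A": 0, "C": 1, "G": 2, "T": 3}

class KmerMemory(nn.Module):
    """Illustrative hash-based k-mer memory (a sketch, not the released Gengram code)."""

    def __init__(self, max_k: int = 6, table_size: int = 4096, dim: int = 64):
        # The model card lists an embedding dimension of 1024 per n-gram;
        # smaller defaults are used here so the sketch runs instantly.
        super().__init__()
        self.max_k = max_k
        self.table_size = table_size
        self.tables = nn.ModuleList([nn.Embedding(table_size, dim) for _ in range(max_k)])

    def _hash(self, base_ids):
        # Simple polynomial rolling hash over base ids (assumed hashing scheme).
        h = 0
        for b in base_ids:
            h = (h * 131 + b + 1) % self.table_size
        return h

    def forward(self, seq: str) -> torch.Tensor:
        ids = [BASE_TO_ID.get(b, 0) for b in seq.upper()]
        rows = []
        for i in range(len(ids)):
            vec = torch.zeros(self.tables[0].embedding_dim)
            for k in range(1, self.max_k + 1):
                if i - k + 1 < 0:
                    break  # not enough left context for longer k-mers
                idx = torch.tensor(self._hash(ids[i - k + 1 : i + 1]))
                vec = vec + self.tables[k - 1](idx)
            rows.append(vec)
        return torch.stack(rows)  # (seq_len, dim)

memory = KmerMemory()
print(memory("ACGTACGT").shape)  # torch.Size([8, 64])
```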

Figure 1 illustrates the overall architecture of Gengram, together with the evaluation pipeline used to assess its effectiveness across multiple genomic benchmarks.


✨ Key Features

  • 🎯 Explicit Motif Memory: Stores and retrieves k-mers (k=1-6) via hash-based lookup tables
  • 🧬 Local Window Aggregation: 21bp window mechanism aligned with DNA helical structure (sketched after this list)
  • ⚡ Computational Efficiency: Linear time complexity with minimal overhead
  • 🔧 Architecture Agnostic: Compatible with various attention mechanisms (MHA, GQA, MLA)
  • ⚖️ Stable Training: Improves load balancing in Mixture-of-Experts models
  • 🔍 Biological Interpretability: Learns meaningful motif representations
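
The local window aggregation listed above can be sketched as follows: retrieved per-position motif embeddings are averaged over a centered 21 bp window and injected into the hidden states through a learned, context-dependent gate. The gating form and the use of average pooling are assumptions for illustration, not the released design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowAggregator(nn.Module):
    """Illustrative 21 bp window aggregation with a learned gate (assumed form)."""

    def __init__(self, dim: int = 64, window: int = 21):
        super().__init__()
        self.window = window
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, hidden: torch.Tensor, motif: torch.Tensor) -> torch.Tensor:
        # hidden, motif: (batch, seq_len, dim)
        # Average the retrieved motif embeddings over a centered 21 bp window.
        pooled = F.avg_pool1d(
            motif.transpose(1, 2),        # (batch, dim, seq_len)
            kernel_size=self.window,
            stride=1,
            padding=self.window // 2,
        ).transpose(1, 2)                  # back to (batch, seq_len, dim)
        # Context-dependent gate decides how much retrieved memory to inject.
        g = torch.sigmoid(self.gate(torch.cat([hidden, pooled], dim=-1)))
        return hidden + g * pooled

agg = WindowAggregator()
h = torch.randn(2, 128, 64)
m = torch.randn(2, 128, 64)
print(agg(h, m).shape)  # torch.Size([2, 128, 64])
```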

✨ Biological Interpretability

Gengram exhibits clear biologically grounded behaviors, including:

  • Reverse-complement symmetry in memory embeddings (a simple check is sketched after this list)
  • Context-dependent gating aligned with functional regions
  • Hierarchical representation from shallow to deep layers
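
As an illustration of the first property, reverse-complement symmetry can be probed by comparing the memory embedding of a k-mer with that of its reverse complement; `motif_embedding` below is a hypothetical accessor standing in for a lookup into a trained Gengram checkpoint.

```python
import numpy as np

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer: str) -> str:
    """Reverse complement of a DNA k-mer, e.g. GATA -> TATC."""
    return kmer.upper().translate(_COMPLEMENT)[::-1]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rc_symmetry(kmer: str, motif_embedding) -> float:
    # `motif_embedding(kmer) -> np.ndarray` is hypothetical: it should return the
    # learned memory embedding of a k-mer from a trained checkpoint.
    return cosine(motif_embedding(kmer), motif_embedding(reverse_complement(kmer)))

# Toy usage with random embeddings (a real check would use trained weights):
rng = np.random.default_rng(0)
toy = {k: rng.normal(size=16) for k in ("GATA", "TATC")}
print(rc_symmetry("GATA", toy.__getitem__))
```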

2. Model Information

Model Configuration

The following details the model configuration, including the parameterization of Gengram, MoE routing strategies, and training hyperparameters used across all experiments. A combined example invocation is sketched after these tables.

  • Gengram Parameters

    These parameters control how Gengram operates within the Transformer layers, including which layers to apply it to, the n-gram sizes, and embedding dimensions.

| Parameter | Description | Example |
|---|---|---|
| --gengram-enabled | Enable Gengram | true |
| --gengram-layer-ids | Layers to apply Gengram | 3 6 10 |
| --gengram-ngram-sizes | N-gram sizes for DNA processing | 1 2 3 4 5 6 |
| --gengram-embed-dim-per-ngram | Embedding dimension per n-gram | 1024 |
| --gengram-window-size | Local window size (bp) | 21 |

  • Mixture of Experts (MoE)

    These parameters define the Mixture-of-Experts architecture, including the number of experts, routing top-k, and load balancing strategies during training.

| Parameter | Description | Default |
|---|---|---|
| --num-experts | Number of experts | 8 |
| --moe-router-topk | Top-k experts to route to | 2 |
| --moe-router-load-balancing-type | Load balancing strategy | aux_loss |
| --moe-aux-loss-coeff | Auxiliary loss coefficient | 1e-3 |

  • Training Parameters

    These parameters specify the training setup, including sequence length, batch sizes, precision, and attention optimizations.

| Parameter | Description | Example |
|---|---|---|
| --seq-length | Maximum sequence length | 8192 |
| --micro-batch-size | Micro batch size per GPU | 1 |
| --global-batch-size | Global batch size across all GPUs | 1024 |
| --bf16 | Use BF16 precision | true |
| --use-flash-attn | Enable Flash Attention | true |
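
To show how the flags above fit together, the sketch below assembles them into a single pretraining command line. The launcher and script name (`torchrun`, `pretrain_gengram.py`) are placeholders rather than the released entry point, and whether boolean flags take an explicit value depends on the actual argument parser.

```python
import shlex

# Placeholder launcher and script; the released training script is the one
# referenced in the Quickstart section below.
LAUNCHER = ["torchrun", "--nproc_per_node", "8", "pretrain_gengram.py"]

ARGS = [
    # Gengram memory module
    "--gengram-enabled",
    "--gengram-layer-ids", "3", "6", "10",
    "--gengram-ngram-sizes", "1", "2", "3", "4", "5", "6",
    "--gengram-embed-dim-per-ngram", "1024",
    "--gengram-window-size", "21",
    # Mixture of Experts
    "--num-experts", "8",
    "--moe-router-topk", "2",
    "--moe-router-load-balancing-type", "aux_loss",
    "--moe-aux-loss-coeff", "1e-3",
    # Training setup
    "--seq-length", "8192",
    "--micro-batch-size", "1",
    "--global-batch-size", "1024",
    "--bf16",
    "--use-flash-attn",
]

print(" ".join(shlex.quote(tok) for tok in LAUNCHER + ARGS))
```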

Pre-training Data

  • Human Sequences: HPRC Release 2, GRCh38, CHM13
  • Non-human Primates: NCBI RefSeq database
  • Total: 200B tokens (8k context) + 100B tokens (32k context)

3. Performance Evaluation

Gengram demonstrates strong performance across multiple genomic benchmarks, achieving competitive results despite being trained on significantly fewer tokens and with a smaller model size.

| Metric | Gengram-10B | Genos-10B | Evo2-40B |
|---|---|---|---|
| Trained Tokens | 200B | 2.2T | 9.3T |
| Multi-species Exon Classification | 0.9832 | 0.9755 | 0.9332 |
| Splice Site Identification | 0.9009 | 0.7990 | 0.9138 |
| Human OCR Ensembl | 0.7714 | 0.7623 | 0.7635 |

  • Key Observations

    • Data Efficiency: Achieves comparable performance using ~10×–40× fewer tokens
    • Motif-Dominated Tasks: Up to 14% improvement
    • Long-Context Modeling: Enhanced long-context performance despite training on shorter sequences
    • Training Efficiency: Better parameter utilization and stable MoE training
  • Evaluation Benchmarks

    • Genomic Benchmarks (GB)
    • Nucleotide Transformer Benchmarks (NTB)
    • Long-Range Benchmarks (LRB)
    • Genos Benchmarks (GeB)

4. Quickstart

Model Download

The Gengram model is available for download from Hugging Face. We provide the torch (PyTorch) version. A minimal download sketch follows the table below.

| Model | Activated Params | Hugging Face | Format |
|---|---|---|---|
| Gengram-10B | 2.87 B | 🤗 Hugging Face | torch |
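
A minimal download sketch using the `huggingface_hub` client; the repository id `BGI-HangzhouAI/Gengram` is taken from this page, but verify the exact checkpoint name on the organization page before use.

```python
# Download the full repository snapshot (repo id assumed from this page).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="BGI-HangzhouAI/Gengram")
print(f"Model files downloaded to: {local_dir}")
```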

Pre-training

Run the pre-training script with the following command:

cd Gengram
bash Gengram_layer3-6-10_win21_pp2.sh

5. License

This repository and the Gengram model weights are licensed under the Apache License 2.0.

Please note that the primary use of the Gengram model is to support genomics research, providing researchers with advanced analytical capabilities and long-context modeling tools powered by large-scale foundation models for the human genome. It is not intended for use in any manner that violates applicable laws or regulations, nor for any activities prohibited by the license agreement.

6. Citation and Acknowledgements

We acknowledge the high-quality sequencing data provided by CycloneSEQ, which forms an important foundation for this work. We also appreciate the inspiration from DeepSeek's Engram module and the framework support provided by Megatron-LM. Model training was conducted on the 021 Science Foundation Model and Zero2X open platform.

If you use this work in your research, please cite the following paper:

@article{gengram2026,
  title={Beyond Conditional Computation: Retrieval-Augmented Genomic Foundation Models with Gengram},
  author={Genos Team and Xu, Huinan and Feng, Xuyang and Chen, Junhong and Liu, Junchen and Deng, Kaiwen and Ding, Kai and Long, Shengning and Shuai, Jiaxue and Li, Zhaorong and Liu, Shiping and Xue, Guirong and Xiao, Zhan},
  journal={arXiv preprint arXiv:2601.22203},
  year={2026}
}

7. Contact

For project-related questions, please open an issue. You can also contact the Genos Team at [email protected].
