Gengram: Retrieval-Augmented Genomic Foundation Models
1. Introduction
Gengram is a novel conditional memory module designed for genomic foundation models (GFMs) that introduces explicit motif memory retrieval to enhance Transformer-based DNA sequence modeling. Unlike traditional GFMs that rely on dense computation to implicitly infer multi-nucleotide motifs, Gengram provides an efficient lookup mechanism for biological patterns through a genomic-specific hashing scheme.
Figure 1 illustrates the overall architecture of Gengram, together with the evaluation pipeline used to assess its effectiveness across multiple genomic benchmarks.
Key Features
- Explicit Motif Memory: Stores and retrieves k-mers (k = 1–6) via hash-based lookup tables (see the sketch below)
- Local Window Aggregation: 21 bp window mechanism aligned with DNA helical structure
- Computational Efficiency: Linear time complexity with minimal overhead
- Architecture Agnostic: Compatible with various attention mechanisms (MHA, GQA, MLA)
- Stable Training: Improves load balancing in Mixture-of-Experts models
- Biological Interpretability: Learns meaningful motif representations
Biological Interpretability
Gengram exhibits clear biologically grounded behaviors, including:
- Reverse-complement symmetry in memory embeddings
- Context-dependent gating aligned with functional regions
- Hierarchical representation from shallow to deep layers
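To make the retrieval mechanism concrete, the following is a minimal PyTorch sketch of hash-based k-mer memory lookup with 21 bp window aggregation and context-dependent gating. It is an illustration only: the module and parameter names (`GengramMemory`, `num_buckets`, the rolling base-4 hash) are assumptions made for this sketch and are not taken from the released Gengram code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GengramMemory(nn.Module):
    """Illustrative motif memory: hash k-mers (k = 1..6) into lookup tables,
    aggregate the retrieved embeddings over a local window, and gate the result
    into the token hidden states. Hashing and bucket sizes are assumptions."""

    def __init__(self, hidden_dim, ngram_sizes=(1, 2, 3, 4, 5, 6),
                 embed_dim_per_ngram=1024, window_size=21, num_buckets=4 ** 8):
        super().__init__()
        self.ngram_sizes = ngram_sizes
        self.window_size = window_size
        self.num_buckets = num_buckets
        # One hash-indexed embedding table per k-mer size.
        self.tables = nn.ModuleList(
            [nn.Embedding(num_buckets, embed_dim_per_ngram) for _ in ngram_sizes]
        )
        self.proj = nn.Linear(embed_dim_per_ngram * len(ngram_sizes), hidden_dim)
        self.gate = nn.Linear(hidden_dim, 1)

    def _kmer_hash(self, tokens, k):
        """Base-4 code of the k nucleotides ending at each position, folded into
        a fixed number of buckets (collision-free for small k)."""
        code = torch.zeros_like(tokens)
        for offset in range(k):
            shifted = torch.roll(tokens, shifts=offset, dims=1)
            shifted[:, :offset] = 0  # positions before the sequence start act as padding
            code = code * 4 + shifted
        return code % self.num_buckets

    def forward(self, tokens, hidden):
        # tokens: (batch, seq_len) nucleotide ids in {0: A, 1: C, 2: G, 3: T}
        # hidden: (batch, seq_len, hidden_dim) Transformer hidden states
        retrieved = [table(self._kmer_hash(tokens, k))
                     for k, table in zip(self.ngram_sizes, self.tables)]
        mem = self.proj(torch.cat(retrieved, dim=-1))      # (batch, seq_len, hidden_dim)
        # Aggregate the retrieved memory over a 21 bp window centered on each position.
        mem = F.avg_pool1d(mem.transpose(1, 2), kernel_size=self.window_size,
                           stride=1, padding=self.window_size // 2).transpose(1, 2)
        # Context-dependent gate decides how much retrieved memory to inject.
        return hidden + torch.sigmoid(self.gate(hidden)) * mem

if __name__ == "__main__":
    tokens = torch.randint(0, 4, (2, 128))       # toy batch of two 128 bp sequences
    hidden = torch.randn(2, 128, 256)
    out = GengramMemory(hidden_dim=256)(tokens, hidden)
    print(out.shape)                             # torch.Size([2, 128, 256])
```

In this sketch the gate is a single per-position scalar on the hidden state, mirroring the context-dependent gating behavior listed above; the actual module may gate per channel or per n-gram table.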
2. Model Information
Model Configuration
The following details the model configuration, including the parameterization of Gengram, MoE routing strategies, and training hyperparameters used across all experiments.
Gengram Parameters
These parameters control how Gengram operates within the Transformer layers, including which layers to apply it to, the n-gram sizes, and embedding dimensions.
| Parameter | Description | Example |
|---|---|---|
| `--gengram-enabled` | Enable Gengram | `true` |
| `--gengram-layer-ids` | Layers to apply Gengram to | `3 6 10` |
| `--gengram-ngram-sizes` | N-gram sizes for DNA processing | `1 2 3 4 5 6` |
| `--gengram-embed-dim-per-ngram` | Embedding dimension per n-gram | `1024` |
| `--gengram-window-size` | Local aggregation window size (bp) | `21` |
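As a hedged illustration of `--gengram-layer-ids`, the snippet below shows how a layer stack could attach a motif-memory module only to the listed layers; the layer count, dimensions, and the `nn.Identity` stand-in are placeholder choices made to keep the example self-contained, not the repository's actual construction code.

```python
import torch.nn as nn

# Hypothetical illustration of --gengram-layer-ids 3 6 10: only the listed
# Transformer layers carry a motif-memory module.
GENGRAM_LAYER_IDS = {3, 6, 10}
NUM_LAYERS = 12
HIDDEN_DIM = 512

layers = nn.ModuleList()
for layer_id in range(NUM_LAYERS):
    block = nn.TransformerEncoderLayer(d_model=HIDDEN_DIM, nhead=8, batch_first=True)
    # nn.Identity stands in for the Gengram memory module (see the sketch in the
    # introduction) so that this example stays self-contained and runnable.
    block.gengram = nn.Identity() if layer_id in GENGRAM_LAYER_IDS else None
    layers.append(block)

print([i for i, blk in enumerate(layers) if blk.gengram is not None])  # -> [3, 6, 10]
```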
Mixture of Experts (MoE)
These parameters define the Mixture-of-Experts architecture, including the number of experts, routing top-k, and load balancing strategies during training.
| Parameter | Description | Default |
|---|---|---|
| `--num-experts` | Number of experts | `8` |
| `--moe-router-topk` | Top-k experts to route to | `2` |
| `--moe-router-load-balancing-type` | Load balancing strategy | `aux_loss` |
| `--moe-aux-loss-coeff` | Auxiliary loss coefficient | `1e-3` |
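For reference, the `aux_loss` strategy typically corresponds to the standard auxiliary load-balancing term used in top-k MoE routing (Switch Transformer / GShard style). Below is a minimal sketch of that term with the defaults from the table; Megatron-LM's exact formulation and scaling may differ.

```python
import torch
import torch.nn.functional as F

def moe_aux_loss(router_logits, num_experts=8, top_k=2, coeff=1e-3):
    """Standard auxiliary load-balancing loss for top-k MoE routing.

    router_logits: (num_tokens, num_experts) pre-softmax router scores.
    Encourages the fraction of tokens dispatched to each expert to match the
    average router probability mass assigned to that expert.
    """
    probs = F.softmax(router_logits, dim=-1)                 # (tokens, experts)
    top_k_idx = probs.topk(top_k, dim=-1).indices            # experts chosen per token
    # f_e: fraction of (token, slot) assignments routed to expert e.
    dispatch = F.one_hot(top_k_idx, num_experts).float().sum(dim=1)  # (tokens, experts)
    f = dispatch.mean(dim=0) / top_k
    # p_e: mean router probability for expert e.
    p = probs.mean(dim=0)
    # Minimized when both distributions are uniform (1 / num_experts each).
    return coeff * num_experts * torch.sum(f * p)

logits = torch.randn(16, 8)          # 16 tokens routed over 8 experts
print(moe_aux_loss(logits))
```

A small coefficient such as `1e-3` keeps this term from dominating the language-modeling loss while still discouraging expert collapse.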
Training Parameters
These parameters specify the training setup, including sequence length, batch sizes, precision, and attention optimizations.
| Parameter | Description | Example |
|---|---|---|
| `--seq-length` | Maximum sequence length | `8192` |
| `--micro-batch-size` | Micro batch size per GPU | `1` |
| `--global-batch-size` | Global batch size across all GPUs | `1024` |
| `--bf16` | Use BF16 precision | `true` |
| `--use-flash-attn` | Enable Flash Attention | `true` |
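To see how the batch-size flags interact, the snippet below works through the usual Megatron-style relation global batch = micro batch × data-parallel size × gradient-accumulation steps; the data-parallel size used here is an assumed example value, not part of the documented setup.

```python
def grad_accum_steps(global_batch_size, micro_batch_size, data_parallel_size):
    """Gradient-accumulation steps implied by the batch-size flags.
    global = micro * data_parallel * grad_accum (must divide evenly)."""
    per_step = micro_batch_size * data_parallel_size
    assert global_batch_size % per_step == 0, "global batch must divide evenly"
    return global_batch_size // per_step

# Example: --global-batch-size 1024, --micro-batch-size 1, and an assumed
# data-parallel size of 64 GPUs -> 16 accumulation steps per optimizer update.
print(grad_accum_steps(1024, 1, 64))
```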
Pre-training Data
- Human Sequences: HPRC Release 2, GRCh38, CHM13
- Non-human Primates: NCBI RefSeq database
- Total: 200B tokens (8k context) + 100B tokens (32k context)
3. Performance Evaluation
Gengram demonstrates strong performance across multiple genomic benchmarks, achieving competitive results despite being trained on significantly fewer tokens and with a smaller model size.
| Metric | Gengram-10B | Genos-10B | Evo2-40B |
|---|---|---|---|
| Trained Tokens | 200B | 2.2T | 9.3T |
| Multi-species Exon Classification | 0.9832 | 0.9755 | 0.9332 |
| Splice Site Identification | 0.9009 | 0.7990 | 0.9138 |
| Human OCR Ensembl | 0.7714 | 0.7623 | 0.7635 |
Key Observations
- Data Efficiency: Achieves comparable performance using ~10×–40× fewer tokens
- Motif-Dominated Tasks: Up to 14% improvement
- Long-Context Modeling: Strong long-range performance despite pre-training on comparatively short contexts
- Training Efficiency: Better parameter utilization and stable MoE training
Evaluation Benchmarks
- Genomic Benchmarks (GB)
- Nucleotide Transformer Benchmarks (NTB)
- Long-Range Benchmarks (LRB)
- Genos Benchmarks (GeB)
4. Quickstart
Model Download
The Gengram model is available for download from Hugging Face. We provide the weights in torch format.
| Model | Activated Params | Hugging Face | Format |
|---|---|---|---|
| Gengram-10B | 2.87B | Hugging Face | torch |
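One typical way to fetch the checkpoint is `huggingface_hub.snapshot_download`; the repository id below is a placeholder, since the exact Hugging Face namespace is not stated in this table.

```python
from huggingface_hub import snapshot_download

# NOTE: "GenosTeam/Gengram-10B" is a placeholder repo id, not a confirmed
# namespace; substitute the repository linked from the table above.
local_dir = snapshot_download(repo_id="GenosTeam/Gengram-10B")
print("Checkpoint downloaded to:", local_dir)
```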
Pre-training
Run the pre-training script with the following command:
    cd Gengram
    bash Gengram_layer3-6-10_win21_pp2.sh
5. License
This repository and the Gengram model weights are licensed under the Apache License 2.0.
Please note that the primary use of the Gengram model is to support genomics research, providing researchers with advanced analytical capabilities and long-context modeling tools powered by large-scale foundation models for the human genome. It is not intended for use in any manner that violates applicable laws or regulations, nor for any activities prohibited by the license agreement.
6. Citation and Acknowledgements
We acknowledge the high-quality sequencing data provided by CycloneSEQ, which forms an important foundation for this work. We also appreciate the inspiration from DeepSeek's Engram module and the framework support provided by Megatron-LM. Model training was conducted on the 021 Science Foundation Model and Zero2X open platform.
If you use this work in your research, please cite the following paper:
@article{gengram2026,
title={Beyond Conditional Computation: Retrieval-Augmented Genomic Foundation Models with Gengram},
author={Genos Team and Xu, Huinan and Feng, Xuyang and Chen, Junhong and Liu, Junchen and Deng, Kaiwen and Ding, Kai and Long, Shengning and Shuai, Jiaxue and Li, Zhaorong and Liu, Shiping and Xue, Guirong and Xiao, Zhan},
journal={arXiv preprint arXiv:2601.22203},
year={2026}
}
7. Contact
For project-related questions, please open an issue. You can also contact the Genos Team at [email protected].
