DiffusionGemma-26B-A4B-it-Infinite-Context: External Evidence Memory for the Diffusion LLM Era
How many tokens can the model fit into one prompt?
NZFC-GRAM asks a different product question:
Can an AI system prove which memory it is allowed to use, which memory it must ignore, and which evidence actually supports the answer?
That distinction matters.
Google's diffusiongemma-26B-A4B-it introduces a different generation direction for open models: discrete diffusion over text, built on the Gemma 4 26B A4B Mixture-of-Experts architecture. The official model card describes it as a multimodal open-weights model for text, image, and video inputs, with long context up to 256K tokens and a block-diffusion generation design. NZFC-GRAM adds a separate layer on top: external memory, large-document retrieval, deletion boundaries, scoped evidence filtering, and bounded answer generation.
The preview repository is named:
DiffusionGemma-26B-A4B-it-Infinite-Context
The name is intentionally bold.
The technical boundary is intentionally precise:
Infinite-Context = external evidence context
not native unlimited model context
What this release is
This repository is not a new base model.
It is an NZFC-GRAM runtime overlay for:
google/diffusiongemma-26B-A4B-it
It does not redistribute Google model weights. It does not modify the base model. It does not claim that DiffusionGemma itself has native infinite context.
Instead, it connects DiffusionGemma to a governed external evidence layer:
external memory
-> scoped retrieval
-> tombstone filtering
-> malicious-memory redaction
-> large-document indexing
-> bounded evidence pack
-> optional DiffusionGemma generation
The guiding principle is simple:
Memory is evidence, not instruction.
Why DiffusionGemma is a strong match for NZFC-GRAM
DiffusionGemma is different from a standard autoregressive LLM.
The official Hugging Face model card describes DiffusionGemma as a generative model by Google DeepMind, based on the 26B A4B Mixture-of-Experts Gemma 4 architecture. It generates tokens using discrete diffusion, handles text, image, and video inputs, and is designed for high-speed generation. The Gemma 4 documentation and model cards also describe long context support up to 256K tokens for the 26B A4B class.
That makes it a natural partner for NZFC-GRAM.
DiffusionGemma provides the high-capacity working context and generation engine.
NZFC-GRAM provides the memory boundary.
DiffusionGemma:
large native working context
diffusion-style generation
multimodal input
fast block generation
NZFC-GRAM:
external memory
large-document evidence retrieval
scope isolation
tombstone deletion
malicious-memory redaction
bounded evidence packs
The two layers do not compete. They solve different problems.
DiffusionGemma answers from a prompt.
NZFC-GRAM decides what evidence is allowed into that prompt.
Long context is not the same as governed memory
A large native context window is useful. But it does not automatically solve long-term memory.
A memory system still needs to answer questions like:
Was this memory deleted?
Does this memory belong to this user?
Does it belong to this project?
Is this memory a prompt injection attempt?
Is this private fact actually supported by evidence?
Is this document chunk relevant enough to include?
Should this memory be treated as instruction or only as evidence?
NZFC-GRAM is designed around those questions.
It does not simply append everything to the prompt.
It performs a governed readout:
query
-> retrieve candidate evidence
-> filter by user/project/session scope
-> filter tombstoned memory
-> redact untrusted injection-like memory
-> build bounded evidence pack
-> generate answer
This is why the repo uses the phrase:
Infinite-Context
but defines the mechanism as:
external evidence context
The base model still has its native context limit. NZFC-GRAM extends the usable memory surface through retrieval and evidence governance, not by pretending the model has unlimited native context.
The core mechanism
The runtime can be summarized as a query-conditioned evidence readout.
For a user query (q), the system does not push the whole memory archive into the prompt. Instead, it constructs a bounded evidence pack:
[ E_q = B_{\text{scope, tombstone, trust}} , R_q(A) ]
where:
A = external memory and document archive
R_q = query-conditioned retrieval operator
B = boundary filter for scope, deletion, trust, and safety
E_q = bounded evidence pack
Then the generation model receives:
answer = G(q, E_q)
The important point is not that the archive becomes infinite prompt text.
The point is that the archive becomes externally readable evidence under a boundary condition.
Runtime-only validation is already passing
The latest repository update includes the full NZFC-GRAM runtime root assets required for local initialization:
runtime/
meta/
memory_tensors/
nzfc_gram_runtime/
validation/
examples/
A fresh Hugging Face download was validated with a runtime-only smoke test.
That test did not load the 26B base model. It validated the NZFC-GRAM overlay layer.
The runtime-only validation confirmed:
repo-root runtime assets found
meta assets found
memory_tensors assets found
NZFC runtime initializes
exact-slot memory recall works
large-document retrieval works
tombstone retrieval guard works
validation script can be executed directly
The smoke test result included:
{
"model_loaded": false,
"runtime_only": true,
"repo_root_runtime_exists": true,
"repo_root_meta_exists": true,
"repo_root_memory_tensors_exists": true,
"exact_slot_answer": "PROJECT_CODE_RUNTIME_VALIDATION",
"exact_slot_passed": true,
"large_document_chunk_count": 3,
"large_document_query_count": 2,
"large_document_passed": true,
"tombstone_test": {
"available": true,
"passed": true,
"before_found": true,
"after_found": false,
"tombstoned": 1
},
"technical_boundary": "external evidence context, not native unlimited model context"
}
This means the repository is no longer just a naming preview. It is a runnable NZFC-GRAM overlay package with runtime assets, examples, validation scripts, and evidence documentation.
What the runtime does before generation
NZFC-GRAM is intentionally model-agnostic at the memory-governance layer.
Before any generation call, the runtime can operate over:
SQLite local memory
static archive metadata
large-document chunks
legal-document article sections
retrieved evidence cards
Then it constructs a bounded evidence pack.
That evidence pack is what the generation model sees.
This is the key distinction:
Bad design:
dump the whole memory or document into the prompt
NZFC-GRAM design:
retrieve only scoped, active, relevant, non-malicious evidence
For large documents, the intended path is:
ingest
-> chunk
-> SQLite FTS5 index
-> query-time retrieval
-> bounded evidence pack
-> answer
A 100MB+ document should not be inserted directly into the prompt.
It should be indexed once, then queried as evidence.
Why this matters for DiffusionGemma
DiffusionGemma is designed for fast generation and long working context. The base model is a new kind of open model that explores text generation through diffusion rather than a purely token-by-token autoregressive path.
That speed profile is useful for interactive systems.
But interactive systems need memory discipline.
A fast model that reads unsafe memory is still unsafe.
A long-context model that reads deleted memory is still wrong.
A multimodal model that reads every document chunk without governance can still hallucinate unsupported claims.
NZFC-GRAM provides the missing boundary layer:
What can be read?
What must be ignored?
What was deleted?
What is evidence?
What is instruction?
What is unsupported?
That is where the synergy is strongest.
Example: exact memory without loading the base model
The runtime can answer deterministic memory questions without loading DiffusionGemma.
Example stored memory:
The project high-frequency test code is PROJECT_CODE_RUNTIME_VALIDATION.
Question:
What was the project high-frequency test code? Answer only with the code.
Answer:
PROJECT_CODE_RUNTIME_VALIDATION
This is handled by the exact-slot mapper.
It is intentionally strict. It only triggers on short, explicit exact-recall questions. Broad prompts such as “explain how memory governance works” continue through the evidence and generation pipeline.
Example: tombstone behavior
Deleted memory should not remain active evidence.
The runtime-only validation tested this:
before deletion: secret found
after tombstone: secret not found
The tombstone guard filters inactive or deleted MEM_* records from low-level retrieval.
That matters because deletion should not only be a generation-layer behavior. It should also apply at the retrieval API boundary.
Example: large-document evidence
A small runtime-only document test confirmed:
large_document_chunk_count: 3
large_document_query_count: 2
large_document_passed: true
The point is not the size of this smoke test.
The point is that the repository includes the working path:
document -> chunk -> index -> retrieve evidence
That is the path that scales to larger documents.
Repository structure
nzfc_gram_runtime/ Python runtime package
runtime/ Hybrid exact-recall runtime assets
meta/ Static archive metadata
memory_tensors/ Static archive tensor manifests and assets
docs/ Architecture and technical boundary notes
examples/ Runtime-only and optional model-load examples
validation/ Validation scripts
validation_evidence/ Saved validation evidence
release_notes/ Release notes
Quick start
Clone the repo:
git lfs install
git clone https://huggingface.co/SingularityPrinciple/DiffusionGemma-26B-A4B-it-Infinite-Context
cd DiffusionGemma-26B-A4B-it-Infinite-Context
pip install -r requirements.txt
Run the runtime-only validation:
python validation/run_runtime_only_smoke.py
Expected result:
[PASS] runtime-only smoke passed
Run the runtime-only exact-memory example:
python examples/high_frequency_multi_context_runtime_only.py
Run the runtime-only large-document example:
python examples/large_document_runtime_only.py
Optional model-load check:
LOAD_MODEL=1 python examples/optional_diffusiongemma_model_load_check.py
The optional model-load check requires hardware capable of loading google/diffusiongemma-26B-A4B-it.
What has not been claimed yet
This preview is deliberately transparent.
The following have not yet been claimed by this repository:
full DiffusionGemma 26B model-load validation
multimodal generation validation
native 256K context stress test
production serving benchmark
vLLM/SGLang deployment benchmark
The current validated claim is narrower and stronger:
The NZFC-GRAM runtime overlay initializes from a fresh HF download and passes runtime-only memory, document, and tombstone validation.
What this is not
Not native infinite context
Not internal infinite model memory
Not a claim that DiffusionGemma itself has unlimited context
Not a zero-hallucination guarantee
Not legal advice
Not a production security certification
Not affiliated with Google
Not a redistribution of Google model weights
Why this is still worth releasing now
DiffusionGemma changes the generation side.
NZFC-GRAM changes the memory boundary.
The combination is useful because next-generation models will not only need larger context windows. They will need governed evidence pipelines.
The future of long-context AI is not just:
more tokens
It is:
more controlled evidence
This repository is a preview of that direction.
One-line summary
DiffusionGemma-26B-A4B-it-Infinite-Context is an NZFC-GRAM runtime overlay for external evidence context around Google's DiffusionGemma 26B A4B-IT. It includes runtime assets, scoped memory, exact-slot recall, tombstone filtering, large-document retrieval, validation scripts, and runtime-only validation evidence. The title is marketing-facing; the technical mechanism is external evidence context, not native unlimited model context.
