MultiHashFormer: Hash-based Generative Language Models
Abstract
MultiHashFormer enables hash-based autoregression in language models by representing tokens as hash signatures processed through a Hash Encoder and Hash Decoder within a Transformer framework.
Language models (LMs) represent tokens using embedding matrices that scale linearly with the vocabulary size. To constrain the parameter footprint, prior work proposes hashing many tokens into a single vector within encoder-only models. While this offers parameter efficiency, many-to-one collisions prevent its use in causal LMs. In this paper, we propose MultiHashFormer, a new framework that allows hash-based autoregression. Each token is represented as a unique hash signature, a short sequence of discrete hash IDs, generated by multiple independent hash functions. A Hash Encoder compresses this signature into a single latent vector for processing by a Transformer decoder. Then, a Hash Decoder generates the hash signature of the next token, which is then mapped back to text. We evaluate our approach at the 100M, 1B and 3B parameter scales, demonstrating that MultiHashFormer consistently outperforms standard Transformer LMs across multiple benchmarks. Furthermore, we show that our model handles multilingual vocabulary expansion with a constant parameter footprint without any modifications.
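The hash-signature idea from the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The salted BLAKE2b hashes, the function names (`hash_id`, `signature`, `build_signature_table`), the sizes used, and the strategy of retrying with a new seed until every signature is unique are all assumptions made for illustration; the abstract does not say how uniqueness is obtained.

```python
import hashlib


def hash_id(token_id: int, fn_index: int, num_buckets: int, seed: int = 0) -> int:
    """One of several salted hash functions (hypothetical choice: BLAKE2b)."""
    key = f"{seed}:{fn_index}:{token_id}".encode()
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_buckets


def signature(token_id: int, num_hashes: int, num_buckets: int, seed: int = 0) -> tuple:
    """Hash signature: a short sequence of discrete hash IDs, one per function."""
    return tuple(hash_id(token_id, i, num_buckets, seed) for i in range(num_hashes))


def build_signature_table(vocab_size: int, num_hashes: int, num_buckets: int,
                          max_seeds: int = 100):
    """Search for a seed under which no two tokens share a signature.

    Assumption: uniqueness is enforced by re-seeding; this is only one
    possible way to avoid many-to-one collisions.
    """
    for seed in range(max_seeds):
        table = {}
        for token_id in range(vocab_size):
            sig = signature(token_id, num_hashes, num_buckets, seed)
            if sig in table:  # two tokens collided, try the next seed
                break
            table[sig] = token_id
        else:
            return seed, table  # every token received a distinct signature
    raise ValueError("no collision-free seed found; increase num_hashes or num_buckets")


# Illustrative sizes only; they are not taken from the paper.
seed, table = build_signature_table(vocab_size=1000, num_hashes=4, num_buckets=64)
sig = signature(42, 4, 64, seed)
print(sig, "->", table[sig])  # the signature maps back to token 42
```

Because each signature is unique, a predicted signature can be mapped back to exactly one token, which is the property the abstract identifies as missing from many-to-one hashing. If one assumes an embedding table per hash function, the number of embedding rows depends on `num_hashes * num_buckets` rather than on the vocabulary size, which is consistent with the constant parameter footprint the abstract describes.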
Community
Token hashing has been confined to encoder-only architectures because many-to-one collisions break generative decoding. This paper addresses that limitation by representing each token as a unique hash signature made up of multiple hash IDs, which is then processed through a cascaded predictor decoder. Notably, the architecture consistently outperforms standard language models across a range of scales on core language benchmarks. More importantly, it supports a substantially larger vocabulary with a fixed memory footprint while achieving performance comparable to the standard vocabulary expansion approach. This work offers a viable solution for vocabulary modularity.
Models citing this paper 28
klein9692/mhf_1b_32768_4_64