arxiv:2606.28057

MultiHashFormer: Hash-based Generative Language Models

Published on Jun 26

Submitted by Atsuki Yamaguchi on Jun 29


Authors:

Huiyin Xue ,

Abstract

MultiHashFormer enables hash-based autoregression in language models by representing tokens as hash signatures processed through a Hash Encoder and Hash Decoder within a Transformer framework.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Language models (LMs) represent tokens using embedding matrices that scale linearly with the vocabulary size. To constrain the parameter footprint, prior work proposes hashing many tokens into a single vector within encoder-only models. While this offers parameter efficiency, many-to-one collisions prevent its use in causal LMs. In this paper, we propose MultiHashFormer, a new framework that allows hash-based autoregression. Each token is represented as a unique hash signature, a short sequence of discrete hash IDs, generated by multiple independent hash functions. A Hash Encoder compresses this signature into a single latent vector for processing by a Transformer decoder. Then, a Hash Decoder generates the hash signature of the next token, which is then mapped back to text. We evaluate our approach at the 100M, 1B and 3B parameter scales, demonstrating that MultiHashFormer consistently outperforms standard Transformer LMs across multiple benchmarks. Furthermore, we show that our model handles multilingual vocabulary expansion with a constant parameter footprint without any modifications.
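The hash-signature idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `hash_signature` and `build_inverse`, the use of salted BLAKE2b as the "independent hash functions", and the values of `num_hashes` and `num_buckets` are all assumptions chosen for illustration. The sketch only shows how several small hash IDs together can identify a token uniquely, which is what makes mapping a predicted signature back to text possible.

```python
import hashlib


def hash_signature(token: str, num_hashes: int = 4, num_buckets: int = 1024) -> tuple[int, ...]:
    """Map a token to a short sequence of hash IDs (hypothetical scheme).

    Each of the `num_hashes` functions is BLAKE2b with a different salt,
    reduced modulo `num_buckets`. The paper's actual hash functions are
    not specified here; this is an illustrative stand-in.
    """
    ids = []
    for i in range(num_hashes):
        digest = hashlib.blake2b(
            token.encode("utf-8"),
            digest_size=8,
            salt=i.to_bytes(2, "little"),  # a distinct salt per hash function
        ).digest()
        ids.append(int.from_bytes(digest, "little") % num_buckets)
    return tuple(ids)


def build_inverse(vocab: list[str], **kwargs) -> dict[tuple[int, ...], str]:
    """Build a signature -> token table, failing if two tokens share a signature."""
    inverse: dict[tuple[int, ...], str] = {}
    for token in vocab:
        sig = hash_signature(token, **kwargs)
        if sig in inverse:
            # A single hash ID may collide; the full signature must not.
            raise ValueError(f"signature collision: {inverse[sig]!r} vs {token!r}")
        inverse[sig] = token
    return inverse


vocab = ["the", "cat", "sat", "on", "mat"]
inverse = build_inverse(vocab)
signature = hash_signature("cat")
print(signature, "->", inverse[signature])
```

With four functions of 1024 buckets each there are 1024^4 possible signatures, so two tokens can share an individual hash ID while still having distinct full signatures. How the paper guarantees uniqueness across a full vocabulary is not stated in the abstract; the collision check above is one simple way to enforce it.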


Community

atsuki-yamaguchi

Paper submitter about 10 hours ago

Token hashing has so far been confined to encoder-only architectures because of the conventional many-to-one collision problem, which breaks generative decoding. This paper addresses that limitation by representing each token as a unique, combinatorially constructed hash signature made up of multiple hash IDs, which is then processed through a cascaded predictor decoder. Notably, the architecture consistently outperforms standard language models across a wide range of scales on core language benchmarks. More importantly, it supports a substantially larger vocabulary with a fixed memory footprint while achieving performance comparable to the standard vocabulary expansion approach. This work offers a viable solution for vocabulary modularity.
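The fixed-footprint point can be made concrete with some back-of-the-envelope arithmetic. The sketch below is an assumption-laden illustration, not the paper's parameter accounting: it supposes one embedding table per hash function, each with `num_buckets` rows of width `dim`, and the helper names and the example sizes (4 hash functions, 1024 buckets, width 768) are hypothetical.

```python
def standard_embedding_params(vocab_size: int, dim: int) -> int:
    """A conventional embedding matrix has one row per vocabulary entry."""
    return vocab_size * dim


def hashed_embedding_params(num_hashes: int, num_buckets: int, dim: int) -> int:
    """Assumed layout: one table of `num_buckets` rows per hash function."""
    return num_hashes * num_buckets * dim


def signature_capacity(num_hashes: int, num_buckets: int) -> int:
    """Number of distinct signatures that the hash IDs can express."""
    return num_buckets ** num_hashes


for vocab_size in (32_000, 128_000, 512_000):
    standard = standard_embedding_params(vocab_size, dim=768)
    hashed = hashed_embedding_params(num_hashes=4, num_buckets=1024, dim=768)
    print(f"vocab={vocab_size:>7}  standard={standard:>11}  hashed={hashed:>9}")

print("distinct signatures available:", signature_capacity(4, 1024))
```

Under these assumptions the standard matrix grows linearly with the vocabulary, while the hashed tables stay the same size because their shape depends only on the number of hash functions and buckets.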