How Do Decoder-Only LLMs Perceive Users? Rethinking Attention Masking for User Representation Learning
Abstract
Decoder-only large language models are increasingly used as behavioral encoders for user representation learning, yet the impact of attention masking on the quality of user embeddings remains underexplored. In this work, we conduct a systematic study of causal, hybrid, and bidirectional attention masks within a unified contrastive learning framework trained on large-scale real-world Alipay data that integrates long-horizon heterogeneous user behaviors. To improve training dynamics when transitioning from causal to bidirectional attention, we propose Gradient-Guided Soft Masking, a gradient-based pre-warmup applied before a linear scheduler that gradually opens future attention during optimization. Evaluated on 9 industrial user cognition benchmarks covering prediction, preference, and marketing sensitivity tasks, our approach consistently yields more stable training and higher-quality bidirectional representations compared with causal, hybrid, and scheduler-only baselines, while remaining compatible with decoder pretraining. Overall, our findings highlight the importance of masking design and training transition in adapting decoder-only LLMs for effective user representation learning. Our code is available at https://github.com/JhCircle/Deepfind-GGSM.
Community
How Do Decoder-Only LLMs Perceive Users? Rethinking Attention Masking for User Representation Learning
Hi everyone!
We're excited to share our latest academic work on adapting decoder-only LLMs for high-quality user representation learning, with a focus on solving training instability when transitioning from causal to bidirectional attention masking. Our paper is now publicly available, and we'd love to hear your feedback!
✨ Key Contributions
Unified Masking Strategy
We systematically evaluate three attention masking recipes (causal, hybrid, and bidirectional) under a unified contrastive learning framework, using 9 anonymized real-world user modeling benchmarks covering user behavior prediction, preference understanding, and intent sensitivity tasks.

Critical Training Transition Insight
We find that the transition path from causal to bidirectional attention matters as much as the final mask design: switching abruptly disrupts the model's pretrained inductive bias, leading to suboptimal performance and convergence issues.

Gradient-Guided Soft Masking (GG-SM)
We propose a two-stage training approach to mitigate this instability:
- Gradient Warm-up: Dynamically assign attention weights to future tokens during early training using gradient norms, enabling the model to gradually prioritize informative context.
- Linear Scheduler: Smoothly transition from the gradient-calibrated soft mask to full bidirectional attention, preserving pretrained knowledge while adapting to bidirectional modeling.
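To make the scheduler half of this concrete, here is a minimal NumPy sketch of a linearly opened soft attention mask. This is our own illustrative simplification, not the paper's implementation: the function names, the warm-up cutoff, and the uniform (rather than gradient-calibrated) weight on future positions are all assumptions. Causal attention corresponds to `alpha = 0`, full bidirectional attention to `alpha = 1`, and the schedule ramps between them.

```python
import numpy as np

def soft_attention_mask(seq_len: int, alpha: float) -> np.ndarray:
    """Soft mask: past/self positions are fully visible (1.0), while
    future positions are opened to degree alpha in [0, 1].
    alpha = 0 recovers a causal mask; alpha = 1 is fully bidirectional."""
    causal = np.tril(np.ones((seq_len, seq_len)))  # lower triangle incl. diagonal
    future = 1.0 - causal                          # strictly upper triangle
    return causal + alpha * future

def linear_alpha(step: int, warmup_steps: int, total_steps: int) -> float:
    """Hold alpha at 0 during the warm-up phase (where GG-SM would instead
    calibrate per-position weights from gradient norms), then ramp it
    linearly from 0 to 1 over the remaining steps."""
    if step < warmup_steps:
        return 0.0
    return min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))

# Halfway through training, future tokens are half-visible.
mask = soft_attention_mask(4, linear_alpha(step=50, warmup_steps=0, total_steps=100))
```

In practice, a mask like this would be applied as a multiplicative (or additive log-space) bias on attention scores; the gradient warm-up stage would replace the uniform `alpha * future` term with per-position weights derived from gradient norms, which this sketch omits.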
💬 Let's Discuss!
We'd love to engage with the community on:
Have you encountered training instability when adapting decoder-only LLMs for non-generative tasks like representation learning?
What tradeoffs do you prioritize between autoregressive compatibility and representational completeness in user modeling?
Feel free to drop your questions, suggestions, or reproducibility feedback below; we're happy to collaborate on further improvements!
Thanks for reading! 🙏