arxiv:2602.12566

To Mix or To Merge: Toward Multi-Domain Reinforcement Learning for Large Language Models

Published on Feb 13

Abstract

AI-generated summary: Reinforcement Learning with Verifiable Rewards enables Large Language Models to achieve expert-level performance in specific domains, with multi-domain training paradigms showing minimal interference and synergistic effects in reasoning-intensive tasks.

Reinforcement Learning with Verifiable Rewards (RLVR) plays a key role in stimulating the explicit reasoning capability of Large Language Models (LLMs). RLVR can achieve expert-level performance in specific domains such as coding or math. When a general, multi-domain, expert-level model is required, we need to carefully consider how RLVR across different domains interacts. Current state-of-the-art models mainly employ two training paradigms for multi-domain RLVR: mixed multi-task RLVR, and separate RLVR followed by model merging. However, most prior work does not provide a detailed comparison and analysis of these paradigms. To this end, we choose multiple commonly used high-level tasks (e.g., math, coding, science, instruction following, and agent) as our target domains and design extensive qualitative and quantitative experiments using open-source datasets. We find that RLVR across domains exhibits little mutual interference, and that reasoning-intensive domains are mutually synergistic. Furthermore, we analyze the internal mechanisms behind these mutual gains from the perspectives of weight space geometry, information constraints, model prediction behavior, and self-verification. The project is named M2RL, which stands for Mixed multi-task training or separate training followed by model Merging for Reinforcement Learning, and its homepage is at https://github.com/Mosi-AI/M2RL.
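To make the two paradigms named in the abstract concrete, here is a minimal toy sketch in Python. It is not the authors' implementation. The abstract does not say which mixing schedule or merging method the paper uses, so the sketch assumes round-robin interleaving for the mixed paradigm and a plain average of weight deltas (a simple task-arithmetic style merge) for the merging paradigm. The function names `mixed_stream` and `merge_experts`, the `scale` parameter, and the tiny weight dictionaries are all hypothetical.

```python
from itertools import zip_longest
from typing import Dict, List, Tuple

Weights = Dict[str, List[float]]


def mixed_stream(domain_prompts: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Paradigm 1 (assumed schedule): interleave prompts from all domains
    round-robin so that a single RLVR run sees every domain."""
    tagged = [[(domain, p) for p in prompts]
              for domain, prompts in domain_prompts.items()]
    stream: List[Tuple[str, str]] = []
    for group in zip_longest(*tagged):
        # zip_longest pads shorter domains with None; skip the padding.
        stream.extend(item for item in group if item is not None)
    return stream


def merge_experts(base: Weights, experts: List[Weights],
                  scale: float = 1.0) -> Weights:
    """Paradigm 2 (assumed method): add the averaged per-expert weight
    deltas (expert minus base) back onto the shared base model."""
    merged = {name: list(values) for name, values in base.items()}
    for expert in experts:
        for name, base_values in base.items():
            deltas = [e - b for e, b in zip(expert[name], base_values)]
            merged[name] = [m + scale * d / len(experts)
                            for m, d in zip(merged[name], deltas)]
    return merged


# Toy data: one two-element "layer" per model, purely illustrative.
base_model = {"layer": [0.0, 1.0]}
math_expert = {"layer": [0.5, 1.0]}   # hypothetical result of math-only RLVR
code_expert = {"layer": [0.0, 2.0]}   # hypothetical result of code-only RLVR

stream = mixed_stream({"math": ["m1", "m2"], "code": ["c1"]})
merged_model = merge_experts(base_model, [math_expert, code_expert])
print(stream)
print(merged_model)
```

In this toy setup the mixed paradigm produces one training stream for a single run, while the merging paradigm combines independently trained experts after the fact without any further training. Real systems would operate on full model checkpoints and may use more elaborate merging rules than a plain average.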

Community

Really great work! Could you open-source the models after training and before merging? This would greatly help others working on related model merging!


