Heterogeneous Agent Collaborative Reinforcement Learning

Zhixia Zhang1*, Zixuan Huang1 2*, Xin Xia2, Deqing Wang1, Fuzhen Zhuang1, Shuai Ma1, Ning Ding3, Yaodong Yang4, Jianxin Li1, Yikun Ban1†
1Beihang University   2Bytedance China   3Tsinghua University   4Peking University
*Equal Contribution   †Corresponding Author

TL;DR  HACPO is a novel collaborative reinforcement learning framework that enables heterogeneous agents to share rollouts.


Introduction

[Background] In scenarios where multiple heterogeneous agents are optimized for the same task, vanilla Reinforcement Learning (RL) paradigms constrain each agent to execute tasks independently and to update based solely on its own rollouts. For essentially the same objective, the agents repeatedly generate trajectories and obtain verifiable rewards, yet these costly intermediate results are used only for self-training.

[Definition] We introduce a novel paradigm, termed Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), to overcome the inefficiencies of isolated on-policy optimization. HACRL emphasizes independent execution with collaborative optimization: agents share verified rollouts during training to enhance one another, while operating independently at inference time. HACRL differs fundamentally from existing paradigms:
(1) LLM-based Multi-Agent Reinforcement Learning (MARL). MARL trains agents to coordinate and jointly solve tasks through interaction within a coupled multi-agent system. In contrast, HACRL does not require coordinated execution. In many practical scenarios, only a single agent is deployed at inference time; however, we still desire that this agent benefits from knowledge acquired from other agents during training.
(2) On-/Off-Policy Distillation. Distillation typically follows a one-directional “teacher-to-student” paradigm, often among homogeneous agents. HACRL instead enables bidirectional mutual learning among heterogeneous agents, where each agent simultaneously acts as both a knowledge provider and a learner.

[Algorithm] To eliminate this waste, we introduce Heterogeneous Agent Collaborative Policy Optimization (HACPO), a novel collaborative multi-agent RL training paradigm that enables rollout sharing, maximizing rollout utilization and facilitating mutual knowledge transfer among heterogeneous agents. To bridge the discrepancies in capability and policy distribution, we introduce four tailored modifications, and we theoretically prove both the unbiasedness of the resulting advantage estimation and the validity of the optimization direction:
(1) Agent-Capability-Aware Advantage Estimation
(2) Model Capabilities Discrepancy Coefficient
(3) Exponential Importance Sampling
(4) Stepwise Clipping

[Performance] We evaluate HACPO across three types of heterogeneity and seven challenging mathematical reasoning benchmarks, demonstrating consistent performance improvements for all participating agents. Notably, HACPO outperforms standard GSPO by an average of 3.3% while using only half the rollout cost, showing that it both lowers computational expense and enhances performance. This work points to a promising new direction for efficient collaborative learning among multi-agent systems.

Heterogeneous Agent Collaborative Reinforcement Learning

HACRL vs MARL vs KD

Figure 2. The significant differences among Multi-Agent RL, Knowledge Distillation, and the proposed HACRL. HACRL targets independent execution with collaborative optimization.

Typically, given one identical task, multiple agents execute RLVR optimization independently of one another. For essentially the same objective, they repeatedly generate trajectories and obtain verifiable rewards, while these costly intermediate results are used only for self-training. We first formalize this setting as Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), which captures collaborative policy optimization among heterogeneous agents that execute independently at inference time.

HACRL differs fundamentally from LLM-based multi-agent reinforcement learning (MARL), which trains agents to coordinate and jointly solve tasks through interaction. It is also distinct from knowledge distillation (KD), where a typically fixed teacher transfers knowledge to a student. In contrast, HACRL focuses on independent execution with collaborative optimization: agents share verified rollouts during training, yet operate independently at inference time without coordination.

Definition of HACRL Problem: We consider the Heterogeneous Agent Collaborative Reinforcement Learning (HACRL) framework with n LLM agents.

During training, for a query $q \sim \mathcal D$, each agent $k$ independently samples $G$ candidate responses from its policy: $$Y_k(q) = \{y_{k,1}, \dots, y_{k,G}\} \sim \pi_{\theta_k}(\cdot \mid q).$$ The corresponding verifiable rewards are: $$ \mathcal R_k(q) = \{ R(y_{k,i}) \mid i = 1,\dots,G \}.$$ The objective is to optimize each agent $k \in \{1, \dots, n\}$ by maximizing: $$J^{(k)} = J^{(k)}_{homo}(Y_k(q), \mathcal R_k(q)) + J^{(k)}_{hete}(\{Y_j(q), \mathcal R_j(q)\}_{j \neq k}),$$ where $J^{(k)}_{homo}$ is computed using rollouts generated by agent $k$ itself, and $J^{(k)}_{hete}$ leverages rollouts generated by other heterogeneous agents.
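The objective decomposition above can be sketched in a few lines of Python. This is a minimal illustration with hypothetical names (`hacrl_objective`, `homo_loss`, `hete_loss` are not from the paper's code): agent $k$'s total objective is its own-rollout term plus one term per other agent's shared rollouts.

```python
# Minimal sketch of the HACRL objective decomposition (hypothetical names,
# not the authors' implementation). rollouts[j] and rewards[j] hold the G
# responses and verifiable rewards produced by agent j for the same query.

def hacrl_objective(k, rollouts, rewards, homo_loss, hete_loss):
    """Total objective for agent k: self term + cross-agent terms."""
    n = len(rollouts)
    # J_homo: computed on agent k's own rollouts.
    j_homo = homo_loss(rollouts[k], rewards[k])
    # J_hete: accumulated over every other agent's shared rollouts.
    j_hete = sum(hete_loss(rollouts[j], rewards[j])
                 for j in range(n) if j != k)
    return j_homo + j_hete
```

The point of the sketch is only the bookkeeping: the cross-agent term sums over all $j \neq k$, so every verified rollout in the batch contributes to every agent's update.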

We categorize heterogeneity among distinct LLM agents into three types:

  • Heterogeneous State: Agents differ only in optimization state.
  • Heterogeneous Size: Agents share architecture but differ in parameter size.
  • Heterogeneous Model: Agents differ in architecture, tokenizer, and training objectives.

Heterogeneous Agent Collaborative Policy Optimization (HACPO)

Workflow of HACPO

Figure 1. In HACPO, shared rollouts from multiple heterogeneous agents are leveraged for collaborative training. Built upon vanilla RL Optimization, HACPO introduces four algorithmic innovations to mitigate capability and policy distribution discrepancy.

To solve the HACRL problem, we propose Heterogeneous Agent Collaborative Policy Optimization (HACPO), which introduces four tailored modifications to bridge capability and distribution discrepancies between heterogeneous agents. Theoretically, we prove that HACPO is effective because learning from cross-agent rollouts induces an optimization direction consistent with standard on-policy learning.

1. Agent-Capability-Aware Advantage Estimation

We propose a capability-aware estimator that assigns a distinct inter-group advantage baseline to each agent based on its relative performance. Intuitively, the advantage of a response should be higher if it is generated by a stronger agent and lower if it is generated by a weaker agent. Theoretically, we prove that this estimator is unbiased.

In training step $t$, the advantage of the $i$-th response for agent $k$ is: $$A^{(k)}_{t,i} = \frac{R(y^{(k)}_{t,i}) - \hat{\mu}^{(k)}_t}{\sigma_{t,joint}}$$

The baseline $\hat{\mu}^{(k)}_t$ is computed by: $$\hat{\mu}^{(k)}_t = \frac{1}{nG} \sum_{j=1}^n \sum_{i=1}^G \omega^{(k,j)}_t R(y^{(j)}_{t,i})$$ where $\omega^{(k,j)}_t = \hat{P}^{(k)}_t / \hat{P}^{(j)}_t$ is the capability ratio ($\hat{P}^{(k)}_t$ is the smoothed accuracy of agent $k$ at step $t$).
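A numpy sketch of this estimator, under stated assumptions: `R` is the $(n, G)$ matrix of rewards at step $t$, `P` the smoothed accuracies $\hat{P}^{(j)}_t$, and we take $\sigma_{t,joint}$ to be the standard deviation over all agents' rewards jointly (the exact normalizer is an assumption here, not confirmed by the text above).

```python
import numpy as np

# Sketch of capability-aware advantage estimation (assumed shapes; not the
# release code). R[j, i] is reward R(y_{t,i}^{(j)}); P[j] is agent j's
# smoothed accuracy. omega[j] = P[k] / P[j] rescales agent j's rewards into
# agent k's capability frame before averaging into the baseline.

def capability_aware_advantages(R, P, k):
    R = np.asarray(R, dtype=float)   # shape (n, G)
    P = np.asarray(P, dtype=float)   # shape (n,)
    n, G = R.shape
    omega = P[k] / P                             # capability ratios w.r.t. agent k
    mu_k = (omega[:, None] * R).sum() / (n * G)  # capability-aware baseline
    sigma_joint = R.std()                        # assumed joint std over all rollouts
    return (R[k] - mu_k) / sigma_joint           # advantages for agent k's G responses
```

With two agents of accuracies 0.5 and 1.0, the stronger agent's rewards are downweighted (by 0.5) before entering the weaker agent's baseline, so the weaker agent is not penalized for the capability gap.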

2. Model Capabilities Discrepancy Coefficient

To encourage learning from stronger agents while being conservative with weaker ones, we use the capability ratio to modulate the effective advantage. The capability ratio $\omega_t^{(k,j)}$ plays two complementary roles: (i) Baseline Calibration — it rescales rewards when estimating the capability-aware baseline to align reward statistics across heterogeneous agents; (ii) Gradient Modulation — it serves as a learning-rate-like factor that amplifies gradients from stronger agents and attenuates those from weaker ones. The modulated advantage is:

$$\tilde{A}^{(k)}_{t,i} = \begin{cases} A^{(k)}_{t,i} & y^{(k)}_{t,i} \in D^{(k)}_t \\ \omega^{(j,k)}_t A^{(j)}_{t,i} & y^{(j)}_{t,i} \in D^{(j)}_t, j \neq k \end{cases}$$
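The piecewise modulation reduces to a one-branch helper. A sketch with a hypothetical function name, assuming `P[j]` again holds the smoothed accuracy $\hat{P}^{(j)}_t$:

```python
# Sketch of advantage modulation by the capability ratio (hypothetical
# helper, not the authors' code). Agent k keeps its own advantages
# unchanged; a cross-agent advantage from agent j is scaled by
# omega^{(j,k)} = P[j] / P[k], amplifying signals from stronger agents
# and attenuating those from weaker ones.

def modulated_advantage(A, P, j, k):
    if j == k:                  # agent k's own rollout: keep as-is
        return A
    return (P[j] / P[k]) * A    # cross-agent rollout from agent j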

3. Exponential Importance Sampling

Importance sampling is commonly used to correct distributional mismatches between samples generated by different policies. Following GSPO, we adopt a sequence-level importance ratio and extend it to the heterogeneous multi-agent setting. For combinations of heterogeneous agents with incompatible tokenizers, we detokenize the response into text and retokenize it using the target agent's tokenizer. Through sequence-level normalization, the slight length discrepancies arising from re-tokenization become negligible.

In heterogeneous settings, inter-agent policy discrepancies can be much larger than on-policy updates, making direct use of this ratio overly aggressive. To mitigate this issue, we introduce a non-gradient exponential reweighting to the importance sampling ratio. This design biases agent $k$ toward learning from agents whose output distributions are more aligned with its own, while reducing the impact of large cross-agent distribution shifts.

$$\tilde{s}^{(k,j)}_{t,i} = s^{(k,j)}_{t,i} \cdot (\text{sg}[s^{(k,j)}_{t,i}])^\alpha$$

where $\alpha \geq 0$ controls the degree of conservativeness.
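The reweighting itself is a one-liner; the subtlety is that the extra factor is held constant with respect to gradients. A sketch (in an autograd framework one would presumably write something like `s * s.detach() ** alpha`; here numpy only computes the value, so the stop-gradient is represented by a plain copy):

```python
import numpy as np

# Sketch of exponential importance-sampling reweighting (assumed form).
# s is the sequence-level ratio pi_theta_k(y|q) / pi_theta_j(y|q);
# sg[.] denotes stop-gradient, so the factor s**alpha scales the signal
# without contributing gradient terms of its own.

def exp_reweighted_ratio(s, alpha):
    sg_s = np.asarray(s, dtype=float)   # stands in for a stop-gradient copy
    return s * sg_s ** alpha
```

Ratios below 1 (poorly aligned distributions) are suppressed increasingly hard as $\alpha$ grows, while ratios near 1 are nearly unchanged; $\alpha = 0$ recovers the plain ratio.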

4. Stepwise Clipping

Cross-agent importance sampling ratios fluctuate irregularly both across steps and within a step. We first apply asymmetric clipping bounds to cross-agent responses so that they can only be downweighted, never upweighted. We then apply a stepwise clipping strategy that prevents cross-agent rollouts from dominating late-stage updates within a batch, thereby improving training stability.

$$\text{clip}(s^{(k,j)}_{t,i}) = \text{clip}(s^{(k,j)}_{t,i},\ 1 - \delta + m \cdot \delta_{step},\ 1.0)$$ where $m$ denotes the number of parameter updates performed so far within the current step and $\delta_{\mathrm{step}}$ denotes the per-update tightening factor.
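The schedule can be sketched directly. This is an illustrative helper with assumed default values for $\delta$ and $\delta_{step}$ (the paper's actual hyperparameters are not stated here); `m` counts parameter updates so far within the current step:

```python
import numpy as np

# Sketch of asymmetric, stepwise-tightening clipping (assumed defaults for
# delta and delta_step). The upper bound is fixed at 1.0, so cross-agent
# ratios can only be downweighted; the lower bound rises by delta_step after
# each parameter update m within the current step, tightening the clip range
# as late-stage updates approach.

def stepwise_clip(s, m, delta=0.2, delta_step=0.05):
    low = 1.0 - delta + m * delta_step   # lower bound tightens with m
    return float(np.clip(s, low, 1.0))   # upper bound fixed: never upweight
```

For example, with the assumed defaults the clip interval is $[0.8, 1.0]$ at $m=0$ and $[0.9, 1.0]$ at $m=2$, so a late cross-agent rollout is constrained more tightly than an early one.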

Experimental Results

Experimental Setup & Baselines

We evaluate HACPO on 7.5k high-quality math questions from the MATH dataset and test on seven challenging benchmarks. To rigorously validate the effectiveness of our collaborative paradigm, we compare HACPO against three distinct types of baselines:

  • Standard Single-Agent Baselines (GRPO, GSPO): Establish benchmarks for isolated training performance.
  • Resource-Equivalent Baseline (GSPO×2): A single-agent setting with double rollouts and updates. This rules out the impact of increased data volume, verifying that gains come from collaboration rather than just more compute.
  • Naive Collaborative Baseline (Naive): A multi-agent setting with simple rollout sharing but lacking our specific algorithmic innovations (Capability-Aware Estimation, Discrepancy Coefficient, etc.).

Main Results

HACPO demonstrates consistent and superior performance improvements across three distinct heterogeneity settings:

  • Heterogeneous State (Qwen3-4B + Qwen3-4B-Instruct): Even when agents differ only by optimization state, HACPO enables the stronger agent (Instruct) to benefit from the weaker one's complementary exploration signals (alternative reasoning paths and informative errors).
  • Heterogeneous Size (Qwen3-1.7B-Base + Qwen3-4B-Base): Both models improve significantly. The smaller model serves as a distinct explorer, facilitating bidirectional knowledge transfer that boosts the larger model's performance beyond self-training limits.
  • Heterogeneous Model (Qwen3-4B-Base + Llama3.2-3B-Instruct): Despite substantial differences in architecture, tokenizers, and training objectives, HACPO successfully extracts transferable knowledge from cross-model rollouts, improving both agents.

Table 1. Main results across three heterogeneity settings. We compare our method against Standard Single-Agent Baselines (GRPO, GSPO), a Resource-Equivalent Baseline (GSPO×2), and a Naive multi-agent rollout-sharing baseline (Naive).



Figure 3. Training curves of GSPO and HACPO. HACPO demonstrates faster convergence and higher final performance compared to single-agent baselines.

Ablation Studies

We validate the necessity of each component through extensive ablation studies.

Agent-Capability-Aware Advantage Estimation: Removing this module leads to performance degradation due to systematic bias in advantage estimation caused by capability discrepancies.


Table 2. Ablation of Advantage Estimator


Model Capabilities Discrepancy Coefficient: Essential for modulating gradients—amplifying signals from stronger agents while attenuating noise from weaker ones.


Table 3. Ablation of Model Capabilities Discrepancy Coefficient


Exponential Importance Sampling: We examine how different values of $\alpha$ affect HACPO's performance.


Table 4. Impact of Exponential Importance Sampling ($\alpha$)


Stepwise Clipping: Crucial for stabilizing collaborative learning. Without it, high-variance cross-agent responses can destabilize the training process.


Figure 4. The Ablation of Stepwise Clipping. Removing clipping or the stepwise schedule leads to instability or suboptimal convergence.

Citation

If you find our work useful, please cite:

@misc{zhang2026heterogeneousagentcollaborativereinforcement,
  title={Heterogeneous Agent Collaborative Reinforcement Learning},
  author={Zhixia Zhang and Zixuan Huang and Xin Xia and Deqing Wang and Fuzhen Zhuang and Shuai Ma and Ning Ding and Yaodong Yang and Jianxin Li and Yikun Ban},
  year={2026},
  eprint={2603.02604},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2603.02604},
}