TL;DR HACPO is a novel collaborative reinforcement learning framework that enables heterogeneous agents to share rollouts.
[Background] In scenarios where multiple heterogeneous agents are optimized for the same task, vanilla Reinforcement Learning (RL) paradigms constrain each agent to execute tasks independently and update based solely on its own rollouts: for essentially the same objective, each agent repeatedly generates trajectories and obtains verifiable rewards, yet these costly intermediate results are used only for self-training.
[Definition] We introduce a novel paradigm, termed Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), to overcome the inefficiencies of isolated on-policy optimization. HACRL emphasizes independent execution with collaborative optimization: agents share verified rollouts during training to enhance one another, while operating independently at inference time.
HACRL differs fundamentally from existing paradigms:
(1) LLM-based Multi-Agent Reinforcement Learning (MARL).
MARL trains agents to coordinate and jointly solve tasks through interaction within a coupled multi-agent system. In contrast, HACRL does not require coordinated execution. In many practical scenarios, only a single agent is deployed at inference time; however, we still desire that this agent benefits from knowledge acquired from other agents during training.
(2) On-/Off-Policy Distillation.
Distillation typically follows a one-directional “teacher-to-student” paradigm, often among homogeneous agents. HACRL instead enables bidirectional mutual learning among heterogeneous agents, where each agent simultaneously acts as both a knowledge provider and a learner.
[Algorithm] We introduce Heterogeneous Agent Collaborative Policy Optimization (HACPO) to eliminate this waste. HACPO is a novel collaborative multi-agent RL training paradigm that enables rollout sharing, maximizing rollout utilization and facilitating mutual knowledge transfer among heterogeneous agents. To bridge the discrepancies in capability and policy distribution, we introduce four tailored modifications, and we theoretically establish both the unbiasedness of the resulting advantage estimation and the validity of the optimization direction:
(1) Agent-Capability-Aware Advantage Estimation
(2) Model Capabilities Discrepancy Coefficient
(3) Exponential Importance Sampling
(4) Stepwise Clipping
[Performance] We evaluate HACPO across three types of heterogeneity and seven challenging mathematical reasoning benchmarks, demonstrating consistent performance improvements for all participating agents. Notably, HACPO outperforms standard GSPO by an average of 3.3% at only half the rollout cost, showing that HACPO can lower computational expense and improve accuracy at the same time. This work points to a promising new direction for efficient collaborative learning among multi-agent systems.
Figure 2. The significant differences among Multi-Agent RL, Knowledge Distillation, and the proposed HACRL. HACRL targets independent execution with collaborative optimization.
Typically, given one identical task, multiple agents execute RLVR optimization independently of one another. For essentially the same objective, they repeatedly generate trajectories and yield verifiable rewards, while these costly intermediate results are only utilized for self-training. We first formalize this setting as Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), which captures collaborative policy optimization among heterogeneous agents that execute independently at inference time.
HACRL differs fundamentally from LLM-based multi-agent reinforcement learning (MARL), which trains agents to coordinate and jointly solve tasks through interaction. It is also distinct from knowledge distillation (KD), where a typically fixed teacher transfers knowledge to a student. In contrast, HACRL focuses on independent execution with collaborative optimization: agents share verified rollouts during training, yet operate independently at inference time without coordination.
Definition of HACRL Problem: We consider the Heterogeneous Agent Collaborative Reinforcement Learning (HACRL) framework with n LLM agents.
During training, for a query $q \sim \mathcal D$, each agent $k$ independently samples $G$ candidate responses from its policy: $$Y_k(q) = \{y_{k,1}, \dots, y_{k,G}\} \sim \pi_{\theta_k}(\cdot \mid q).$$ The corresponding verifiable rewards are: $$ \mathcal R_k(q) = \{ R(y_{k,i}) \mid i = 1,\dots,G \}.$$ The objective is to optimize each agent $k \in \{1, \dots, n\}$ by maximizing: $$J^{(k)} = J^{(k)}_{homo}(Y_k(q), \mathcal R_k(q)) + J^{(k)}_{hete}(\{Y_j(q), \mathcal R_j(q)\}_{j \neq k})$$ where $J^{(k)}_{homo}$ is computed using rollouts generated by agent $k$ itself, and $J^{(k)}_{hete}$ leverages rollouts generated by other heterogeneous agents.
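As a minimal illustration of this decomposition, the shared rollout pool can be split into the two groups that feed $J^{(k)}_{homo}$ and $J^{(k)}_{hete}$. This is a sketch; the data layout and function name are assumptions, not the paper's implementation.

```python
def hacrl_objective_groups(all_rollouts, k):
    """Split shared rollouts for agent k, mirroring J^(k) = J_homo + J_hete.

    all_rollouts: dict mapping agent index j -> list of (response, reward)
    pairs sampled from pi_{theta_j}. (Hypothetical layout for illustration.)
    Returns (homo, hete): agent k's own rollouts, and the cross-agent ones.
    """
    homo = list(all_rollouts[k])  # rollouts from agent k itself -> J_homo
    hete = [pair                  # rollouts from every other agent -> J_hete
            for j, pairs in all_rollouts.items() if j != k
            for pair in pairs]
    return homo, hete
```

The key point is that the cross-agent group is consumed only during training; at inference time each agent runs alone.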
We categorize heterogeneity among distinct LLM agents into three types:
Figure 1. In HACPO, shared rollouts from multiple heterogeneous agents are leveraged for collaborative training. Built upon vanilla RL Optimization, HACPO introduces four algorithmic innovations to mitigate capability and policy distribution discrepancy.
To solve the HACRL problem, we propose Heterogeneous Agent Collaborative Policy Optimization (HACPO), which introduces four tailored modifications to bridge capability and distribution discrepancies between heterogeneous agents. Theoretically, we prove that HACPO is effective because learning from cross-agent rollouts induces an optimization direction consistent with standard on-policy learning.
We propose a capability-aware estimator that assigns a distinct inter-group advantage baseline to each agent based on its relative performance. Intuitively, the advantage of a response should be higher if it is generated by a stronger agent, and lower if it is generated by a weaker agent. Theoretically, we prove that this estimator is unbiased.
The baseline $\hat{\mu}^{(k)}_t$ is computed by: $$\hat{\mu}^{(k)}_t = \frac{1}{nG} \sum_{j=1}^n \sum_{i=1}^G \omega^{(k,j)}_t R(y^{(j)}_{t,i})$$ where $\omega^{(k,j)}_t = \hat{P}^{(k)}_t / \hat{P}^{(j)}_t$ is the capability ratio ($\hat{P}^{(k)}_t$ is the smoothed accuracy of agent $k$ at step $t$).
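The baseline above can be sketched directly from its formula. This is a hedged illustration: the function name and array layout are assumptions, but the arithmetic follows the stated definition of $\hat{\mu}^{(k)}_t$ and $\omega^{(k,j)}_t$.

```python
import numpy as np

def capability_aware_baseline(rewards, p_hat, k):
    """Capability-aware inter-group baseline for agent k (sketch).

    rewards: array of shape (n, G), rewards R(y^{(j)}_{t,i}) for each agent j.
    p_hat:   array of shape (n,), smoothed accuracies P_hat^{(j)}_t.
    k:       index of the agent whose baseline mu_hat^{(k)}_t we estimate.
    """
    n, G = rewards.shape
    omega = p_hat[k] / p_hat  # capability ratios omega^{(k,j)}_t, shape (n,)
    # mu_hat^{(k)}_t = (1 / nG) * sum_j sum_i omega^{(k,j)}_t * R(y^{(j)}_{t,i})
    return float((omega[:, None] * rewards).sum() / (n * G))
```

Rescaling by $\omega^{(k,j)}_t$ puts rewards earned by stronger or weaker agents on agent $k$'s own scale before averaging.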
To encourage learning from stronger agents while being conservative with weaker ones, we use the capability ratio to modulate the effective advantage. The capability ratio $\omega_t^{(k,j)}$ plays two complementary roles: (i) Baseline Calibration — it rescales rewards when estimating the capability-aware baseline to align reward statistics across heterogeneous agents; (ii) Gradient Modulation — it serves as a learning-rate-like factor that amplifies gradients from stronger agents and attenuates those from weaker ones. The resulting modulated advantage is then used in place of the original estimate.
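One plausible instantiation of this modulation, assuming the ratio acts as a multiplicative, learning-rate-like scale on a group-normalized advantage. The exact functional form (and the orientation of the ratio) is an assumption for illustration, not taken from the paper.

```python
import numpy as np

def modulated_advantage(rewards, scale, j, eps=1e-8):
    """Hedged sketch of capability-modulated advantages.

    rewards: array (n, G) of rewards per agent; row j is the source agent.
    scale:   the capability-ratio factor applied to agent j's advantages
             (how this factor is oriented between agents is an assumption).
    """
    group = rewards[j]
    # standard group-normalized advantage for agent j's rollouts
    adv = (group - group.mean()) / (group.std() + eps)
    # the capability factor rescales the advantage like a per-source
    # learning rate, strengthening or damping the update signal
    return scale * adv
```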
Importance sampling is commonly used to correct distributional mismatches between samples generated by different policies. Following GSPO, we adopt a sequence-level importance ratio and extend it to the heterogeneous multi-agent setting. For combinations of heterogeneous agents with incompatible tokenizers, we detokenize the response into text and retokenize it using the target agent's tokenizer. Through sequence-level normalization, the slight length discrepancies arising from re-tokenization become negligible.
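A sequence-level ratio in the GSPO style can be sketched as a length-normalized log-probability difference; because each sequence log-probability is normalized by its own token count, small length drift from re-tokenization largely cancels. The function name and exact normalization are assumptions following the GSPO formulation, not the paper's code.

```python
import math

def sequence_level_ratio(sum_logp_target, len_target,
                         sum_logp_behavior, len_behavior):
    """Sequence-level importance ratio across tokenizers (sketch).

    sum_logp_target / len_target:    total log-prob and token count of the
        response re-tokenized and scored under the learning (target) agent.
    sum_logp_behavior / len_behavior: the same response under the agent that
        generated it, with its own tokenizer.
    Each log-prob is normalized by its own length, so the ratio is the
    per-token geometric mean, insensitive to small token-count differences.
    """
    return math.exp(sum_logp_target / len_target
                    - sum_logp_behavior / len_behavior)
```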
In heterogeneous settings, inter-agent policy discrepancies can be much larger than on-policy updates, making direct use of this ratio overly aggressive. To mitigate this issue, we introduce a non-gradient exponential reweighting of the importance sampling ratio, governed by a coefficient $\alpha \geq 0$ that controls the degree of conservativeness. This design biases agent $k$ toward learning from agents whose output distributions are more aligned with its own, while reducing the impact of large cross-agent distribution shifts.
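One way such an exponential reweighting could look, purely as an assumed sketch (the paper's exact form is not reproduced here): a weight that equals 1 when the cross-agent ratio is 1 (perfectly aligned distributions) and decays exponentially as the ratio departs from 1, treated as a constant so no gradient flows through it.

```python
import math

def exponential_reweight(ratio, alpha):
    """Hedged sketch of non-gradient exponential reweighting.

    ratio: sequence-level importance ratio s of a cross-agent response.
    alpha: conservativeness coefficient, alpha >= 0.
    Assumed form: w = exp(-alpha * |log s|). With alpha = 0 the weight is
    always 1 (no reweighting); larger alpha suppresses responses whose
    distribution diverges from the learner's. Apply with stop-gradient.
    """
    return math.exp(-alpha * abs(math.log(ratio)))
```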
Cross-agent importance sampling ratios fluctuate irregularly both across steps and within a step. We first apply asymmetric clipping bounds to cross-agent responses, ensuring they can only be down-weighted, never up-weighted. Then we apply a stepwise clipping strategy to prevent cross-agent rollouts from dominating late-stage updates within a batch, thereby improving training stability.
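A minimal sketch of how the two clipping mechanisms could compose; the specific bounds and the linear schedule are assumptions for illustration, not the paper's hyperparameters.

```python
def clipped_cross_agent_ratio(ratio, step, total_steps,
                              eps_low=0.2, decay=0.5):
    """Hedged sketch of asymmetric + stepwise clipping.

    Asymmetric part: the upper bound never exceeds 1.0, so a cross-agent
    response can only ever be down-weighted relative to an on-policy one.
    Stepwise part: the upper bound shrinks (here linearly, an assumed
    schedule) as updates within a batch progress, so cross-agent rollouts
    cannot dominate late-stage updates.
    """
    frac = step / max(total_steps, 1)   # progress through the batch updates
    high = 1.0 - decay * frac           # stepwise-shrinking cap, starts at 1
    low = max(high - eps_low, 0.0)      # asymmetric lower bound
    return min(max(ratio, low), high)
```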
We evaluate HACPO on 7.5k high-quality math questions from the MATH dataset and test on seven challenging benchmarks. To rigorously validate the effectiveness of our collaborative paradigm, we compare HACPO against three distinct types of baselines:
HACPO demonstrates consistent and superior performance improvements across three distinct heterogeneity settings:
Table 1. Main results across three heterogeneity settings. We compare our method against Standard Single-Agent Baselines (GRPO, GSPO), a Resource-Equivalent Baseline (GSPO×2), and a Naive multi-agent rollout-sharing baseline (Naive).
Figure 3. Training curves of GSPO and HACPO. HACPO demonstrates faster convergence and higher final performance compared to single-agent baselines.
We validate the necessity of each component through extensive ablation studies.
Agent-Capability-Aware Advantage Estimation: Removing this module leads to performance degradation due to systematic bias in advantage estimation caused by capability discrepancies.
Table 2. Ablation of Advantage Estimator
Model Capabilities Discrepancy Coefficient: Essential for modulating gradients—amplifying signals from stronger agents while attenuating noise from weaker ones.
Table 3. Ablation of Model Capabilities Discrepancy Coefficient
Exponential Importance Sampling: Examining the impact of different values of $\alpha$ on the performance of HACPO.
Table 4. Impact of Exponential Importance Sampling ($\alpha$)
Stepwise Clipping: Crucial for stabilizing collaborative learning. Without it, high-variance cross-agent responses can destabilize the training process.
Figure 4. The Ablation of Stepwise Clipping. Removing clipping or the stepwise schedule leads to instability or suboptimal convergence.
If you find our work useful, please cite:
@misc{zhang2026heterogeneousagentcollaborativereinforcement,
title={Heterogeneous Agent Collaborative Reinforcement Learning},
author={Zhixia Zhang and Zixuan Huang and Xin Xia and Deqing Wang and Fuzhen Zhuang and Shuai Ma and Ning Ding and Yaodong Yang and Jianxin Li and Yikun Ban},
year={2026},
eprint={2603.02604},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2603.02604},
}