Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces

Basic Information

Abstract

English

As large language model (LLM) agents evolve from isolated tool users into coordinated teams, reinforcement learning (RL) must optimize not only individual actions but also how work is spawned, delegated, communicated, aggregated, and stopped. This paper studies RL for LLM-based multi-agent systems through orchestration traces: temporal interaction graphs whose events include sub-agent spawning, delegation, communication, tool use, return, aggregation, and stopping decisions. Using this lens, we identify three technical axes. First, reward design spans eight families, including orchestration rewards for parallelism speedup, split correctness, and aggregation quality. Second, reward and credit signals attach to eight credit- or signal-bearing units from token to team; explicit counterfactual message-level credit remains especially sparse in our curated pool. Third, orchestration learning decomposes into five sub-decisions: when to spawn, whom to delegate to, how to communicate, how to aggregate, and when to stop. In our curated pool as of May 4, 2026, we found no explicit RL training method for the stopping decision. We connect academic methods to public industrial evidence from Kimi Agent Swarm, OpenAI Codex, and Anthropic Claude Code. The resulting scale gap is a gap between publicly reported deployment envelopes and open academic evaluation regimes, not independent verification of industrial training traces. We release the artifact at https://github.com/xxzcc/awesome-llm-mas-rl, including an 84-entry tagged paper pool, a 32-record exclusion log, scripted corpus statistics, and a minimal JSON schema for replayable orchestration traces.

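Chinese (translated)

As large language model agents evolve from isolated tool users into coordinated teams, reinforcement learning must optimize not only individual actions but also how work is spawned, delegated, communicated, aggregated, and stopped. This paper studies RL for LLM-based multi-agent systems through orchestration traces: temporal interaction graphs whose events include sub-agent spawning, delegation, communication, tool use, return, aggregation, and stopping decisions. Through this lens, the authors identify three technical axes. First, reward design spans eight families, including orchestration rewards for parallelism speedup, split correctness, and aggregation quality. Second, reward and credit signals can attach to eight signal-bearing units from token to team; explicit counterfactual message-level credit is especially sparse in the curated pool. Third, orchestration learning decomposes into five sub-decisions: when to spawn, whom to delegate to, how to communicate, how to aggregate, and when to stop. As of May 4, 2026, no explicit RL training method for the stopping decision was found in the curated pool. The authors connect academic methods to public industrial evidence from Kimi Agent Swarm, OpenAI Codex, and Anthropic Claude Code, exposing a scale gap between publicly reported deployment envelopes and open academic evaluation regimes. They release an open resource comprising an 84-entry paper pool, a 32-record exclusion log, and a minimal JSON schema.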

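Core Contributions

  1. Orchestration trace framework: for the first time, formalizes RL optimization of multi-agent systems in terms of orchestration traces, event graphs covering sub-agent spawning, delegation, communication, aggregation, and stopping, which gives multi-agent RL a unified abstraction.
  2. Systematic three-axis analysis: organizes existing methods along reward design (8 families), credit assignment (8 signal-bearing units), and orchestration learning (5 sub-decisions), and identifies key gaps such as message-level counterfactual credit and RL training for the stopping decision.
  3. Evidence of an industry-academia gap: by connecting academic methods to public industrial evidence from Kimi Agent Swarm, OpenAI Codex, and Anthropic Claude Code, exposes the gap between publicly reported deployment envelopes and open academic evaluation regimes (not an independent verification of industrial training traces), giving directional guidance for follow-up research.
  4. Open resource release: provides an 84-entry tagged paper pool, a 32-record exclusion log, scripted corpus statistics, and a minimal JSON schema for replayable orchestration traces.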
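To make the orchestration-trace abstraction concrete, the following is a minimal sketch of a trace as a temporal event graph that serializes to JSON. It is not the schema released with the paper: the class names (`EventType`, `TraceEvent`, `OrchestrationTrace`), the field names (`event_id`, `step`, `parents`, `payload`), and the agent names are all hypothetical, chosen only for illustration. Only the seven event kinds come from the abstract.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Sequence


class EventType(Enum):
    # The seven event kinds listed in the abstract.
    SPAWN = "spawn"
    DELEGATE = "delegate"
    COMMUNICATE = "communicate"
    TOOL_USE = "tool_use"
    RETURN = "return"
    AGGREGATE = "aggregate"
    STOP = "stop"


@dataclass
class TraceEvent:
    # All field names are assumptions, not the paper's schema.
    event_id: int
    step: int                      # logical timestamp (append order)
    type: EventType
    actor: str                     # agent that emitted the event
    target: Optional[str] = None   # receiving agent, if any
    parents: List[int] = field(default_factory=list)  # causal predecessors
    payload: str = ""


@dataclass
class OrchestrationTrace:
    events: List[TraceEvent] = field(default_factory=list)

    def add(self, type_: EventType, actor: str, target: Optional[str] = None,
            parents: Sequence[int] = (), payload: str = "") -> int:
        # Edges may only point at earlier events, so the graph stays acyclic
        # and the trace can be replayed in append order.
        for p in parents:
            if not 0 <= p < len(self.events):
                raise ValueError(f"unknown parent event {p}")
        event_id = len(self.events)
        self.events.append(TraceEvent(event_id, event_id, type_, actor,
                                      target, list(parents), payload))
        return event_id

    def count_by_type(self) -> Dict[str, int]:
        return dict(Counter(e.type.value for e in self.events))

    def to_json(self) -> str:
        records = []
        for e in self.events:
            record = asdict(e)
            record["type"] = e.type.value  # store the enum as a plain string
            records.append(record)
        return json.dumps({"events": records}, indent=2)


# A toy trace: one orchestrator, two workers, fan-out then aggregate and stop.
trace = OrchestrationTrace()
s_a = trace.add(EventType.SPAWN, "orchestrator", "worker_a")
s_b = trace.add(EventType.SPAWN, "orchestrator", "worker_b")
d_a = trace.add(EventType.DELEGATE, "orchestrator", "worker_a", [s_a], "subtask A")
d_b = trace.add(EventType.DELEGATE, "orchestrator", "worker_b", [s_b], "subtask B")
t_a = trace.add(EventType.TOOL_USE, "worker_a", None, [d_a], "search")
r_a = trace.add(EventType.RETURN, "worker_a", "orchestrator", [t_a])
r_b = trace.add(EventType.RETURN, "worker_b", "orchestrator", [d_b])
agg = trace.add(EventType.AGGREGATE, "orchestrator", None, [r_a, r_b])
stop = trace.add(EventType.STOP, "orchestrator", None, [agg])

if __name__ == "__main__":
    print(trace.to_json())
```

Storing explicit `parents` edges rather than only timestamps is a design assumption here: it keeps the fan-out and fan-in structure (which returns fed which aggregation) recoverable, which is the kind of information message-level credit or aggregation rewards would need.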

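Method Overview

This is a survey/position paper. Its main method is a systematic literature review combined with the construction of a conceptual framework:

Experimental Results

As a survey paper, this work reports no conventional experiments. The main findings include: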

Limitations and Caveats

Related Concepts (Detailed)

Related Concepts


Imported: 2026-05-05 06:01 Source: arXiv Daily Wiki Update 2026-05-05