Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate

Basic Information

arXiv ID: 2305.19118
Authors: Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, Zhaopeng Tu
Categories: cs.CL
Import type: url

Abstract

Modern large language models (LLMs) like ChatGPT have shown remarkable performance on general language tasks but still struggle on complex reasoning tasks, which drives the research on cognitive behaviors of LLMs to explore human-like problem-solving strategies. Along this direction, one representative strategy is self-reflection, which asks an LLM to refine the solution with the feedback generated by itself iteratively. However, our study shows that such reflection-style methods suffer from the Degeneration-of-Thought (DoT) problem: once the LLM has established confidence in its solutions, it is unable to generate novel thoughts later through reflection even if its initial stance is incorrect. To address the DoT problem, we propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in the state of “tit for tat” and a judge manages the debate process to obtain a final solution. Clearly, our MAD framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation. Experiment results on two challenging datasets, commonsense machine translation and counter-intuitive arithmetic reasoning, demonstrate the effectiveness of our MAD framework. Extensive analyses suggest that the adaptive break of debate and the modest level of “tit for tat” state are required for MAD to obtain good performance. Moreover, we find that LLMs might not be a fair judge if different LLMs are used for agents. Code is available at https://github.com/Skytliang/Multi-Agents-Debate.

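Key Contributions

Identifying the Degeneration-of-Thought (DoT) problem: existing self-reflection methods have a serious flaw. Once an LLM has established confidence in its solution, later reflection fails to produce novel thoughts even when the initial stance is incorrect, so its reasoning becomes stuck.
Proposing the Multi-Agent Debate (MAD) framework: multiple agents express their arguments in a "tit for tat" manner while a judge manages the debate process, which encourages divergent thinking in LLMs and addresses the DoT problem.
Finding a fairness problem with LLM judges: when different LLMs are used as the debating agents, an LLM judge may fail to judge fairly and may show a preference.
Key factors that govern the debate: the adaptive break of the debate and a modest level of "tit for tat" are identified as necessary for MAD to perform well.
Open-source code: the code is available on GitHub (Skytliang/Multi-Agents-Debate) for reproduction and extension.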

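Method Overview

The core of the MAD (Multi-Agent Debate) framework is as follows:

Multi-agent debate mechanism: multiple LLM agents are initialized, and each independently generates an answer to the question. The debate then runs for several rounds, in which each agent revises or defends its argument in light of the other agents' views, producing a "tit for tat" dynamic.
Judge management: a judge agent (or a judge role played by the same LLM) monitors the debate, assesses the quality of each side's arguments, and controls how the debate proceeds and when it ends.
Adaptive break: the debate does not run for a fixed number of rounds. When to stop is decided dynamically from how far the arguments have converged and how consistent they are, which avoids both over-debating and stopping too early.
Encouraging divergent thinking: unlike self-reflection, MAD sparks new thoughts through the clash of different external perspectives rather than relying on the LLM's own iterative refinement, which is how it avoids the DoT problem.
Final solution generation: after the debate ends, the judge synthesizes the arguments from all sides into a final solution.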
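
The steps above can be sketched as a small loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `Debater`, `Judge`, `Verdict`, `run_debate`, the `extract` fallback, and the `max_rounds` default are hypothetical names chosen for illustration, and the LLM calls are replaced by plain Python callables so the block runs without any API access.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical type: a debater maps (question, transcript so far) to an argument.
Debater = Callable[[str, List[str]], str]


@dataclass
class Verdict:
    done: bool              # True when the judge accepts a solution
    answer: Optional[str]   # the accepted solution, if any


# Hypothetical type: a judge maps (question, transcript so far) to a verdict.
Judge = Callable[[str, List[str]], Verdict]


def run_debate(question: str, debaters: List[Debater], judge: Judge,
               extract: Callable[[str, List[str]], str],
               max_rounds: int = 3) -> str:
    """Run a debate with an adaptive break (illustrative sketch only)."""
    transcript: List[str] = []
    for _ in range(max_rounds):
        # Each debater sees everything said so far and responds in turn.
        for debater in debaters:
            transcript.append(debater(question, transcript))
        # Adaptive break: the judge may end the debate after any round.
        verdict = judge(question, transcript)
        if verdict.done and verdict.answer is not None:
            return verdict.answer
    # Round limit reached: fall back to extracting a final answer.
    return extract(question, transcript)


# Toy stand-ins for LLM calls, for illustration only.
def affirmative(question: str, transcript: List[str]) -> str:
    return "affirmative: the answer is 10"


def negative(question: str, transcript: List[str]) -> str:
    return "negative: no, the answer is 5"


def toy_judge(question: str, transcript: List[str]) -> Verdict:
    # Accepts the negative side's answer once it has spoken twice.
    rebuttals = [t for t in transcript if t.startswith("negative")]
    done = len(rebuttals) >= 2
    return Verdict(done=done, answer="5" if done else None)


def toy_extract(question: str, transcript: List[str]) -> str:
    # Fallback used when the round limit is hit: take the last statement.
    return transcript[-1]


result = run_debate("toy question", [affirmative, negative], toy_judge, toy_extract)
```

In this toy run the judge ends the debate after the second round, before `max_rounds` is reached. In a real setup each callable would wrap a prompted LLM call, and the prompt wording would control how strongly the debaters disagree.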

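Experimental Results

MAD was evaluated on two challenging datasets:

Commonsense Machine Translation: on translation tasks that require commonsense reasoning, MAD significantly outperforms single-agent self-reflection and produces more accurate translations that better fit the context.
Counter-intuitive Arithmetic Reasoning: on arithmetic reasoning tasks that require breaking out of conventional thinking, MAD likewise shows a clear advantage, with multi-agent debate helping the model overcome intuitive biases.
Key findings:
- Importance of the adaptive break: controlling the number of debate rounds is critical to performance. Too few rounds fail to elicit enough divergent thinking, while too many can cause the arguments to degrade.
- Modest level of confrontation: the degree of "tit for tat" in the debate needs to be moderate. Being either too aggressive or too mild hurts the final result.
- LLM judge bias: when the debating agents use different LLMs, the judge LLM may favor a model from its own family, which undermines fair judgment. In practice, this means the pairing of judge and agent models needs attention.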
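
One simple way to look for the judge preference described in the last finding is to tally how often each judge sides with each debater's backbone model. The sketch below is an assumption-laden illustration, not the paper's evaluation code: `win_rates` is a hypothetical helper, and the `(judge_model, winner_model)` records and model names are made-up placeholders rather than results from the paper.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def win_rates(records: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """For each judge model, return the share of debates won by each debater model."""
    counts: Dict[str, Counter] = {}
    for judge_model, winner_model in records:
        counts.setdefault(judge_model, Counter())[winner_model] += 1
    return {
        judge: {model: n / sum(c.values()) for model, n in c.items()}
        for judge, c in counts.items()
    }


# Placeholder records (hypothetical, not data from the paper).
records = [
    ("model_a", "model_a"),
    ("model_a", "model_a"),
    ("model_a", "model_b"),
    ("model_b", "model_b"),
]
rates = win_rates(records)
```

A judge whose win rate for its own model family stays well above one half, even when the debaters swap sides, would be one hint of the bias noted above. A fuller check would also need to control for which side actually holds the correct answer.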

Analysis Information

Analysis source: pdf_analysis
Analysis confidence: high
Analysis time: 2026-05-02 06:02
Keywords: GPT, LLM, large language model, RL, generation, translation
PDF path: /root/wiki/raw/papers/2305-19118.pdf

Import time: 2026-05-01 23:30
Import method: url