Building a multi-agent debate committee in Go

A single LLM call gives you one perspective. Ask the same question twice with different system prompts and you’ll get meaningfully different answers: different assumptions, different blind spots, different strengths. This isn’t a bug. It’s the foundation of a useful multi-agent pattern.
The idea is old. Juries deliberate. Academic peer review works because reviewers disagree. Design reviews surface risks that the original author missed. The mechanism is always the same: independent reasoning followed by structured synthesis. LLMs are well-suited to both steps.
In this post I’ll walk through building a debate committee in Go using Phero. Three agents argue independently, a judge synthesizes, and the whole thing runs in ~180 lines with no shared state between committee members.
What we’re building
A CLI that takes a question and runs it through a structured debate:
- Three committee members (Advocate, Skeptic, Minimalist) each produce an independent argument
- A Judge agent reads all arguments and produces a single synthesized answer
In a typical run, each committee member produces a short argument from its own perspective, and the judge returns a single synthesized recommendation.
Each member sees only the question. The judge sees all arguments. The final answer is better than any single agent would produce alone.
Why multiple agents?
The obvious question: why not just ask one model to “consider multiple perspectives”? You can, and sometimes it works. But there are structural reasons to prefer actual separation.
Prompt focus. A system prompt that says “be an advocate for the strongest approach” produces different reasoning than one that says “find the failure modes.” When you ask a single model to do both, it hedges. When you give each role its own agent, each one commits to its perspective.
Independence. Committee members don’t see each other’s arguments. This prevents anchoring, a well-known bias where the first answer dominates subsequent reasoning. The Skeptic isn’t reacting to the Advocate; it’s reasoning from scratch.
Composability. You can add or remove committee members without changing the judge. Swap the Minimalist for a Security Auditor. Add a Cost Analyst. The orchestration code doesn’t change.
Debuggability. When the final answer is wrong, you can trace it back. Was the Advocate’s proposal flawed? Did the Skeptic miss a real risk? Did the Judge weigh arguments poorly? With a single agent, you just get a wrong answer with no decomposition.
The architecture
This is a fan-out/fan-in pattern. The committee step fans out the question to N independent agents. The judge step fans in the results. There’s no iteration, no tool calling, no shared memory. Each agent makes exactly one LLM call.
Setting up the LLM
All agents share the same LLM client. The debate pattern works with any provider that implements llm.LLM.
The same setup works with a local model served via Ollama.
Using the same client for all agents keeps things simple. In production you might use a cheaper model for committee members and a stronger one for the judge, since the synthesis step benefits more from reasoning quality.
Building the committee
Each committee member is a standard agent.Agent with a different system prompt. The prompt defines the role and constrains the behavior.
The shared rules are appended to each member’s role-specific prompt. This ensures consistent format while allowing divergent reasoning.
The advocate
The Advocate’s job is to build the best case for a solution. It’s optimistic, concrete, and action-oriented.
The skeptic
The Skeptic isn’t contrarian for its own sake. It identifies what can go wrong and proposes mitigations. This is the agent that catches the assumptions others take for granted.
The minimalist
The Minimalist pushes back against over-engineering. It asks “do you actually need three agents for this?” and often the answer is no.
These three roles create a productive tension: ambition vs. caution vs. simplicity. The combination surfaces trade-offs that no single perspective would identify.
The judge
The Judge is also an agent.Agent, but with a fundamentally different job. Instead of answering the original question, it evaluates arguments.
Two things to note about the Judge’s prompt. First, it explicitly tells the model not to mention internal roles. The final answer should read as a standalone recommendation, not a summary of a debate. Second, it asks the Judge to call out weak claims. This prevents the Judge from being a simple average of the three inputs.
Orchestrating the debate
The orchestration is plain Go. No framework abstractions, no DAG definitions, no YAML. Fan out to the committee, collect results, render them for the judge, get the final answer.
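That flow can be sketched in a few lines of plain Go. The `ask` helper below is a stand-in for a single agent call (in the real example each call goes through a Phero agent); its canned replies keep the sketch runnable offline.

```go
package main

import (
	"fmt"
	"strings"
)

// ask is a stand-in for one LLM call (hypothetical; stubbed so the
// sketch runs without a provider).
func ask(system, user string) string {
	return "argument from: " + system
}

func main() {
	question := "How should we diagnose flaky tests?"
	members := []string{
		"You are the Advocate.",
		"You are the Skeptic.",
		"You are the Minimalist.",
	}

	// Fan out: each member answers the question independently.
	var args []string
	for _, sys := range members {
		args = append(args, ask(sys, question))
	}

	// Fan in: render the arguments and hand them to the judge.
	var b strings.Builder
	for i, a := range args {
		fmt.Fprintf(&b, "member_%d:\n  %s\n", i+1, a)
	}
	final := ask("You are the Judge.", b.String())
	fmt.Println(final)
}
```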
The renderJudgeInput function formats the arguments into a structured text block.
This produces YAML-like structured input that’s easy for the LLM to parse. The numbered member labels (member_1, member_2) give the Judge a way to reference specific arguments without leaking role names.
Deliberate isolation: no shared memory
Notice that committee members have no memory and don’t see each other’s output. This is intentional.
Shared context between agents creates coupling. If the Skeptic reads the Advocate’s answer first, it anchors on that answer and critiques it specifically rather than reasoning independently about the question. The resulting “debate” is really just a review, which is a different (and less useful) pattern.
By keeping members isolated, you get genuinely diverse outputs. The diversity is what makes the Judge’s synthesis valuable. If all three agents said the same thing, you wouldn’t need a committee.
Running the example
The default question is about multi-agent design for test diagnosis, which makes it a meta-example: agents debating how to build agents.
Cost and latency
This pattern makes four LLM calls per run: one per committee member plus one for the judge. The calls are sequential in this example, but the committee calls are independent and can easily be parallelized with goroutines.
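A minimal sketch of the parallel fan-out, with the LLM call stubbed out so it runs offline:

```go
package main

import (
	"fmt"
	"sync"
)

// ask stands in for one LLM call (stubbed for the sketch).
func ask(system, question string) string {
	return "argument from " + system
}

// fanOut runs all committee calls concurrently. Writing into a
// pre-sized slice by index keeps member order stable regardless of
// which goroutine finishes first.
func fanOut(systems []string, question string) []string {
	results := make([]string, len(systems))
	var wg sync.WaitGroup
	for i, sys := range systems {
		wg.Add(1)
		go func(i int, sys string) {
			defer wg.Done()
			results[i] = ask(sys, question)
		}(i, sys)
	}
	wg.Wait()
	return results
}

func main() {
	for i, r := range fanOut([]string{"advocate", "skeptic", "minimalist"}, "Should we parallelize?") {
		fmt.Printf("member_%d: %s\n", i+1, r)
	}
}
```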
With parallel execution, wall-clock time is roughly max(member latencies) + judge latency instead of sum(all latencies). For three members on GPT-4o, that’s typically ~3s instead of ~6s.
Token cost scales linearly with the number of committee members, but the judge’s input also grows. With three members producing ~200 lines each, the judge’s input can be 2-3x a single member’s output. Keep this in mind when adding members.
Variations
The debate committee pattern is flexible. Here are a few useful variations:
Multi-round debate. Instead of one round, let members respond to each other. Feed the first round’s arguments back to each member and ask them to revise. This converges toward consensus but costs more LLM calls.
Voting instead of a judge. Have each member vote on the best argument (including their own). If two or more agree, use that answer. Only invoke the judge when there’s no majority. This saves one LLM call in the common case.
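The vote-tallying step of that variation is a few lines of Go. A sketch, with `majorityVote` as a hypothetical helper name:

```go
package main

import "fmt"

// majorityVote returns the winning candidate index if any candidate
// holds a strict majority of the votes, and ok=false otherwise
// (the case where the judge would be invoked).
func majorityVote(votes []int, numCandidates int) (winner int, ok bool) {
	counts := make([]int, numCandidates)
	for _, v := range votes {
		counts[v]++
	}
	for i, c := range counts {
		if c*2 > len(votes) {
			return i, true
		}
	}
	return -1, false
}

func main() {
	// Three members each vote for the argument they find strongest.
	if w, ok := majorityVote([]int{0, 0, 2}, 3); ok {
		fmt.Println("winner:", w) // two of three agreed; skip the judge
	}
	if _, ok := majorityVote([]int{0, 1, 2}, 3); !ok {
		fmt.Println("no majority; fall back to the judge")
	}
}
```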
Specialized committees. Match the roles to the domain. For code review: Correctness Expert, Performance Expert, Security Auditor. For product decisions: User Advocate, Engineering Lead, Business Analyst.
Tool-equipped members. Give committee members tools. The Advocate could search documentation, the Skeptic could run static analysis, the Minimalist could check dependency counts. Each member’s tools reflect its role.
What to try next
- Add tracing with trace/text to see the full message flow between agents
- Parallelize the committee calls with goroutines for lower latency
- Add tools to committee members for grounded reasoning
- Try different models for different roles (fast/cheap for members, strong for the judge)
- Combine with other patterns: use a debate committee as a single step in a larger orchestrator-workers pipeline
Wrapping up
The debate committee is one of the simplest multi-agent patterns, and one of the most effective. No shared state, no complex coordination, no iteration. Just independent reasoning followed by structured synthesis.
The key insight is that LLM outputs are highly sensitive to system prompts. By giving the same question to agents with different perspectives, you get genuine diversity of thought. The judge turns that diversity into a better answer than any single agent would produce.
Building it in Go with Phero takes ~180 lines. The orchestration is explicit, the agents are standard, and the pattern is easy to extend.
The full source is at examples/debate-committee/.
Phero is an open-source Go framework for building multi-agent AI systems. Star the repo if you find it useful.