Sakana AI's RL Conductor: A 7B Model That Orchestrates GPT-5, Claude, and Gemini via Reinforcement Learning
TL;DR: Sakana AI unveiled RL Conductor, a 7B language model trained via reinforcement learning to dynamically orchestrate larger models like GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro. It outperforms both individual frontier models and human-designed multi-agent pipelines on reasoning and coding benchmarks, at a fraction of the API cost.
The multi-agent AI market is full of frameworks that wire LLMs together with hard-coded pipelines. LangChain, CrewAI, and AutoGen all rely on human developers to manually define who talks to whom, in what order, and how the final answer is assembled. That works until the query distribution shifts, which it always does in production. RL Conductor also differs from architectural innovations like SubQ's subquadratic attention, which improves the efficiency of an individual model: it targets the coordination layer between models.
Sakana AI, the Tokyo-based research lab known for its nature-inspired AI algorithms, took a fundamentally different approach. Instead of asking humans to design the orchestration logic, they trained a small model to learn it.
The RL Conductor: A 7B Brain for an LLM Orchestra
RL Conductor is a 7-billion-parameter model built on Qwen2.5-7B, trained end-to-end with reinforcement learning. Its job: analyze an incoming query, decide which models to call, design the communication topology between them, and assemble the final answer, all dynamically and without any human-authored workflow.
"While using frameworks with hard-coded pipelines like LangChain and Mixture-of-Agents can work well for specific use cases," Yujin Tang, co-author of the paper, told VentureBeat, "achieving real-world generalization in such heterogeneous applications inherently necessitates going beyond human-hardcoded designs."
The Conductor works by generating a custom workflow in natural language for each query. At every step, it produces a natural language instruction for a specific subtask, assigns a worker model, and defines an "access list" specifying which prior subtask outputs are visible to that agent. This lets it construct sequential chains, parallel tree structures, or recursive loops depending on the problem.
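To make the step structure concrete, here is a minimal Python sketch of a workflow made of (instruction, worker, access list) triples. It is an illustration under stated assumptions, not Sakana AI's implementation: the `Step` class, the `run_workflow` function, the prompt layout, and the stub workers are all hypothetical names invented for this example, and real workers would be API calls to actual models.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical worker type: takes a prompt string and returns a text response.
Worker = Callable[[str], str]


@dataclass
class Step:
    instruction: str  # natural-language subtask written by the conductor
    worker: str       # name of the worker model assigned to this step
    # Access list: indices of earlier steps whose outputs this worker may see.
    access: List[int] = field(default_factory=list)


def run_workflow(query: str, steps: List[Step], workers: Dict[str, Worker]) -> str:
    """Run the steps in order and return the output of the last step."""
    if not steps:
        raise ValueError("workflow needs at least one step")
    outputs: List[str] = []
    for i, step in enumerate(steps):
        # A step may only read outputs that already exist.
        if any(j < 0 or j >= i for j in step.access):
            raise ValueError(f"step {i} may only access earlier steps")
        context = "\n".join(f"[step {j}] {outputs[j]}" for j in step.access)
        prompt = f"Query: {query}\nInstruction: {step.instruction}\n{context}".rstrip()
        outputs.append(workers[step.worker](prompt))
    return outputs[-1]


def make_stub(name: str) -> Worker:
    # Stub standing in for a real model: reports how many prior outputs it saw.
    return lambda prompt: f"{name} saw {prompt.count('[step')} prior outputs"


workers = {name: make_stub(name) for name in ("solver_a", "solver_b", "judge")}

# Two independent attempts followed by a judge that sees both (a small tree).
steps = [
    Step("Solve the problem directly.", "solver_a"),
    Step("Solve the problem a different way.", "solver_b"),
    Step("Compare both attempts and give the final answer.", "judge", access=[0, 1]),
]

result = run_workflow("What is 2 + 2?", steps, workers)
print(result)  # judge saw 2 prior outputs
```

In this sketch the access list alone determines the topology: empty lists give independent branches, `[i - 1]` on every step gives a sequential chain, and a final step that reads several earlier outputs acts as an aggregator.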
Benchmark Results: Smaller Model, Bigger Results
On AIME25 (advanced math reasoning), GPQA-Diamond (PhD-level science), and LiveCodeBench (real-world coding), RL Conductor orchestrating frontier models outperformed both:
- Each individual frontier model running alone (GPT-5, Claude Sonnet 4, Gemini 2.5 Pro)
- Human-designed multi-agent pipelines using the same models
More importantly, it achieved these results using significantly fewer API calls and tokens per query. The Conductor learns through pure reward maximization: it discovers organically which combinations of models and communication patterns produce correct answers, without any human workflow designer.
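The idea of learning a routing preference purely from a correctness reward can be illustrated with a toy policy-gradient (REINFORCE-style) bandit. This is a deliberately simplified sketch, not the training procedure from the paper: the worker names, the simulated success rates, and the hyperparameters are all made-up assumptions, and the real Conductor generates full natural-language workflows rather than picking a single worker.

```python
import math
import random

# Hypothetical workers with simulated success rates (invented for illustration).
SUCCESS_RATE = {"worker_a": 0.2, "worker_b": 0.5, "worker_c": 0.9}
NAMES = list(SUCCESS_RATE)


def softmax(prefs):
    """Turn preference scores into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=3000, lr=0.1, seed=0):
    rng = random.Random(seed)
    prefs = [0.0] * len(NAMES)  # one learnable preference per worker
    baseline = 0.0              # running mean reward, used to reduce variance
    for t in range(1, steps + 1):
        probs = softmax(prefs)
        choice = rng.choices(range(len(NAMES)), weights=probs)[0]
        # Reward is 1 if the simulated answer is "correct", else 0.
        reward = 1.0 if rng.random() < SUCCESS_RATE[NAMES[choice]] else 0.0
        advantage = reward - baseline
        for i in range(len(prefs)):
            # Gradient of log pi(choice) with respect to preference i.
            grad = (1.0 - probs[i]) if i == choice else -probs[i]
            prefs[i] += lr * advantage * grad
        baseline += (reward - baseline) / t
    return softmax(prefs)


final_probs = train()
best = NAMES[max(range(len(NAMES)), key=final_probs.__getitem__)]
print(best, [round(p, 3) for p in final_probs])
```

Nothing in the loop tells the policy which worker is best; it only sees whether each sampled choice earned a reward, and the preferences drift toward the choices that succeed more often. The same principle, scaled up to whole workflows, is what "pure reward maximization" refers to above.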
From Research to Product: Fugu
RL Conductor isn't just a paper: it's the backbone of Fugu, Sakana AI's commercial multi-agent orchestration service currently in beta. Fugu abstracts away the complexity of LLM management for businesses, letting them deploy multi-agent systems without manually wiring pipelines.
The paper, Learning to Orchestrate Agents in Natural Language with the Conductor, was accepted to ICLR 2026 and represents a growing shift in how the industry thinks about multi-agent systems, from human-designed to learned coordination.
Why This Matters
Most current approaches to multi-agent AI treat orchestration as a software engineering problem: define the graph, write the routing logic, handle edge cases. RL Conductor treats it as a learning problem. This distinction matters because production workloads have heterogeneity that no human designer can anticipate. Combined with automation patterns like cron-triggered agent workflows, learned orchestration could make multi-agent systems far more practical for real-world deployment.
As models get cheaper and more specialized, the bottleneck shifts from model capability to model coordination. Systems that can dynamically route tasks to the right expert model, learn from experience, and adapt their coordination strategy are well placed to outperform rigid pipelines. RL Conductor is an early glimpse of that future.
Frequently Asked Questions
How is RL Conductor different from Mixture-of-Experts?
A Mixture-of-Experts (MoE) model routes tokens to expert sub-networks inside a single model's layers. RL Conductor routes entire queries between separate, independently running LLMs, coordinating their outputs, context windows, and communication flow at the agent level, not the token level.
Is RL Conductor open-source?
The research paper and arXiv preprint are publicly available. Sakana AI has not announced plans to open-source the Conductor model itself, but Fugu (the commercial service) is available in beta.
What models can RL Conductor orchestrate?
Any LLM with an accessible API. The paper demonstrates results with GPT-5, Claude Sonnet 4, Gemini 2.5 Pro, and Llama 4 Maverick. The Conductor defines orchestration in natural language, so it can integrate any model it can prompt.
Sources
- arXiv: Learning to Orchestrate Agents in Natural Language with the Conductor
- VentureBeat: How Sakana Trained a 7B Model to Orchestrate GPT, Claude and Gemini
- IA Expertos: The Hidden Master: Sakana AI's Orchestrator