PettingLLMs
Reinforcement Learning Framework for Multi-LLM Agents
Overview
PettingLLMs is an open-source framework for on-policy reinforcement learning (RL) with multi-agent large language models (LLMs).
It implements AT-GRPO (Agent- and Turn-wise Group Relative Policy Optimization), a novel algorithm and system design for training collaborative LLM agents across planning, coding, and mathematical reasoning tasks.
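To make the name concrete, here is a minimal sketch of group-relative advantage computation in which groups are keyed by agent role and turn index. It is based only on the algorithm's name and the standard GRPO normalization (reward minus group mean, divided by group standard deviation). The function name, the record layout, the role names, and the `eps` constant are assumptions for illustration, not the PettingLLMs API.

```python
from collections import defaultdict
from statistics import mean, pstdev


def agent_turn_advantages(samples, eps=1e-6):
    """Normalize rewards within each (agent, turn) group.

    `samples` is a list of dicts with hypothetical keys
    'agent', 'turn', and 'reward'. Advantages are returned in input order.
    """
    groups = defaultdict(list)
    for s in samples:
        groups[(s["agent"], s["turn"])].append(s["reward"])

    # Per-group mean and population standard deviation.
    stats = {key: (mean(vals), pstdev(vals)) for key, vals in groups.items()}

    advantages = []
    for s in samples:
        mu, sigma = stats[(s["agent"], s["turn"])]
        # eps keeps the division finite when all rewards in a group are equal.
        advantages.append((s["reward"] - mu) / (sigma + eps))
    return advantages


samples = [
    {"agent": "code", "turn": 0, "reward": 1.0},
    {"agent": "code", "turn": 0, "reward": 0.0},
    {"agent": "tool", "turn": 0, "reward": 0.5},
    {"agent": "tool", "turn": 0, "reward": 0.5},
]
advs = agent_turn_advantages(samples)
```

In this toy batch the two `code` samples receive opposite-signed advantages, while the two `tool` samples, whose rewards are identical, receive zero.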
Documentation Directory
- Home
- Getting Started
    - Installation
    - Quick Start
    - Datasets
    - Training
    - Evaluation
- Environment Setup
    - Overview
    - Data Preparation
    - Core Architecture
    - Agent Functions
    - Environment State
    - Configuration
    - Registration
- Core Concepts
    - Overview
    - AT-GRPO Algorithm
    - Three-Level Specialization
    - Training System
- Training Guides
    - Overview
    - Games
    - Planning
    - Code
    - Math
- Evaluation
    - Evaluation Guide
- Results
    - Benchmarks
    - Ablation Studies
- API Reference
    - Python API
- About Us
    - Team
Supported Training Modes
This framework supports:
- Single-agent RL training
- Multi-agent RL training (role-sharing policy)
- Multi-agent RL training (role-specialized policies using different LoRA adapters or different LLMs)
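The difference between the two multi-agent modes comes down to how agent roles map to trainable policies. The sketch below illustrates that mapping. `policy_map`, the role names, and the policy identifiers are hypothetical and do not reflect the framework's configuration schema.

```python
def policy_map(roles, role_specialized):
    """Map each agent role to a policy identifier.

    Role-sharing: every role is served by one shared policy.
    Role-specialized: every role gets its own policy
    (for example, its own LoRA adapter or its own LLM).
    """
    if role_specialized:
        return {role: f"policy_{role}" for role in roles}
    return {role: "policy_shared" for role in roles}


roles = ["tool", "code"]
shared = policy_map(roles, role_specialized=False)
specialized = policy_map(roles, role_specialized=True)
```

Single-agent training is the degenerate case with a single role, where both settings yield one policy.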
News
- [2025.10] GitHub repository open-sourced and publicly available
- [2025.10] Paper released! Check out our arXiv preprint
- [2025.10] Support for different LoRA adapters per agent role, enabling efficient role-specialized training
- [2025.09] Multi-environment support added: Game (Sudoku, Sokoban), Code (APPS, CodeContests), and Math (AIME, OlympiadBench)
- [2025.08] Multi-agent framework implementation: support for both a shared single model and role-specific models
Key Features
- Multi-Level Agent Specialization: Train and specialize agents at any level, from lightweight prompt adjustments to full model fine-tuning with LoRA or reinforcement learning.
- Novel RL Algorithm: Implements AT-GRPO (Agent- and Turn-wise GRPO) for efficient and stable multi-agent training.
- Built-in Multi-Turn MAS Workflows: Comes with predefined, reproducible benchmarks and environments for a variety of domains:
    - Games: Sudoku (4x4), Sokoban (6x6)
    - Planning: Plan-Path (10x10 grid)
    - Coding: APPS, CodeContests, LiveCodeBench
    - Math: AIME24/25, OlympiadBench
Roadmap
- More Environments: Verilog design, web search, robotics, database query, scientific discovery
- Multi-Modal Support: Vision-language models, audio processing, mixed-modal tasks
- Agentic Framework Integration: AutoGen, LangGraph, CrewAI, and custom framework APIs
Key Results
Table 3 · Ablation on Plan-Path (Qwen3-1.7B)
| Method | Acc. (%) | Δ |
|---|---|---|
| Single agent | 5.00 | - |
| Training tool agent in SA, eval in SA | 11.00 | +6.00 |
| Training code agent in SA, eval in SA | 14.50 | +9.50 |
| Training in SA, eval in MAS | 16.00 | +11.00 |
| MAS RL (role specific policies), eval in MAS | 96.00 | +91.00 |
| w/ Swapped Policies | 6.00 | +1.00 |
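The last column of the table is each accuracy minus the single-agent baseline of 5.00. A short check of that arithmetic, using only the values in the table (the variable names are illustrative):

```python
baseline = 5.00  # single-agent accuracy (%)

accuracy = {
    "Training tool agent in SA, eval in SA": 11.00,
    "Training code agent in SA, eval in SA": 14.50,
    "Training in SA, eval in MAS": 16.00,
    "MAS RL (role specific policies), eval in MAS": 96.00,
    "w/ Swapped Policies": 6.00,
}

# Improvement over the baseline, in percentage points.
deltas = {name: round(acc - baseline, 2) for name, acc in accuracy.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
```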
Environment Workflows (MA vs. SA)
Installation
Quick Start
1. Dataset Preparation
Prepare datasets for different tasks:
```bash
# Code tasks (APPS, CodeContests, LiveCodeBench)
python scripts/dataprocess/load_code.py
# Math tasks (AIME24/25, OlympiadBench)
python scripts/dataprocess/load_math.py
# Game/Planning tasks (Sokoban, Sudoku)
python scripts/dataprocess/load_sokoban.py
```
Datasets will be saved to datasets/code/, datasets/math/, and datasets/sudoku_environments/.
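After running the scripts, a quick way to confirm the outputs landed is to check for the directories listed above. This is a small convenience sketch, not part of PettingLLMs: the function name `missing_dataset_dirs` is hypothetical, and the paths are assumed to be relative to the repository root.

```python
from pathlib import Path

# Directory names taken from the sentence above.
EXPECTED_DIRS = [
    "datasets/code",
    "datasets/math",
    "datasets/sudoku_environments",
]


def missing_dataset_dirs(root="."):
    """Return the expected dataset directories that do not exist under `root`."""
    base = Path(root)
    return [d for d in EXPECTED_DIRS if not (base / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dataset_dirs()
    print("all present" if not missing else f"missing: {missing}")
```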
2. Training
Example: Train a multi-agent system on math tasks
Other training scripts available in scripts/train/:
- code_single_policy.sh, code_two_policy.sh - Code domain
- plan_path_single.sh, plan_path_two_policy.sh - Planning domain
- sokoban_two_policy.sh, sokodu_single.sh - Game domain
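For scripted experiments, the script names listed above can be collected into a lookup table. The sketch below only builds a command line. Invoking the scripts with `bash` from the repository root is an assumption, and `train_command` is a hypothetical helper, not part of the framework.

```python
# Script names copied verbatim from the list above.
TRAIN_SCRIPTS = {
    "code": ["code_single_policy.sh", "code_two_policy.sh"],
    "planning": ["plan_path_single.sh", "plan_path_two_policy.sh"],
    "game": ["sokoban_two_policy.sh", "sokodu_single.sh"],
}


def train_command(domain, script):
    """Return the argv list for a training script, validating the name first."""
    if script not in TRAIN_SCRIPTS.get(domain, []):
        raise ValueError(f"unknown script {script!r} for domain {domain!r}")
    return ["bash", f"scripts/train/{script}"]


cmd = train_command("planning", "plan_path_two_policy.sh")
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to launch the run.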
3. Evaluation
Example: Evaluate a trained model
Edit scripts/evaluate/evaluate.sh to set your model path and config:
Then run:
Three Levels of Agent Specialization
PettingLLMs uses a tiered approach to define agent roles, ranging from simple instructions to deep model specialization.
| Level | Role Specialization Method | Description |
|---|---|---|
| L0 | Shared model | Roles are defined solely through instructions in the prompt. The base model is identical for all agents, offering a flexible but performance-limited baseline. |
| L1 | LoRA | Each role is specialized using a unique, lightweight LoRA adapter. This creates distinct, cost-effective agent "personalities" on top of a shared base model. |
| L2 | Full-Model | The entire model's weights are optimized for a specific role using reinforcement learning. This creates a highly specialized expert agent for maximum performance on complex tasks. |
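One way to read the table is in terms of what each role owns. The sketch below counts distinct model weight sets and LoRA adapters per level for a given number of roles. It is an interpretation of the table with hypothetical names (`Level`, `resources`), and it assumes L1 uses exactly one shared base model, as the table describes.

```python
from enum import Enum


class Level(Enum):
    L0 = "shared model"
    L1 = "LoRA"
    L2 = "full-model"


def resources(level, n_roles):
    """Return (distinct model weight sets, LoRA adapters) for `n_roles` roles."""
    if level is Level.L0:
        return (1, 0)        # one model; roles differ only by prompt
    if level is Level.L1:
        return (1, n_roles)  # shared base model, one adapter per role
    return (n_roles, 0)      # one fully optimized model per role


for level in Level:
    print(level.name, resources(level, n_roles=2))
```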
Quick Links
- Installation Guide - Get started in minutes
- Quick Start Tutorial - Run your first training
- Core Concepts - Understand the framework
- Training Guides - Train on different tasks
- API Reference - Detailed API documentation
Acknowledgements
This work was primarily conducted by Yujie Zhao during her summer internship at Intel Corporation. We gratefully acknowledge Intel's support and resources that made this research possible.
- VERL: Efficient RL Training for LLMs - For efficient distributed RL training infrastructure
- RLLM: Reinforcement Learning with Language Models - For foundational RL algorithms for LLMs
License
Released under the MIT license. See LICENSE for details.