PettingLLMs

Reinforcement Learning Framework for Multi LLM Agents 🚀🌟

[Figure: PettingLLMs overview]

Overview

PettingLLMs is an open-source framework for on-policy reinforcement learning (RL) with multi-agent large language models (LLMs).

It implements AT-GRPO (Agent- and Turn-wise Group Relative Policy Optimization), a novel algorithm and system design for training collaborative LLM agents across planning, coding, and mathematical reasoning tasks.
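To make the "agent- and turn-wise group relative" idea concrete, here is a minimal sketch of group-relative advantage normalization where samples are grouped by (agent, turn) rather than pooled together. It illustrates the grouping that the algorithm's name suggests and is not the framework's implementation; the function name `grouped_advantages`, the sample fields, and the role names are assumptions made for illustration, and the paper's exact grouping and normalization may differ.

```python
from collections import defaultdict
from statistics import mean, pstdev


def grouped_advantages(samples, eps=1e-6):
    """Normalize rewards within each (agent, turn) group.

    samples: list of dicts with keys 'agent', 'turn', 'reward' (hypothetical schema).
    Returns a list of advantages aligned with the input order.
    """
    groups = defaultdict(list)
    for i, s in enumerate(samples):
        groups[(s["agent"], s["turn"])].append(i)

    advantages = [0.0] * len(samples)
    for indices in groups.values():
        rewards = [samples[i]["reward"] for i in indices]
        mu = mean(rewards)
        sigma = pstdev(rewards)  # population std within the group
        for i in indices:
            # GRPO-style advantage: reward relative to the group's mean and spread
            advantages[i] = (samples[i]["reward"] - mu) / (sigma + eps)
    return advantages


# Hypothetical rollouts: two roles, one turn each, two samples per group.
samples = [
    {"agent": "coder", "turn": 0, "reward": 1.0},
    {"agent": "coder", "turn": 0, "reward": 0.0},
    {"agent": "tester", "turn": 0, "reward": 0.5},
    {"agent": "tester", "turn": 0, "reward": 0.5},
]
advantages = grouped_advantages(samples)
```

In this toy input the two "coder" samples receive opposite-signed advantages, while the two "tester" samples, which have identical rewards, receive zero. A sample is compared only against others from the same role at the same turn.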

📚 Documentation Directory

Supported Training Modes

This framework supports:

  • ✅ Single-agent RL training
  • ✅ Multi-agent RL training (role-sharing policy)
  • ✅ Multi-agent RL training (role-specialized policies using different LoRA adapters or different LLMs)
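The difference between the role-sharing and role-specialized modes can be pictured as a mapping from agent roles to policies. The sketch below is purely illustrative: `assign_policies`, the role names, and the policy identifiers are hypothetical and do not correspond to the framework's configuration keys.

```python
def assign_policies(roles, specialized):
    """Map each agent role to a policy identifier.

    specialized=False: every role points at one shared policy (role-sharing).
    specialized=True:  every role gets its own policy (role-specialized),
                       e.g. a separate LoRA adapter or a separate LLM.
    """
    if specialized:
        return {role: f"policy_{role}" for role in roles}
    return {role: "policy_shared" for role in roles}


roles = ["planner", "coder"]  # hypothetical role names
shared = assign_policies(roles, specialized=False)
split = assign_policies(roles, specialized=True)
single = assign_policies(["solver"], specialized=False)  # single-agent case
```

With a shared policy, gradients from every role update the same parameters. With specialized policies, each role's rollouts update only that role's policy.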

📰 News

  • [2025.10] 🚀 GitHub repository open-sourced and publicly available
  • [2025.10] 🎉 Paper released! Check out our arXiv preprint
  • [2025.10] 🔥 Support for a different LoRA adapter per agent role, enabling efficient role-specialized training
  • [2025.09] 🌍 Multi-environment support added: Game (Sudoku, Sokoban), Code (APPS, CodeContests), and Math (AIME, OlympiadBench)
  • [2025.08] 🤖 Multi-agent framework implementation: support for both a shared single model and role-specific models

🚀 Key Features

  • Multi-Level Agent Specialization: Train and specialize agents at any level, from prompt-only role definitions through per-role LoRA adapters to full-model reinforcement learning.
  • Novel RL Algorithm: Implements AT-GRPO (Agent- and Turn-wise GRPO) for efficient and stable multi-agent training.
  • Built-in Multi-Turn MAS Workflows: Comes with predefined, reproducible benchmarks and environments for a variety of domains:
      • 🎮 Games: Sudoku (4x4), Sokoban (6x6)
      • 📐 Planning: Plan-Path (10x10 grid)
      • 💻 Coding: APPS, CodeContests, LiveCodeBench
      • 🔢 Math: AIME24/25, OlympiadBench

🚩 Roadmap

  • More Environments: Verilog design, web search, robotics, database query, scientific discovery
  • Multi-Modal Support: Vision-language models, audio processing, mixed-modal tasks
  • Agentic Framework Integration: AutoGen, LangGraph, CrewAI, and custom framework APIs

📊 Key Results

[Figure: PettingLLMs performance]

Table 3 · Ablation on Plan-Path (Qwen3-1.7B)

| Method | Acc. (%) | Δ |
|---|---|---|
| Single agent | 5.00 | – |
| Training tool agent in SA, eval in SA | 11.00 | +6.00 |
| Training code agent in SA, eval in SA | 14.50 | +9.50 |
| Training in SA, eval in MAS | 16.00 | +11.00 |
| MAS RL (role specific policies), eval in MAS | 96.00 | +91.00 |
| w/ Swapped Policies | 6.00 | +1.00 |
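The Δ column is each row's accuracy minus the single-agent baseline of 5.00. A short check of that arithmetic, using only the numbers from the table (the variable names are illustrative):

```python
# Accuracy values (%) copied from Table 3; the first row is the baseline.
baseline = 5.00
results = {
    "Training tool agent in SA, eval in SA": 11.00,
    "Training code agent in SA, eval in SA": 14.50,
    "Training in SA, eval in MAS": 16.00,
    "MAS RL (role specific policies), eval in MAS": 96.00,
    "w/ Swapped Policies": 6.00,
}

# Delta = accuracy minus the single-agent baseline, in percentage points.
deltas = {name: round(acc - baseline, 2) for name, acc in results.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
```

The deltas are absolute percentage-point differences, not relative improvements.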

๐Ÿ” Environment Workflows (MA vs. SA)

PettingLLMs worker

📦 Installation

git clone https://github.com/pettingllms-ai/PettingLLMs.git
cd PettingLLMs
bash setup.bash

🎯 Quick Start

1. Dataset Preparation

Prepare datasets for different tasks:

# Code tasks (APPS, CodeContests, LiveCodeBench)
python scripts/dataprocess/load_code.py

# Math tasks (AIME24/25, OlympiadBench)
python scripts/dataprocess/load_math.py

# Game/Planning tasks (Sokoban, Sudoku)
python scripts/dataprocess/load_sokoban.py

Datasets will be saved to datasets/code/, datasets/math/, and datasets/sudoku_environments/.
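After running the loaders, a quick sanity check can confirm that the output directories exist. The helper below is a hypothetical convenience script, not part of the repository; it assumes only the three directory names listed above and that it is run from the repository root.

```python
from pathlib import Path

# Output directories named in this README.
EXPECTED_DIRS = (
    "datasets/code",
    "datasets/math",
    "datasets/sudoku_environments",
)


def missing_dataset_dirs(root="."):
    """Return the expected dataset directories that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    missing = missing_dataset_dirs()
    print("missing:", ", ".join(missing) if missing else "none")
```

An empty result means every expected directory is present. It does not verify the contents of the datasets.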

2. Training

Example: Train multi-agent system on math tasks

bash scripts/train/math/math_L1_prompt.sh

Other training scripts available in scripts/train/:

  • code_single_policy.sh, code_two_policy.sh - Code domain
  • plan_path_single.sh, plan_path_two_policy.sh - Planning domain
  • sokoban_two_policy.sh, sokodu_single.sh - Game domain

3. Evaluation

Example: Evaluate trained model

Edit scripts/evaluate/evaluate.sh to set your model path and config:

MODEL_PATHS=("/path/to/your/model")
CONFIG_NAME="math_single_policy"

Then run:

bash scripts/evaluate/evaluate.sh

🧱 Three Levels of Agent Specialization

PettingLLMs uses a tiered approach to define agent roles, ranging from simple instructions to deep model specialization.

| Level | Role Specialization Method | Description |
|---|---|---|
| L0 | Shared model | Roles are defined solely through instructions in the prompt. The base model is identical for all agents, offering a flexible but performance-limited baseline. |
| L1 | LoRA | Each role is specialized using a unique, lightweight LoRA adapter. This creates distinct, cost-effective agent "personalities" on top of a shared base model. |
| L2 | Full-Model | The entire model's weights are optimized for a specific role using reinforcement learning. This creates a highly specialized expert agent for maximum performance on complex tasks. |
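To illustrate what L1 means mechanically, the sketch below applies the standard LoRA update, W' = W + (alpha / r) · B · A, to one shared base matrix with a different low-rank pair per role. It uses tiny hand-written matrices in plain Python so the arithmetic is easy to follow; the role names, matrix values, and helper names are all hypothetical, and the framework itself may organize adapters differently.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(base, lora_b, lora_a, alpha, rank):
    """Effective weight under LoRA: base + (alpha / rank) * (B @ A)."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base, delta)
    ]


# One shared (frozen) 2x2 base weight for every role.
base = [[1.0, 0.0], [0.0, 1.0]]

# Per-role rank-1 adapters: B is 2x1, A is 1x2 (hypothetical values).
adapters = {
    "planner": ([[1.0], [0.0]], [[0.0, 2.0]]),
    "coder": ([[0.0], [1.0]], [[3.0, 0.0]]),
}

weights = {
    role: lora_weight(base, b, a, alpha=1.0, rank=1)
    for role, (b, a) in adapters.items()
}
```

Each role ends up with a different effective weight while the base matrix stays untouched, which is why L1 is cheaper than L2: only the small B and A matrices are trained and stored per role.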

🔗 Acknowledgements

This work was primarily conducted by Yujie Zhao during her summer internship at Intel Corporation. We gratefully acknowledge Intel's support and resources that made this research possible.

📌 License

Released under the MIT license. See LICENSE for details.