Zing Forum

Adversarial Coevolution: An Innovative Framework for Training PPO Agents with LLMs as Opponents

An open-source project combining reinforcement learning (RL) and large language models (LLMs). By training PPO agents against LLM opponents in the card game Gin Rummy, it reaches a 99.12% win rate against a random opponent and demonstrates the potential of knowledge distillation and curriculum learning in complex incomplete-information environments.

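Tags: Reinforcement Learning, PPO, Large Language Models, LLM, Curriculum Learning, Knowledge Distillation, Gin Rummy, Incomplete-Information Games, Adversarial Training, Stable Baselines 3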
Published 2026-05-30 06:13 | Recent activity 2026-05-30 06:22 | Estimated read 9 min

Section 01

Introduction to the Adversarial Coevolution Framework: Innovative Exploration of LLM-Assisted PPO Agent Training

This article introduces an open-source project that combines reinforcement learning (RL) with large language models (LLMs). Its core is an adversarial coevolution framework that trains PPO agents against LLM opponents in the card game Gin Rummy; the curriculum-trained agent reaches a 99.12% win rate against a random opponent. The project demonstrates the potential of knowledge distillation and curriculum learning in complex incomplete-information environments and offers a new approach to RL training. It was developed by the Nikelroid team and is open-sourced on GitHub (https://github.com/Nikelroid/adversarial-coevolution); the repository was created in September 2025 and last updated in May 2026.

Section 02

Project Background and Motivation

Training high-performance RL agents often runs into two problems: reliable opponents are scarce, and human feedback is expensive. Traditional self-play also tends to get stuck in local optima, producing agents that rely on a single narrow strategy. The Nikelroid team proposes an adversarial coevolution framework that uses LLMs as zero-shot strategic opponents to guide PPO agent learning. The core insight is that LLMs carry broad common-sense strategic knowledge and can act as "teachers" that supply diverse adversarial experience. The project validates the idea on Gin Rummy, a classic incomplete-information game, and shows how an LLM's semantic understanding can be distilled into an efficient neural network policy.

Section 03

Technical Architecture and Core Components

The project adopts a three-module decoupled architecture:

  1. PPO Agent: Built on Stable Baselines3 and PyTorch, with a custom PPO implementation that supports valid-action masking to handle the complex action space. It is tuned for partially observed environments, where the agent must cope with hidden information and reason probabilistically.
  2. LLM Agent: Uses prompt engineering to convert game states into Chain-of-Thought prompts. It supports models such as Llama3, Gemma, and GPT, integrated via Ollama and the HuggingFace API, and provides both action selection and rich learning signals.
  3. Curriculum Learning Orchestrator: Runs a three-stage curriculum (random opponent → self-play → adversarial LLM), manages a model pool API (RAM caching, dynamic opponent switching), and supports a multi-process training pipeline on 64-96 cores.
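
The valid-action masking mentioned for the PPO agent can be illustrated with a minimal sketch. This is not the project's code: the function names `masked_action_probs` and `sample_action` are hypothetical, and the example assumes a flat discrete action space with a boolean mask of legal moves supplied by the environment.

```python
import numpy as np

def masked_action_probs(logits, legal_mask):
    """Softmax restricted to legal actions; illegal actions get probability 0.

    Hypothetical helper for illustration, not taken from the project.
    """
    logits = np.asarray(logits, dtype=np.float64)
    legal_mask = np.asarray(legal_mask, dtype=bool)
    if not legal_mask.any():
        raise ValueError("at least one action must be legal")
    # Illegal actions are set to -inf so exp(-inf) contributes exactly 0.
    masked = np.where(legal_mask, logits, -np.inf)
    masked = masked - masked.max()  # subtract the max for numerical stability
    exp = np.exp(masked)
    return exp / exp.sum()

def sample_action(logits, legal_mask, rng):
    """Sample an action index from the masked policy distribution."""
    probs = masked_action_probs(logits, legal_mask)
    return int(rng.choice(len(probs), p=probs))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Three actions; the middle one is illegal in this made-up state.
    print(masked_action_probs([1.0, 2.0, 3.0], [True, False, True]))
    print(sample_action([1.0, 2.0, 3.0], [True, False, True], rng))
```

Masking at the logit level keeps the policy a proper probability distribution over legal moves, so the agent never spends probability mass (or gradient signal) on actions the game would reject.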

Section 04

Key Technical Implementation Details

  • Curriculum Learning Engineering: The model pool API keeps opponents fully cached in RAM to avoid repeated loading overhead, and the orchestrator switches opponent type during training based on win-rate thresholds so that the agent always faces a moderately difficult challenge.
  • Knowledge Distillation Mechanism: The project uses adversarial distillation. The RL agent plays against LLM opponents and internalizes strategic intuition from their behavior, which fits the exploration-exploitation nature of RL better than direct imitation.
  • Evaluation Environment: A Gin Rummy evaluation environment built on the PettingZoo framework, with a web interface for human-vs-agent and agent-vs-agent matches, is used to check how well strategies generalize.
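
The RAM-cached model pool and threshold-based opponent switching described above can be sketched as follows. The article only states that switching is driven by win-rate thresholds; the class names `ModelPool` and `Curriculum`, the threshold of 0.8, and the window of 100 games are assumptions made for illustration.

```python
from collections import deque

# Stage order as described in the article: random -> self-play -> LLM.
STAGES = ("random", "self_play", "llm")

class ModelPool:
    """Keeps opponents in RAM so each one is loaded at most once (hypothetical)."""

    def __init__(self, loaders):
        self._loaders = loaders  # maps stage name -> zero-argument loader
        self._cache = {}
        self.load_count = 0      # how many times a loader actually ran

    def get(self, name):
        if name not in self._cache:
            self._cache[name] = self._loaders[name]()
            self.load_count += 1
        return self._cache[name]

class Curriculum:
    """Advances to the next stage once the recent win rate reaches a threshold."""

    def __init__(self, pool, threshold=0.8, window=100):  # assumed values
        self.pool = pool
        self.threshold = threshold
        self.stage = 0
        self.results = deque(maxlen=window)

    def opponent(self):
        return self.pool.get(STAGES[self.stage])

    def record(self, won):
        """Record one game result and advance the stage if warranted."""
        self.results.append(1 if won else 0)
        window_full = len(self.results) == self.results.maxlen
        is_last_stage = self.stage == len(STAGES) - 1
        if window_full and not is_last_stage:
            win_rate = sum(self.results) / len(self.results)
            if win_rate >= self.threshold:
                self.stage += 1
                self.results.clear()  # start a fresh window for the new stage

if __name__ == "__main__":
    # Placeholder loaders; a real pool would load policies or LLM clients.
    loaders = {name: (lambda n=name: f"{n}-opponent") for name in STAGES}
    curriculum = Curriculum(ModelPool(loaders), threshold=0.8, window=5)
    for _ in range(5):
        curriculum.record(True)
    print(curriculum.opponent())
```

Clearing the window after each promotion means the agent has to earn the next promotion against the new opponent, rather than carrying over wins from the easier stage.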

Section 05

Experimental Results and Performance

The project's experimental results are as follows:

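Agent Type                | Opponent    | Win Rate | Key Observations
--------------------------|-------------|----------|------------------------------------------------------------
PPO (Baseline)            | Random      | 98.9%    | High win rate but biased towards Gin strategy (local optimum)
PPO (Curriculum Learning) | Random      | 99.12%   | Balanced strategy (Knock vs. Gin)
GPT-OSS (20B)             | Random      | 100%     | Zero-shot performance (5-0 in matches)
GPT-OSS (20B)             | PPO (Knock) | 60%      | Competitive matches (3-2 score)

Key findings: After curriculum learning, the PPO agent's win rate against a random opponent rises slightly (98.9% to 99.12%) and its strategy becomes more balanced between Knock and Gin. The project takes this as evidence that adversarial training against LLMs helps the agent escape the Gin-biased local optimum.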
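
The two GPT-OSS rows rest on five matches each, so their win rates carry wide uncertainty. The sketch below is not part of the project; it applies the standard Wilson score interval at a 95% level (z = 1.96, an assumed choice) to the match counts in the table to show how wide that uncertainty is.

```python
import math

def wilson_interval(wins, games, z=1.96):
    """Wilson score interval for a win rate; z=1.96 gives roughly 95% coverage."""
    if games <= 0:
        raise ValueError("games must be positive")
    p = wins / games
    denom = 1 + z * z / games
    center = (p + z * z / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z * z / (4 * games * games)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)

if __name__ == "__main__":
    # Match counts taken from the results table: 5-0 and 3-2.
    for label, wins, games in [("GPT-OSS vs Random", 5, 5),
                               ("GPT-OSS vs PPO (Knock)", 3, 5)]:
        low, high = wilson_interval(wins, games)
        print(f"{label}: {wins}/{games} wins, interval {low:.2f} to {high:.2f}")
```

For the 3-2 result the interval spans roughly 0.23 to 0.88, so the 60% figure is compatible with either side being the stronger player; a larger number of matches would be needed to separate them.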

Section 06

Practical Application Value and Insights

  • RL Training Paradigm: LLMs can serve as "cheap yet powerful" stand-in opponents, which may suit complex fields such as financial trading and cybersecurity where expert demonstrations are hard to obtain.
  • New Dimension of Knowledge Distillation: The project demonstrates a distillation path from general-purpose LLMs to specialized policy networks, applicable to scenarios where semantic knowledge must be converted into action strategies.
  • Incomplete-Information Games: The Gin Rummy results suggest that LLM-assisted training can help with hidden information and probabilistic reasoning.

Section 07

Limitations and Future Directions

Limitations:

  1. Computational Cost: LLM inference costs more than pure self-play, so budget and performance must be balanced.
  2. Generalization: The approach has only been validated in Gin Rummy; its performance in other complex games still needs testing.
  3. LLM Dependence: Performance depends on the LLM's strategic capabilities; differences between models need further research.

Future Directions:

  1. Expand to multi-agent collaboration scenarios.
  2. Explore efficient offline distillation methods.
  3. Validate in other incomplete-information games such as poker and bridge.

Section 08

Summary and Core Points

The adversarial coevolution framework combines the semantic reasoning of LLMs with the neural decision-making of RL. It uses LLMs as "strategic mentors" rather than as sources of supervision signals, with the aim of more balanced and robust strategy learning. Two insights stand out: when RL draws on an external knowledge source, adversarial training can encourage exploration better than supervised learning does, and the three-stage curriculum design provides a reusable template. The project's open-source implementation (training pipeline, evaluation environment, web interface) gives the community an experimental platform for further work on LLM-assisted RL.