arms-automatic-reward-shaping-marl - SKILL.md Agent Skill

name: arms-automatic-reward-shaping-marl
description: ARMS (Automatic Reward-shaping in Multi-agent Systems), a self-supervised reward-shaping method for sparse-reward MARL with Nash equilibrium preservation guarantees.

ARMS: Automatic Reward Shaping for Multi-Agent RL

Overview

ARMS is a self-supervised reward-shaping framework for Multi-Agent Reinforcement Learning (MARL). It learns dense shaping signals from sparse environmental rewards through trajectory ranking, and it preserves the set of Nash equilibria provided the shaping rewards satisfy its conditional best-response conditions.

Core Methodology

Problem

Sparse rewards in MARL induce non-stationarity and make reward design delicate
Single-agent trajectory-ranking guarantees don't transfer to MARL
Need to preserve strategic structure, not just improve short-term optimization

Solution: ARMS Framework

Trajectory Ranking: Rank trajectories by sparse environmental reward
Shaping Reward Learning: Learn dense shaping signal from rankings (self-supervised)
Conditional Best-Response Reasoning: Reformulate policy invariance for MARL
Alternating Training: Policy learning ↔ Reward learning with shared parameters
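
The ranking and shaping-reward steps can be sketched with a pairwise Bradley-Terry loss, a common choice for learning rewards from ranked trajectories. The source does not specify the loss ARMS uses, so the linear reward model, the per-step feature vectors, and every function name below are assumptions made for illustration.

```python
import math

def shaped_return(weights, trajectory):
    """Sum of a linear shaping reward w . phi(s) over a trajectory's feature vectors."""
    return sum(sum(w * x for w, x in zip(weights, step)) for step in trajectory)

def make_pairs(trajectories_with_returns):
    """Rank by sparse return and emit (worse, better) pairs with strictly different returns."""
    ranked = sorted(trajectories_with_returns, key=lambda item: item[1])
    return [(ranked[i][0], ranked[j][0])
            for i in range(len(ranked)) for j in range(i + 1, len(ranked))
            if ranked[i][1] < ranked[j][1]]

def pairwise_ranking_loss(weights, worse, better):
    """Negative log-likelihood that `better` outranks `worse` under Bradley-Terry."""
    diff = shaped_return(weights, better) - shaped_return(weights, worse)
    # -log(sigmoid(diff)) written as softplus(-diff) for numerical stability
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def feature_sum(trajectory):
    return [sum(step[d] for step in trajectory) for d in range(len(trajectory[0]))]

def sgd_step(weights, worse, better, lr=0.1):
    """One gradient step on the pairwise loss for the linear model."""
    diff = shaped_return(weights, better) - shaped_return(weights, worse)
    # gradient of the loss w.r.t. w is -sigmoid(-diff) * (F_better - F_worse)
    scale = sigmoid(-diff)
    fb, fw = feature_sum(better), feature_sum(worse)
    return [w + lr * scale * (b - a) for w, b, a in zip(weights, fb, fw)]

# Hypothetical data: two 2-step trajectories with 2-dimensional step features.
low = [[0.0, 1.0], [0.0, 1.0]]    # sparse return 0.0
high = [[1.0, 0.0], [1.0, 0.0]]   # sparse return 1.0
pairs = make_pairs([(high, 1.0), (low, 0.0)])

weights = [0.0, 0.0]
loss_before = sum(pairwise_ranking_loss(weights, w, b) for w, b in pairs)
for _ in range(50):
    for worse, better in pairs:
        weights = sgd_step(weights, worse, better)
loss_after = sum(pairwise_ranking_loss(weights, w, b) for w, b in pairs)
```

In a real setup the linear model would typically be a neural network, and the features would come from each agent's observations; the pairing and loss structure stay the same.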

Key Theorem

If the shaping rewards satisfy the conditional best-response conditions, they preserve each agent's best-response set under fixed opponent policies, and therefore preserve the set of Nash equilibria.
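
The link between best-response sets and Nash equilibria can be checked on a toy matrix game. The sketch below uses one simple sufficient condition chosen only for illustration: a shaping term for agent 0 that depends solely on the opponent's action. The source does not spell out the exact conditional best-response conditions, and the game, the bonus values, and the function names are hypothetical.

```python
def best_responses(payoff, opponent_action, player):
    """Actions maximizing `player`'s payoff against a fixed opponent action.

    payoff[a0][a1] is the payoff for joint action (a0, a1).
    """
    n = len(payoff) if player == 0 else len(payoff[0])

    def value(a):
        return payoff[a][opponent_action] if player == 0 else payoff[opponent_action][a]

    best = max(value(a) for a in range(n))
    return {a for a in range(n) if value(a) == best}

def pure_nash(payoff0, payoff1):
    """Joint actions where each agent plays a best response to the other."""
    rows, cols = len(payoff0), len(payoff0[0])
    return {(a0, a1) for a0 in range(rows) for a1 in range(cols)
            if a0 in best_responses(payoff0, a1, 0)
            and a1 in best_responses(payoff1, a0, 1)}

# Two-agent coordination game: both agents are paid only when they match.
payoff0 = [[1.0, 0.0], [0.0, 2.0]]
payoff1 = [[1.0, 0.0], [0.0, 2.0]]

# Shaping for agent 0 that depends only on the opponent's action a1:
# it shifts every candidate response by the same amount, so the argmax is unchanged.
bonus_on_opponent = [0.5, -0.3]
shaped_ok = [[payoff0[a0][a1] + bonus_on_opponent[a1] for a1 in range(2)]
             for a0 in range(2)]

# Shaping that rewards agent 0's own action 0 changes its best responses.
shaped_bad = [[payoff0[a0][a1] + (3.0 if a0 == 0 else 0.0) for a1 in range(2)]
              for a0 in range(2)]

original = pure_nash(payoff0, payoff1)
preserved = pure_nash(shaped_ok, payoff1)
distorted = pure_nash(shaped_bad, payoff1)
```

Here `preserved` equals `original`, while `distorted` loses the higher-payoff equilibrium, which is the kind of strategic distortion the theorem is meant to rule out.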

Implementation Steps

Collect trajectories under sparse environmental rewards
Rank trajectories by final reward
Learn shaping reward function via trajectory ranking loss
Alternate between policy update (with shaping) and shaping reward update
Monitor for oscillation failure mode; increase exploration if detected
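
The first four steps can be laid out as a control-flow skeleton. This is a minimal sketch, not the authors' training loop: the three callables, their signatures, and the choice to update the shaping model before the policy in each round are assumptions, and the parameter sharing mentioned above is not modeled.

```python
def alternating_training(collect, update_shaping, update_policy,
                         policy, shaping, rounds):
    """Alternate shaping-reward learning and policy learning for `rounds` rounds.

    All three callables are hypothetical placeholders:
      collect(policy) -> list of (trajectory, sparse_return) pairs
      update_shaping(shaping, ranked) -> new shaping model
      update_policy(policy, shaping, batch) -> new policy
    """
    mean_returns = []
    for _ in range(rounds):
        batch = collect(policy)                            # step 1: sparse-reward rollouts
        ranked = sorted(batch, key=lambda item: item[1])   # step 2: worst to best
        shaping = update_shaping(shaping, ranked)          # step 3: ranking-loss update
        policy = update_policy(policy, shaping, batch)     # step 4: policy update with shaping
        mean_returns.append(sum(r for _, r in batch) / len(batch))
    return policy, shaping, mean_returns


# Toy stand-ins so the loop runs; they only count how many updates happened.
def toy_collect(policy):
    return [(["s0", "s1"], 0.0), (["s0", "s2"], 1.0)]

def toy_update_shaping(shaping, ranked):
    return shaping + 1

def toy_update_policy(policy, shaping, batch):
    return policy + 1

policy, shaping, mean_returns = alternating_training(
    toy_collect, toy_update_shaping, toy_update_policy,
    policy=0, shaping=0, rounds=3)
```

The returned `mean_returns` history is a natural signal to watch for the oscillation failure mode in step 5.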

Applications

Sparse-reward multi-agent coordination tasks
Dec-POMDPs with delayed rewards
Cooperative multi-agent navigation
Team-based RL tasks

Pitfalls

Oscillation Failure Mode: Coupled policy-reward dynamics can oscillate
Mitigation: Increase exploration to stabilize dynamics
Don't: Apply single-agent reward shaping guarantees directly to MARL
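
The source does not say how oscillation is detected or how much exploration to add. One hypothetical heuristic is to count sign flips in the round-to-round change of the mean sparse return and raise an exploration rate when flips dominate. The threshold, growth factor, cap, and function names below are all assumptions.

```python
def oscillation_score(values):
    """Fraction of consecutive changes whose sign flips (0 = monotone, 1 = pure zig-zag)."""
    diffs = [b - a for a, b in zip(values, values[1:]) if b != a]
    if len(diffs) < 2:
        return 0.0
    flips = sum(1 for d0, d1 in zip(diffs, diffs[1:]) if d0 * d1 < 0)
    return flips / (len(diffs) - 1)

def adjust_exploration(epsilon, recent_returns,
                       threshold=0.6, factor=1.5, max_epsilon=0.5):
    """Raise epsilon multiplicatively (up to a cap) when recent returns zig-zag."""
    if oscillation_score(recent_returns) > threshold:
        return min(epsilon * factor, max_epsilon)
    return epsilon

steady = [0.1, 0.2, 0.3, 0.4]
zigzag = [0.1, 0.9, 0.1, 0.9, 0.1]

eps_steady = adjust_exploration(0.1, steady)   # unchanged
eps_zigzag = adjust_exploration(0.1, zigzag)   # increased
eps_capped = adjust_exploration(0.4, zigzag)   # increased, then clipped at max_epsilon
```

Ordinary evaluation noise also produces sign flips, so in practice the monitored returns would likely need smoothing or a longer window before this score is trusted.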

Related Skills

[[gcpo-cooperative-policy-optimization]] — cooperative policy optimization replaces winner-takes-all
[[distributed-zeroth-order-marlh]] — distributed zeroth-order MARL

Activation Keywords

ARMS, automatic reward shaping, MARL, multi-agent reward design, sparse reward MARL, trajectory ranking, Nash equilibrium preservation, conditional best-response

Source

arXiv:2605.23562 — ARMS: Automatic Reward Shaping for Sparse-Reward Multi-Agent Reinforcement Learning