reinforcement-learning - SKILL.md Agent Skill

name: reinforcement-learning
description: This skill should be used when the user asks about "reinforcement learning", "RL", "reward function", "policy gradient", "PPO", "SAC", "DQN", "Q-learning", "actor-critic", "MDP", "Markov decision process", "environment design", "reward shaping", "exploration strategy", "experience replay", "multi-agent RL", "RLHF", "reward hacking", or when training an agent to interact with an environment to maximize cumulative reward.
version: 1.0.0

Reinforcement Learning — Full-Stack RL Engine

Provides the complete framework for designing, training, and evaluating reinforcement learning agents, from MDP formalization through policy optimization to deployment in production control systems.

MDP Framework

Every RL problem is formalized as a Markov Decision Process (S, A, P, R, γ):

Component	Description	Design Considerations
S (State Space)	Observation the agent receives	Ensure Markov property; normalize to [0,1] or z-score
A (Action Space)	Controls available to the agent	Discrete → DQN/PPO; Continuous → SAC/PPO
P (Transition)	Dynamics of the environment	Usually unknown, learned through interaction
R (Reward)	Signal encoding the objective	Most critical design decision; see Reward Shaping below
γ (Discount)	Future reward discount factor	0.99 for long-horizon tasks; 0.95 for shorter episodes
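
As a rough illustration of the discount guidance in the table, the sketch below computes a discounted return and the common 1/(1 − γ) "effective horizon" heuristic. The function names are hypothetical, and the heuristic is a rule of thumb, not a hard bound.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t], accumulated from the last step backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def effective_horizon(gamma):
    """Rule-of-thumb number of future steps that meaningfully affect the return."""
    return 1.0 / (1.0 - gamma)


# gamma = 0.99 weights roughly the next 100 steps; gamma = 0.95 roughly the next 20.
horizon_long = effective_horizon(0.99)
horizon_short = effective_horizon(0.95)
```

Under this heuristic, choosing γ amounts to choosing how far ahead the agent is asked to plan, which is why longer-horizon tasks call for values closer to 1.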

Algorithm Selection Guide

Scenario	Algorithm	Key Advantage
Discrete actions, off-policy, sample efficient	DQN (+ PER + Dueling)	Replay buffer enables high sample efficiency
Continuous control, off-policy	SAC (Soft Actor-Critic)	Maximum entropy RL; stable training, automatic α tuning
Discrete or continuous, on-policy, stable	PPO	Clipped objective prevents destructive policy updates
Partial observability, memory required	PPO + LSTM backbone	Recurrent policy handles non-Markovian observations
Human preference alignment	RLHF (PPO + reward model)	Reward learned from human preference comparisons rather than hand-engineered
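
To make the "clipped objective" entry for PPO concrete, here is a minimal pure-Python sketch of the clipped surrogate objective over a batch of samples. The function name and the default `clip_eps=0.2` are assumptions for illustration (the skill does not specify a clip range); a real implementation would operate on tensors and differentiate through `new_logp`.

```python
import math


def ppo_clip_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized).

    new_logp / old_logp: log-probabilities of the taken actions under the
    current and the data-collecting policy. clip_eps is an assumed default.
    """
    total = 0.0
    for nlp, olp, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(nlp - olp)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the minimum removes any incentive to push the ratio
        # outside [1 - clip_eps, 1 + clip_eps] in the improving direction.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With a positive advantage, a ratio above `1 + clip_eps` earns no extra objective value, so a single batch cannot pull the policy arbitrarily far from the one that collected the data.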

Reward Shaping

The reward function is the most consequential RL design decision:

Potential-based shaping (safe — guarantees policy invariance):

F(s, a, s') = γ·Φ(s') − Φ(s) where Φ is a potential function
Does not change the optimal policy of the original MDP (Ng et al., 1999)
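
A small sketch of why the shaping term above is safe: summed along any trajectory with discounting, the terms telescope to γ^T·Φ(s_T) − Φ(s_0), which depends only on the start and end states. The 1-D states and the distance-based potential below are hypothetical choices for illustration.

```python
def potential(state, goal):
    """Hypothetical potential: negative distance to the goal on a 1-D line."""
    return -abs(goal - state)


def shaping_term(s, s_next, goal, gamma):
    """F(s, a, s') = gamma * Phi(s') - Phi(s); the action does not appear."""
    return gamma * potential(s_next, goal) - potential(s, goal)


def discounted_shaping_sum(states, goal, gamma):
    """Discounted sum of shaping terms along a trajectory of visited states."""
    total = 0.0
    for t in range(len(states) - 1):
        total += gamma ** t * shaping_term(states[t], states[t + 1], goal, gamma)
    return total


# Two different routes from state 0 to the goal at 5 collect the same shaping total.
direct = discounted_shaping_sum([0, 4, 5], goal=5, gamma=0.9)
detour = discounted_shaping_sum([0, 1, 3, 2, 5], goal=5, gamma=0.9)
```

Because both routes receive the same total bonus, shaping of this form can speed up learning without changing which policy is optimal.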

Common shaping components:

Progress reward: proportional to movement toward the goal state
Constraint penalty: large negative reward for safety violations
Smoothness reward: penalize jerky or oscillatory actions (robotics)
Curiosity bonus: intrinsic motivation for exploration in sparse-reward tasks
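
The components listed above could be combined along these lines. This is only a sketch: the argument names, the weights, and the count-based curiosity bonus are all assumptions for illustration, and unlike potential-based shaping, additive terms like these are not guaranteed to preserve the optimal policy.

```python
import math
from collections import Counter


def shaped_reward(prev_dist, dist, action, prev_action, violated,
                  state_key, counts,
                  w_progress=1.0, w_smooth=0.1, penalty=10.0, w_curiosity=0.05):
    """Combine the four shaping components into one scalar (hypothetical weights)."""
    progress = w_progress * (prev_dist - dist)        # positive when moving closer
    constraint = -penalty if violated else 0.0        # large negative on violation
    smoothness = -w_smooth * abs(action - prev_action)  # penalize abrupt changes
    counts[state_key] += 1
    curiosity = w_curiosity / math.sqrt(counts[state_key])  # decays with revisits
    return progress + constraint + smoothness + curiosity


counts = Counter()
r = shaped_reward(prev_dist=2.0, dist=1.5, action=0.2, prev_action=0.2,
                  violated=False, state_key="cell_3", counts=counts)
```

Logging each component separately during training makes it easier to see which term the agent is actually optimizing.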

Red flags (reward hacking indicators):

Reward that can be maximized by a degenerate policy (e.g., agent finds a bug in the simulator)
Reward achievable without learning the intended behavior

See references/policy-gradient.md for PPO/SAC mathematical derivations and implementation details.
See references/value-based-methods.md for DQN, Double DQN, Dueling DQN, and Rainbow implementations.
See references/mdp-framework.md for environment design patterns and Gym interface specifications.

Training Stability Guidelines

Apply the following to any RL training run (GAE applies only to methods that estimate advantages, such as PPO):

Gradient clipping (max_grad_norm = 0.5): prevents catastrophic policy updates
Observation normalization: running mean/variance normalization across episodes
Reward normalization: divide rewards by a running standard deviation; do not subtract the mean, since shifting rewards can change their sign and alter the task
GAE (Generalized Advantage Estimation): λ = 0.95 for PPO — balances bias-variance in advantage estimates
Entropy regularization: encourages exploration; tune the coefficient, and consider decaying it as the policy converges
Multiple seeds: always report results across 5+ random seeds as mean ± standard error
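
Two of the guidelines above, running normalization and GAE, are sketched below in plain Python. The class and function names are hypothetical, the running statistics use Welford's online update, and a production version would work on batched arrays. Gradient clipping and entropy regularization depend on the training framework and are not shown.

```python
import math


class RunningStat:
    """Online mean and (population) standard deviation via Welford's update."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        # Fall back to 1.0 until there are enough samples to estimate a spread.
        return math.sqrt(self.m2 / self.n) if self.n > 1 else 1.0


def normalize_obs(x, stat, eps=1e-8):
    """Observation normalization: subtract the running mean, divide by the std."""
    return (x - stat.mean) / (stat.std + eps)


def scale_reward(r, stat, eps=1e-8):
    """Reward normalization: divide by the running std only; the mean is kept."""
    return r / (stat.std + eps)


def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    values[t] is V(s_t); last_value is V of the state after the final step;
    dones[t] is True when step t ended the episode.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * not_done - values[t]  # TD error
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    return advantages
```

Setting `lam=0` reduces each advantage to the one-step TD error (low variance, more bias), while `lam=1` recovers the Monte Carlo return minus the value baseline (less bias, higher variance).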

Evaluation Protocol

Run 100 deterministic evaluation episodes (ε = 0 for DQN, deterministic policy for SAC/PPO).
Report: mean reward, std, min/max, success rate (if applicable), episode length.
Compare against baselines: random policy, hand-coded heuristic, previous best agent.
Learning curve: episode reward vs. total environment steps (not wall-clock time).
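
The reporting steps above can be sketched with the standard library alone. The function names and dictionary keys are hypothetical. The per-seed helper follows the earlier guideline of reporting mean ± standard error across seeds.

```python
import math
import statistics


def summarize_episodes(returns, lengths, successes=None):
    """Summary statistics for one batch of deterministic evaluation episodes."""
    summary = {
        "mean_reward": statistics.fmean(returns),
        "std_reward": statistics.pstdev(returns),
        "min_reward": min(returns),
        "max_reward": max(returns),
        "mean_length": statistics.fmean(lengths),
    }
    if successes is not None:  # success rate only where the task defines success
        summary["success_rate"] = sum(successes) / len(successes)
    return summary


def mean_and_stderr(per_seed_scores):
    """Mean and standard error of a metric across independent seeds (needs 2+)."""
    n = len(per_seed_scores)
    mean = statistics.fmean(per_seed_scores)
    stderr = statistics.stdev(per_seed_scores) / math.sqrt(n)
    return mean, stderr
```

A typical use would call `summarize_episodes` once per seed on its 100 evaluation episodes, then pass the per-seed `mean_reward` values to `mean_and_stderr`, and apply the same procedure to each baseline.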